[Mauiusers] Problem with Torque/Maui
S Ranjan
sranjan at ipr.res.in
Wed Jan 24 18:15:53 MST 2007
Garrick Staples wrote:
>On Thu, Jan 25, 2007 at 05:37:56AM +0530, S Ranjan alleged:
>
>
>>Garrick Staples wrote:
>>
>>
>>
>>>On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:
>>>
>>>
>>>
>>>
>>>>Hi
>>>>
>>>>I have torque pbs_server running on the headnode, which is also the
>>>>submit host. There are 32 other compute nodes, mentioned in
>>>>/var/spool/torque/server_priv/nodes file. There is a single queue at
>>>>present. Sometimes, MPI jobs requesting 28/30 nodes end up
>>>>running on the head node, though the head node is not a compute node at
>>>>all. netstat -anp shows several sockets being opened for the job, and
>>>>eventually the head node hangs.
>>>>
>>>>Appreciate any help/suggestion on this.
>>>>
>>>>
>>>>
>>>>
>>>Which MPI? MPICH? I'd guess mpirun is using the default machinefile
>>>that is created when mpich is built, and not the hostlist provided by
>>>the PBS job.
>>>
>>>Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
>>>instead of mpirun: http://www.osc.edu/~pw/mpiexec/
>>>
>>>_______________________________________________
>>>mauiusers mailing list
>>>mauiusers at supercluster.org
>>>http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>>We are using Intel MPI 2.0. We are using mpiexec -n 28 ......
>>inside the PBS script.
>>However, we run mpdboot (an executable in the Intel MPI 2.0 binary
>>directory) before running the PBS script. The exact syntax being used is
>>
>>mpdboot -n 32 -f mpd.hosts --rsh=ssh -v
>>
>>mpd.hosts file, residing in the user's home directory, contains the
>>names of the 32 compute nodes (excluding the head node).
>>
>>
>
>There is your problem: you want to use the list of nodes assigned to
>your job. So you'll want something like this:
> np=$(wc -l < $PBS_NODEFILE)
> mpdboot -n $np -f $PBS_NODEFILE --rsh=ssh -v
>
>But I still recommend using OSC's mpiexec instead.
>
Actually for us, mpd.hosts and $PBS_NODEFILE contain the same set of
machines: the 32 compute nodes. However, since our head node is also
the submit host, mpdboot needs to be started with -n 33 instead
of -n 32.
This is because mpdboot starts on the head node anyway, so using -n 32
reduces the number of compute nodes by one: the mpdboot count
goes as <headnode> + 31 <compute nodes>.
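The counting described above can be sketched as a small shell helper. This is
only an illustration, assuming the node file may list a host more than once
(Torque writes one line per allocated processor) and that the host names in
the file match the name passed in exactly. mpd_count is a made-up name, not
part of Torque or Intel MPI.

```shell
#!/bin/sh
# mpd_count NODEFILE LOCALHOST
# Hypothetical helper: prints a value for "mpdboot -n", counting each host
# in NODEFILE once and adding one for LOCALHOST when it is not in the file.
mpd_count() {
    nodefile=$1
    local_name=$2
    # Count distinct, non-empty host names in the node file.
    hosts=$(sort -u "$nodefile" | grep -c .)
    # The host where mpdboot is invoked is counted too, so add one
    # if that host does not already appear in the list.
    if grep -qxF "$local_name" "$nodefile"; then
        echo "$hosts"
    else
        echo $((hosts + 1))
    fi
}
```

With a 32-line mpd.hosts that does not include the head node, running
mpd_count mpd.hosts "$(hostname)" on the head node would print 33, which is
the -n value mentioned above.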
I'll download OSC's mpiexec and try the -machinefile option. I presume
it is compatible with Intel MPI 2.0's mpiexec, because the dynamic
libraries loaded would be from Intel's library.
Thanks
Sutapa