<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7655.1">
<TITLE>RE: [torqueusers] Torque environment problem</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Hi,<BR>
<BR>
I have recompiled openmpi-1.4.3 with tm support. <BR>
<BR>
I have confirmed that it is available via:<BR>
<BR>
[rsvancara@node1 ~]$ ompi_info |grep plm<BR>
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)<BR>
MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)<BR>
MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)<BR>
<BR>
[rsvancara@node164 ~]$ ompi_info |grep plm<BR>
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)<BR>
MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)<BR>
MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)<BR>
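<BR>
The same check can be scripted so that a job fails early on a node whose build lacks the tm launcher. This is only a minimal sketch: the function name has_tm_plm is made up for illustration, and it assumes ompi_info prints lines in the "MCA plm: tm (...)" form shown above.<BR>

```shell
# Succeeds (exit status 0) if the ompi_info output read from stdin
# lists the tm launcher component; fails otherwise.
# has_tm_plm is a hypothetical helper name, not part of Open MPI.
has_tm_plm() {
    grep -Eq '^[[:space:]]*MCA plm: tm[[:space:]]'
}
```

It would be used as: ompi_info | has_tm_plm || echo "tm launcher missing on $(hostname)"<BR>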
<BR>
<BR>
When I launch jobs using openmpi, I have to use:<BR>
<BR>
-mca plm rsh<BR>
<BR>
If I set this to<BR>
<BR>
-mca plm tm<BR>
<BR>
Then no remote processes are launched. I do not mind using rsh; however, I would prefer to have torque "do the right thing" and just work. I am using torque version 2.4.7.<BR>
<BR>
Is this a torque/openmpi compatibility issue? Or is this how torque is supposed to work with openmpi? I thought torque would launch the remote processes and clean them up after.<BR>
<BR>
I would appreciate any suggestions. <BR>
<BR>
<BR>
<BR>
<BR>
<BR>
<BR>
<BR>
<BR>
-----Original Message-----<BR>
From: torqueusers-bounces@supercluster.org on behalf of Gustavo Correa<BR>
Sent: Sat 3/19/2011 5:48 PM<BR>
To: Torque Users Mailing List<BR>
Subject: Re: [torqueusers] Torque environment problem<BR>
<BR>
Hi Randall<BR>
<BR>
If you build OpenMPI with Torque support, mpiexec will use the<BR>
nodes and processors provided by Torque, and you don't<BR>
need to provide any hostfile whatsoever.<BR>
We've been using OpenMPI with Torque support for quite a while.<BR>
<BR>
To do so, you need to configure OpenMPI this way:<BR>
<BR>
./configure --prefix=/directory/to/install/openmpi --with-tm=/directory/where/you/installed/torque<BR>
<BR>
See the OpenMPI FAQ about this:<BR>
<A HREF="http://www.open-mpi.org/faq/?category=building#build-rte-tm">http://www.open-mpi.org/faq/?category=building#build-rte-tm</A><BR>
<BR>
Still, although your script to restore the "np=$NPROC" syntax is very clever,<BR>
I guess you could use $PBS_NODEFILE directly as your hostfile<BR>
when OpenMPI is not built with Torque support.<BR>
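<BR>
If a "slots=" style hostfile is still wanted, the per-node counts can be derived from the nodefile itself instead of being hard-coded. The following is only a sketch: nodefile_to_hostfile is a hypothetical helper name, and it assumes the nodefile has exactly one hostname per line with no blank lines.<BR>

```shell
# Print one "host slots=N" line per distinct host in a Torque-style
# nodefile, where N is the number of times that host is listed.
# nodefile_to_hostfile is a hypothetical helper name.
nodefile_to_hostfile() {
    # $1: path to the nodefile (one hostname per allocated slot)
    sort "$1" | uniq -c | awk '{ print $2 " slots=" $1 }'
}
```

Inside a job it would be called as: nodefile_to_hostfile "$PBS_NODEFILE" > "$HOME/nodes"<BR>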
<BR>
The issue with LD_LIBRARY_PATH may be in addition to the nodefile mismatch<BR>
problem you had.<BR>
OpenMPI requires both PATH and LD_LIBRARY_PATH to be set on all hosts<BR>
where the parallel program runs:<BR>
<BR>
<A HREF="http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path">http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path</A><BR>
<BR>
If your home directory is NFS mounted on your cluster,<BR>
the easy way to do it is to set both in your .bashrc/.cshrc file.<BR>
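<BR>
For bash, the .bashrc lines could look roughly like the sketch below. The install prefix is taken from the mpiexec path quoted further down in this thread; the lib subdirectory and the helper name prepend_path are assumptions, so adjust them to your installation.<BR>

```shell
# Prepend a directory to a colon-separated path list unless it is
# already present. $1: current value (may be empty), $2: directory.
# Prints the resulting value. prepend_path is a hypothetical helper.
prepend_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;     # already present: unchanged
        "::")     printf '%s\n' "$2" ;;     # empty list: just the directory
        *)        printf '%s\n' "$2:$1" ;;  # otherwise put it in front
    esac
}

# Assumed install prefix; the lib directory name is also an assumption.
OMPI_PREFIX=/usr/mpi/intel/openmpi-1.4.3
PATH=$(prepend_path "$PATH" "$OMPI_PREFIX/bin")
LD_LIBRARY_PATH=$(prepend_path "${LD_LIBRARY_PATH:-}" "$OMPI_PREFIX/lib")
export PATH LD_LIBRARY_PATH
```

The helper leaves a list untouched when the directory is already in it, so entries do not pile up when .bashrc is sourced more than once.<BR>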
<BR>
I hope this helps.<BR>
Gus Correa<BR>
<BR>
<BR>
On Mar 19, 2011, at 8:15 PM, Svancara, Randall wrote:<BR>
<BR>
> Hi<BR>
><BR>
> I did figure out the issue, or at least I am on the path to a solution.<BR>
><BR>
> I was assuming that when I submit a job via torque with the PBS directive #PBS -l nodes=12:ppn=12, the file named by PBS_NODEFILE would be a correctly formatted hosts file for openmpi.<BR>
><BR>
> What I am seeing is that torque will generate a hosts file that looks like this:<BR>
><BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node164<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> node163<BR>
> ....<BR>
><BR>
><BR>
> But from what I can see, openmpi expects a hostfile list like this:<BR>
><BR>
> node164 slots=12<BR>
> node163 slots=12<BR>
><BR>
> So what I had to do in my script is add the following code:<BR>
><BR>
> np=$(cat $PBS_NODEFILE | wc -l)<BR>
><BR>
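> # Write one "host slots=12" line per distinct node. The redirect sits after<BR>
> # "done" so the file holds every node, not only the last one.<BR>
> for i in `cat ${PBS_NODEFILE}|sort -u`; do<BR>
> echo $i slots=12<BR>
> done > /home/admins/rsvancara/nodes<BR>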
><BR>
> /usr/mpi/intel/openmpi-1.4.3/bin/mpiexec $RUN $MCA -np $np -hostfile /home/admins/rsvancara/nodes /home/admins/rsvancara/TEST/mpitest<BR>
><BR>
> I guess I was expecting openmpi to do the right thing but apparently torque and openmpi are not on the same page in terms of formatting for a hosts file. I am using version 2.4.7 of torque. Would newer versions of torque correctly generate a hosts file?<BR>
><BR>
> The strange thing is that openmpi simply tells me it may be a LD_LIBRARY_PATH problem, which seems rather vague. A better response would be "What the .... am I supposed to do with this hosts file you idiot, please format it correctly". <BR>
><BR>
> Best regards<BR>
><BR>
> Randall<BR>
><BR>
><BR>
><BR>
><BR>
><BR>
><BR>
><BR>
> -----Original Message-----<BR>
> From: torqueusers-bounces@supercluster.org on behalf of Shenglong Wang<BR>
> Sent: Fri 3/18/2011 8:45 PM<BR>
> To: Torque Users Mailing List<BR>
> Subject: Re: [torqueusers] Torque environment problem<BR>
><BR>
> Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec?<BR>
><BR>
> np=$(cat $PBS_NODEFILE | wc -l)<BR>
><BR>
> mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX<BR>
><BR>
> Best,<BR>
><BR>
> Shenglong<BR>
><BR>
><BR>
><BR>
><BR>
> On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:<BR>
><BR>
> > I just wanted to add that if I launch a job on one node, everything works fine. For example in my job script if I specify<BR>
> ><BR>
> ><BR>
> > #PBS -l nodes=1:ppn=12<BR>
> ><BR>
> > Then everything runs fine.<BR>
> ><BR>
> ><BR>
> > However, if I specify two nodes, then everything fails.<BR>
> ><BR>
> ><BR>
> > #PBS -l nodes=2:ppn=12<BR>
> ><BR>
> > This also fails<BR>
> ><BR>
> ><BR>
> > #PBS -l nodes=13<BR>
> ><BR>
> > But this does not:<BR>
> ><BR>
> ><BR>
> > #PBS -l nodes=12<BR>
> ><BR>
> > Thanks,<BR>
> ><BR>
> > Randall<BR>
> ><BR>
> > -----Original Message-----<BR>
> > From: torqueusers-bounces@supercluster.org on behalf of Svancara, Randall<BR>
> > Sent: Fri 3/18/2011 7:48 PM<BR>
> > To: torqueusers@supercluster.org<BR>
> > Subject: [torqueusers] Torque environment problem<BR>
> ><BR>
> ><BR>
> > Hi,<BR>
> ><BR>
> > We are in the process of setting up a new cluster. One issue I am experiencing is with openmpi jobs launched through torque.<BR>
> ><BR>
> > When I launch a simple job using a very basic mpi "Hello World" script I am seeing the following errors from openmpi:<BR>
> ><BR>
> > **************************<BR>
> ><BR>
> > [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002<BR>
> > --------------------------------------------------------------------------<BR>
> > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to<BR>
> > launch so we are aborting.<BR>
> ><BR>
> > There may be more information reported by the environment (see above).<BR>
> ><BR>
> > This may be because the daemon was unable to find all the needed shared<BR>
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the<BR>
> > location of the shared libraries on the remote nodes and this will<BR>
> > automatically be forwarded to the remote nodes.<BR>
> > --------------------------------------------------------------------------<BR>
> > --------------------------------------------------------------------------<BR>
> > mpirun noticed that the job aborted, but has no info as to the process<BR>
> > that caused that situation.<BR>
> > --------------------------------------------------------------------------<BR>
> > --------------------------------------------------------------------------<BR>
> > mpirun was unable to cleanly terminate the daemons on the nodes shown<BR>
> > below. Additional manual cleanup may be required - please refer to<BR>
> > the "orte-clean" tool for assistance.<BR>
> > --------------------------------------------------------------------------<BR>
> > node163 - daemon did not report back when launched<BR>
> > Completed executing:<BR>
> ><BR>
> > *************************<BR>
> ><BR>
> > However, when I launch a job by running mpirun with an explicit hostfile, everything seems to work fine using the following command:<BR>
> ><BR>
> > /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest<BR>
> ><BR>
> > The job runs on 24 nodes with 12 processes per node.<BR>
> ><BR>
> > I have verified that my .bashrc is working. I have tried to launch from an interactive job using qsub -I -l nodes=12:ppn=12 without any success. I am assuming this is an environment problem; however, I am unsure, as the openmpi error only says this "may" be the cause.<BR>
> ><BR>
> > My questions are:<BR>
> ><BR>
> > 1. Has anyone had this problem before? (I am sure they have.)<BR>
> > 2. How would I go about troubleshooting this problem?<BR>
> ><BR>
> ><BR>
> > I am using torque version 2.4.7.<BR>
> ><BR>
> > Thanks for any assistance anyone can provide.<BR>
> ><BR>
> ><BR>
> > _______________________________________________<BR>
> > torqueusers mailing list<BR>
> > torqueusers@supercluster.org<BR>
> > <A HREF="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</A><BR>
><BR>
><BR>
><BR>
<BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>