<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7655.1">
<TITLE>RE: [torqueusers] Torque environment problem</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<BR>
<P><FONT SIZE=2>Hi,<BR>
<BR>
Thanks for the quick reply. <BR>
<BR>
Here is my LD_LIBRARY_PATH:<BR>
<BR>
LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib<BR>
<BR>
I am using modules, so I am not sure if that is causing me any issues:<BR>
<BR>
. /home/software/Modules/default/init/bash<BR>
. /home/software/modulefiles/.defaultmodules<BR>
module add null intel/11.1.075 openmpi/1.4.3_intel<BR>
<BR>
I tried putting this into my .bashrc as well:<BR>
<BR>
LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib<BR>
<BR>
<BR>
It seems that .bashrc is read when I launch jobs via qsub. But when an MPI job spans more than one node, it fails to find the correct environment variables. When I run my mpitest without qsub, I can run on more than one node. So I do not understand what differs between running MPI through torque/qsub and running it from the standard command line. <BR>
<BR>
In addition I did attempt Shenglong's suggestions without any luck. <BR>
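<BR>
One way to compare the two environments might be a small script that is run on each allocated node. The following is only a sketch: the file name check_env.sh and the function check_ld_path are made up for illustration, and it assumes bash is available on the compute nodes. It prints the LD_LIBRARY_PATH that a process actually sees and lists any entries that do not exist on that host:<BR>
<PRE>
```shell
#!/bin/bash
# check_env.sh (hypothetical name): report the library path seen on this host.

# Print "missing: DIR" for every colon-separated entry in $1 that is not
# an existing directory on the host where the script runs.
check_ld_path() {
    local IFS=':'   # split the argument on colons only
    local dir
    for dir in $1; do
        [ -d "$dir" ] || echo "missing: $dir"
    done
}

echo "host: $(uname -n)"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-unset}"
check_ld_path "${LD_LIBRARY_PATH:-}"
```
</PRE>
Inside a job, something like "pbsdsh -u /path/to/check_env.sh" should run it once per allocated node through Torque (assuming the -u option is available in this Torque version; /path/to is a placeholder). Comparing that output with the same script run over a plain login to the node might show whether the variables differ between the two launch paths.<BR>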
<BR>
Thanks again<BR>
<BR>
<BR>
<BR>
<BR>
<BR>
-----Original Message-----<BR>
From: torqueusers-bounces@supercluster.org on behalf of Shenglong Wang<BR>
Sent: Fri 3/18/2011 8:45 PM<BR>
To: Torque Users Mailing List<BR>
Subject: Re: [torqueusers] Torque environment problem<BR>
<BR>
Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec?<BR>
<BR>
np=$(cat $PBS_NODEFILE | wc -l)<BR>
<BR>
mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX<BR>
<BR>
Best,<BR>
<BR>
Shenglong<BR>
<BR>
<BR>
<BR>
<BR>
On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:<BR>
<BR>
> I just wanted to add that if I launch a job on one node, everything works fine. For example in my job script if I specify<BR>
><BR>
><BR>
> #PBS -l nodes=1:ppn=12<BR>
><BR>
> Then everything runs fine.<BR>
><BR>
><BR>
> However, if I specify two nodes, then everything fails.<BR>
><BR>
><BR>
> #PBS -l nodes=2:ppn=12<BR>
><BR>
> This also fails<BR>
><BR>
><BR>
> #PBS -l nodes=13<BR>
><BR>
> But this does not:<BR>
><BR>
><BR>
> #PBS -l nodes=12<BR>
><BR>
> Thanks,<BR>
><BR>
> Randall<BR>
><BR>
> -----Original Message-----<BR>
> From: torqueusers-bounces@supercluster.org on behalf of Svancara, Randall<BR>
> Sent: Fri 3/18/2011 7:48 PM<BR>
> To: torqueusers@supercluster.org<BR>
> Subject: [torqueusers] Torque environment problem<BR>
><BR>
><BR>
> Hi,<BR>
><BR>
> We are in the process of setting up a new cluster. One issue I am experiencing is with openmpi jobs launched through torque.<BR>
><BR>
> When I launch a simple job using a very basic MPI "Hello World" program, I am seeing the following errors from openmpi:<BR>
><BR>
> **************************<BR>
><BR>
> [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002<BR>
> --------------------------------------------------------------------------<BR>
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to<BR>
> launch so we are aborting.<BR>
><BR>
> There may be more information reported by the environment (see above).<BR>
><BR>
> This may be because the daemon was unable to find all the needed shared<BR>
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the<BR>
> location of the shared libraries on the remote nodes and this will<BR>
> automatically be forwarded to the remote nodes.<BR>
> --------------------------------------------------------------------------<BR>
> --------------------------------------------------------------------------<BR>
> mpirun noticed that the job aborted, but has no info as to the process<BR>
> that caused that situation.<BR>
> --------------------------------------------------------------------------<BR>
> --------------------------------------------------------------------------<BR>
> mpirun was unable to cleanly terminate the daemons on the nodes shown<BR>
> below. Additional manual cleanup may be required - please refer to<BR>
> the "orte-clean" tool for assistance.<BR>
> --------------------------------------------------------------------------<BR>
> node163 - daemon did not report back when launched<BR>
> Completed executing:<BR>
><BR>
> *************************<BR>
><BR>
> However, when I launch a job by running mpiexec directly, everything seems to work fine using the following command:<BR>
><BR>
> /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest<BR>
><BR>
> The job runs on 24 nodes with 12 processes per node.<BR>
><BR>
> I have verified that my .bashrc is working. I have tried to launch from an interactive job using qsub -I -l nodes=12:ppn=12, without any success. I am assuming this is an environment problem; however, I am unsure, as the openmpi error only says "MAY". <BR>
><BR>
> My questions are:<BR>
><BR>
> 1. Has anyone had this problem before? (I am sure they have.)<BR>
> 2. How would I go about troubleshooting this problem?<BR>
><BR>
><BR>
> I am using torque version 2.4.7.<BR>
><BR>
> Thanks for any assistance anyone can provide.<BR>
><BR>
><BR>
> _______________________________________________<BR>
> torqueusers mailing list<BR>
> torqueusers@supercluster.org<BR>
> <A HREF="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</A><BR>
<BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>