<HTML>
<HEAD>
<TITLE>Re: [torqueusers] Re: Newbie torque script questions</TITLE>
</HEAD>
<BODY>
<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:14.0px'>Dave,<BR>
<BR>
Try this in your pbs_script:<BR>
<BR>
#PBS -l nodes=n10:ppn=4+n11:ppn=4+n12:ppn=4+n13:ppn=4<BR>
<BR>
Make sure your $PBS_HOME/server_priv/nodes file looks like this:<BR>
<BR>
n10 np=4<BR>
n11 np=4<BR>
..<BR>
..<BR>
<BR>
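As a rough sketch ( not taken from your script ), the same node request can also be assembled in the shell and passed on the qsub command line. The helper name build_nodes_spec is made up for illustration, and pbs_script stands in for your job script:<BR>
<BR>
<PRE>
```shell
# build_nodes_spec PPN HOST...
# Prints a Torque nodes spec such as n10:ppn=4+n11:ppn=4 (hypothetical helper).
build_nodes_spec() {
    bns_ppn=$1
    shift
    bns_spec=""
    for bns_host in "$@"; do
        # Join entries with "+", adding the separator only after the first one.
        bns_spec="${bns_spec:+$bns_spec+}${bns_host}:ppn=${bns_ppn}"
    done
    printf '%s\n' "$bns_spec"
}

# Show the qsub command line for four named nodes with 4 processors each.
spec=$(build_nodes_spec 4 n10 n11 n12 n13)
echo "qsub -l nodes=$spec pbs_script"
```
</PRE>
For the four nodes above this prints the same spec as the -l line, so either form asks for the same allocation.<BR>
<BR>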
<BR>
Just a follow-up. Do you want 4 nodes with 4 processors each, but to use only 1 processor per node? Your original mpirun line asks for only 4 processors ( -np 4 ), which a single node such as n10 already has.<BR>
<BR>
If you want to use all processors on all 4 nodes, you would want to use -np 16.<BR>
<BR>
-nolocal tells mpirun not to run processes on the node where the job script starts, the controlling pbs_mom ( n10 in this scenario ), so you are really only getting 12 of the 16 processors.<BR>
<BR>
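To make the processor arithmetic above concrete, here is a minimal sketch that counts entries in the node file Torque hands the job. It assumes $PBS_NODEFILE has one line per allocated processor and that its first entry is the node running the script; count_procs and count_remote_procs are illustrative names, not Torque commands:<BR>
<BR>
<PRE>
```shell
# count_procs FILE
# Counts non-empty lines, assumed to be one per allocated processor.
count_procs() {
    grep -c . "$1"
}

# count_remote_procs FILE HOST
# Counts entries that are not exactly HOST, i.e. what is left for mpirun
# when -nolocal skips the node running the script.
count_remote_procs() {
    grep . "$1" | grep -v -c -x -F "$2"
}

# Only meaningful inside a Torque job, where $PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Assumption: the first entry names the node running this script.
    first_host=$(head -n 1 "$PBS_NODEFILE")
    echo "all processors: -np $(count_procs "$PBS_NODEFILE")"
    echo "with -nolocal:  -np $(count_remote_procs "$PBS_NODEFILE" "$first_host")"
fi
```
</PRE>
For a 4 node, ppn=4 allocation the first count is 16 and the second is 12, matching the numbers above.<BR>
<BR>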
My other suggestion is to build Pete Wyckoff’s mpiexec and use it in place of mpirun. It has many advantages ( simpler usage, different flags, tight integration with the Torque job spawn, etc. ).<BR>
<a href="http://www.osc.edu/~pw/mpiexec/index.php">http://www.osc.edu/~pw/mpiexec/index.php</a><BR>
<BR>
<BR>
<BR>
Jerry Smith<BR>
-----------------------------------<BR>
Sandia National Labs<BR>
Infrastructure Computing Systems<BR>
<BR>
<BR>
<HR ALIGN=CENTER SIZE="3" WIDTH="95%"><B>From: </B>dave first &lt;linux4dave@gmail.com&gt;<BR>
<B>Date: </B>Wed, 6 Dec 2006 09:32:47 -0800<BR>
<B>To: </B>&lt;torqueusers@supercluster.org&gt;<BR>
<B>Subject: </B>[torqueusers] Re: Newbie torque script questions<BR>
<BR>
New datapoint - I ran the job with a 2 minute sleep, and found the job running only on n04, as qstat -f said it would be.<BR>
<BR>
Why wouldn't qsub honor my local node list?<BR>
<BR>
dave<BR>
<BR>
On 12/6/06, <B>dave first</B> &lt;<a href="mailto:linux4dave@gmail.com">linux4dave@gmail.com</a>&gt; wrote:<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:14.0px'>I am such a newbie that I squeak. I hope this is the correct forum in which to ask this question. <BR>
<BR>
I want to specify a nodelist other than that which would be $PBS_NODEFILE. I want to specify n10, n11, n12 and n13, each with 4 processors. The node list looks something like this:<BR>
<BR>
n10:4<BR>
n11:4<BR>
n12:4<BR>
n13:4<BR>
<BR>
And it is called local_nodelist in the working directory. <BR>
<BR>
The script sets PBS_NODEFILE=`pwd`/local_nodelist<BR>
<BR>
qstat -f while running the script elicits what seems to be an erroneous nodelist <BR>
<BR>
Job Id: 76.excalibur<BR>
Job_Name = pbs_mpich.<BR>
Job_Owner = joeb@excalibur.example.com<BR>
resources_used.cput = 00:00:00<BR>
resources_used.mem = 4296kb<BR>
resources_used.vmem = 175988kb<BR>
resources_used.walltime = 00:00:12<BR>
job_state = R<BR>
queue = default <BR>
server = excalibur.example.com<BR>
Checkpoint = u<BR>
ctime = Wed Dec 6 08:54:16 2006<BR>
Error_Path = excalibur.example.com:/home/joeb/pbs_mpich..e76<BR>
exec_host = n04/0<BR>
Hold_Types = n<BR>
Join_Path = n<BR>
Keep_Files = n<BR>
Mail_Points = a<BR>
mtime = Wed Dec 6 08:54:17 2006<BR>
Output_Path = excalibur.example.com:/home/joeb/pbs_mpich..o76<BR>
Priority = 0<BR>
qtime = Wed Dec 6 08:54:16 2006<BR>
Rerunable = True<BR>
Resource_List.nodect = 1<BR>
Resource_List.nodes = 1<BR>
session_id = 31725<BR>
Variable_List = PBS_O_HOME=/home/joeb,PBS_O_LANG=en_US.UTF-8, <BR>
PBS_O_LOGNAME=joeb,<BR>
PBS_O_PATH=/opt/torque/bin:/opt/bin:/opt/hdfview/bin:/opt/hdf/bin:/opt<BR>
/ncarg/bin:/opt/mpich/p4-gnu/bin:/opt/mpiexec//bin:/usr/kerberos/bin:/o<BR>
pt/java/jdk1.5.0/bin:/usr/lib64/ccache/bin:/usr/local/bin:/bin:/usr/bin <BR>
:/usr/X11R6/bin:/opt/java/jdk1.5.0/jre/bin:/opt/visit/bin:/home/joeb/bi<BR>
n:/opt/mpich/p4-gnu/sbin,PBS_O_MAIL=/var/spool/mail/joeb<BR>
PBS_O_SHELL=/bin/bash,PBS_O_HOST=excalibur.example.com,<BR>
PBS_O_WORKDIR=/home/joeb,PBS_O_QUEUE=default<BR>
comment = Job started on Wed Dec 06 at 08:54<BR>
etime = Wed Dec 6 08:54:16 2006<BR>
--------------------------------------------------------------------------------- <BR>
<BR>
However, the script output looks like this:<BR>
<BR>
Job ID: 76.excalibur.example.com<BR>
Working directory is /home/joeb<BR>
Running on host n04.example.com<BR>
Time is Wed Dec 6 08:54:17 PST 2006<BR>
Directory is /home/joeb<BR>
The node file is /net/fs/home/joeb/local_nodefile <BR>
This job runs on the following processors:<BR>
n09.example.com:4 n10.example.com:4 n11.example.com:4 n12.example.com:4<BR>
This job has allocated 4 nodes/processors.<BR>
<BR>
/usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -nolocal -np 4 -machinefile /net/fs/home/joeb/local_nodefile /usr/local/bin/mpich/p4-gnu/examples/cpi<BR>
<BR>
pi is approximately 3.1416009869231249, Error is 0.0000083333333318<BR>
wall clock time = 0.003906<BR>
--------------------------------------------------------------------------------- <BR>
<BR>
Can anyone explain why the output of qstat -f and the script echo statements differ, and how can I determine which is correct? (Short of sleeping for a while while I look for all the processes?) <BR>
<BR>
Thanks,<BR>
dave<BR>
<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:14.0px'><BR>
<BR>
<HR ALIGN=CENTER SIZE="3" WIDTH="95%"></SPAN></FONT><SPAN STYLE='font-size:14.0px'><FONT FACE="Monaco, Courier New">_______________________________________________<BR>
torqueusers mailing list<BR>
torqueusers@supercluster.org<BR>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</a><BR>
</FONT></SPAN>
</BODY>
</HTML>