<div>Hi,</div><div><br></div><div>I'm using maui-3.2.6p21-snap.1252608389 and torque-2.4.2.</div><div>I have 2 nodes, each with 4 CPUs.</div><div><br></div><div>$ cat pbs.sh</div><div><div><div>#PBS -l nodes=2:ppn=4</div>
<div>#PBS -N hpl-8cpus</div><div>#PBS -j oe</div><div><br></div><div>cd /home/admin/hpl/hpl-2.0-openmpi</div><div><br></div><div>cat $PBS_NODEFILE</div><div><br></div><div>NP=`wc -l $PBS_NODEFILE | awk '{ print $1 }'`</div>
<div><br></div><div>cat $PBS_NODEFILE | awk '{ print $1"-clust" }' > ./machines</div><div><br></div><div>cd /home/admin/hpl/hpl-2.0-openmpi</div><div>/usr/mpi/gcc/openmpi-1.3.2/bin/mpirun -np $NP -machinefile ./machines ./bin/core2-goto-openmpi/xhpl</div>
<div><br></div><div><div>$ checkjob -v 30</div><div><br></div><div><br></div><div>checking job 30 (RM job '30.mgmt.v5cluster.com')</div><div><br></div><div>State: Running</div>
<div>Creds: user:admin group:admin class:batch qos:DEFAULT</div><div>WallTime: 00:00:00 of 1:00:00</div><div>SubmitTime: Sat Nov 7 11:00:24</div><div> (Time Queued Total: 00:00:01 Eligible: 00:00:01)</div><div><br>
</div><div>StartTime: Sat Nov 7 11:00:25</div><div>Total Tasks: 8</div><div><br></div><div>Req[0] TaskCount: 8 Partition: DEFAULT</div><div>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0</div><div>Opsys: [NONE] Arch: [NONE] Features: [NONE]</div>
<div>Exec: '' ExecSize: 0 ImageSize: 0</div><div>Dedicated Resources Per Task: PROCS: 1</div><div>Utilized Resources Per Task: [NONE]</div><div>Avg Util Resources Per Task: [NONE]</div><div>Max Util Resources Per Task: [NONE]</div>
<div>NodeAccess: SHARED</div><div>TasksPerNode: 4 NodeCount: 2</div><div>Allocated Nodes:</div><div>[node0002:4][node0001:4]</div><div>Task Distribution: node0002,node0002,node0002,node0002,node0001,node0001,node0001,node0001</div>
<div><br></div><div><br></div><div>IWD: [NONE] Executable: [NONE]</div><div>Bypass: 0 StartCount: 1</div><div>PartitionMask: [ALL]</div><div>Flags: RESTARTABLE</div><div><br></div><div>Reservation '30' (00:00:00 -> 1:00:00 Duration: 1:00:00)</div>
<div>PE: 8.00 StartPriority: 1</div><div><br></div></div><div><br></div><div>$ qstat -f</div><div><div>Job Id: 30.mgmt.v5cluster.com</div><div> Job_Name = hpl-8cpus</div><div>
Job_Owner = admin@mgmt.v5cluster.com</div><div> job_state = R</div><div> queue = batch</div><div> server = mgmt.v5cluster.com</div>
<div> Checkpoint = u</div><div> ctime = Sat Nov 7 11:00:24 2009</div><div> Error_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus.</div><div> e30</div><div> exec_host = node0001/3+node0001/2+node0001/1+node0001/0</div>
<div> Hold_Types = n</div><div> Join_Path = oe</div><div> Keep_Files = n</div><div> Mail_Points = a</div><div> mtime = Sat Nov 7 11:00:25 2009</div><div> Output_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus</div>
<div> .o30</div><div> Priority = 0</div><div> qtime = Sat Nov 7 11:00:24 2009</div><div> Rerunable = True</div><div> Resource_List.nodect = 2</div><div> Resource_List.nodes = 2:ppn=4</div><div> Resource_List.walltime = 01:00:00</div>
<div> session_id = 16102</div><div> Variable_List = PBS_O_HOME=/home/admin,PBS_O_LANG=en_US.UTF-8,</div><div> PBS_O_LOGNAME=admin,</div><div> PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin</div>
<div> :/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/sbi</div><div> n:/usr/bin:/root/bin:/usr/sbin:/usr/bin,</div><div> PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,</div>
<div> PBS_O_HOST=mgmt.v5cluster.com,PBS_SERVER=mgmt.v5cluster.com,</div><div> PBS_O_WORKDIR=/home/admin/hpl/hpl-2.0-openmpi,PBS_O_QUEUE=batch</div>
<div> etime = Sat Nov 7 11:00:24 2009</div><div> submit_args = pbs.sh</div><div> start_time = Sat Nov 7 11:00:25 2009</div><div> start_count = 1</div><div> fault_tolerant = False</div><div><br></div></div>
<div><br></div><div>Below is the maui.log.</div><div><br></div><div><div>11/07 11:11:29 INFO: connect request from 11.1.0.1</div><div>11/07 11:11:29 INFO: received service request from host 'mgmt.v5cluster.com'</div>
<div>11/07 11:11:29 MSURecvPacket(9,BufP,4,NULL,100000,SC)</div><div>11/07 11:11:31 ServerProcessRequests()</div><div>11/07 11:11:31 INFO: not rolling logs (5304 < 10000000)</div><div>11/07 11:11:31 MResAdjust(NULL,0,0)</div>
<div>11/07 11:11:31 MStatInitializeActiveSysUsage()</div><div>11/07 11:11:31 MStatClearUsage([NONE],Active)</div><div>11/07 11:11:31 ServerUpdate()</div><div>11/07 11:11:31 MSysUpdateTime()</div><div>11/07 11:11:31 INFO: starting iteration 77</div>
<div>11/07 11:11:31 MRMGetInfo()</div><div>11/07 11:11:31 MClusterClearUsage()</div><div>11/07 11:11:31 MRMClusterQuery()</div><div>11/07 11:11:31 MPBSClusterQuery(base,RCount,SC)</div><div>11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)</div>
<div>11/07 11:11:31 INFO: PBS node node0001 set to state Idle (free)</div><div>11/07 11:11:31 MPBSNodeUpdate(node0001,node0001,Idle,base)</div><div>11/07 11:11:31 MPBSLoadQueueInfo(base,node0001,SC)</div><div>11/07 11:11:31 INFO: queue 'batch' started state set to True</div>
<div>11/07 11:11:31 INFO: class to node not mapping enabled for queue 'batch' adding class to all nodes</div><div>11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)</div><div>11/07 11:11:31 INFO: PBS node node0002 set to state Idle (free)</div>
<div>11/07 11:11:31 MPBSNodeUpdate(node0002,node0002,Idle,base)</div><div>11/07 11:11:31 MPBSLoadQueueInfo(base,node0002,SC)</div><div>11/07 11:11:31 INFO: queue 'batch' started state set to True</div><div>11/07 11:11:31 INFO: class to node not mapping enabled for queue 'batch' adding class to all nodes</div>
<div>11/07 11:11:31 INFO: 2 PBS resources detected on RM base</div><div>11/07 11:11:31 INFO: resources detected: 2</div><div>11/07 11:11:31 MRMWorkloadQuery()</div><div>11/07 11:11:31 MPBSWorkloadQuery(base,JCount,SC)</div>
<div>11/07 11:11:31 MPBSJobLoad(31,31.mgmt.v5cluster.com,J,TaskList,0)</div><div>11/07 11:11:31 MReqCreate(31,SrcRQ,DstRQ,DoCreate)</div><div>11/07 11:11:31 INFO: processing node request line '2:ppn=4'</div>
<div>11/07 11:11:31 MJobSetCreds(31,admin,admin,)</div><div>11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])</div><div>11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])</div>
<div>11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])</div><div>11/07 11:11:31 INFO: job '31' loaded: 8 admin admin 3600 Idle 0 1257563489 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1257563491</div>
<div>11/07 11:11:31 INFO: 1 PBS jobs detected on RM base</div><div>11/07 11:11:31 INFO: jobs detected: 1</div><div>11/07 11:11:31 MStatClearUsage(node,Active)</div><div>11/07 11:11:31 MClusterUpdateNodeState()</div>
<div>11/07 11:11:31 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)</div><div>11/07 11:11:31 INFO: job '31' Priority: 1</div><div>11/07 11:11:31 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</div>
<div>11/07 11:11:31 MStatClearUsage([NONE],Active)</div><div>11/07 11:11:31 INFO: total jobs selected (ALL): 1/1</div><div>11/07 11:11:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)</div><div>11/07 11:11:31 INFO: job '31' Priority: 1</div>
<div>11/07 11:11:31 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</div><div>11/07 11:11:31 MStatClearUsage([NONE],Idle)</div>
<div>11/07 11:11:31 INFO: total jobs selected (ALL): 1/1</div><div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)</div><div>11/07 11:11:31 INFO: total jobs selected in partition ALL: 1/1</div>
<div>11/07 11:11:31 MQueueScheduleRJobs(Q)</div><div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)</div><div>11/07 11:11:31 INFO: total jobs selected in partition ALL: 1/1</div>
<div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)</div><div>11/07 11:11:31 INFO: total jobs selected in partition DEFAULT: 1/1</div><div>11/07 11:11:31 MQueueScheduleIJobs(Q,DEFAULT)</div>
<div>11/07 11:11:31 INFO: 8 feasible tasks found for job 31:0 in partition DEFAULT (8 Needed)</div><div>11/07 11:11:31 INFO: tasks located for job 31: 8 of 8 required (8 feasible)</div><div>11/07 11:11:31 MJobStart(31)</div>
<div>11/07 11:11:31 MJobDistributeTasks(31,base,NodeList,TaskMap)</div><div>11/07 11:11:31 MAMAllocJReserve(31,RIndex,ErrMsg)</div><div>11/07 11:11:31 MRMJobStart(31,Msg,SC)</div><div>11/07 11:11:31 MPBSJobStart(31,base,Msg,SC)</div>
<div>11/07 11:11:31 INFO: job '31' successfully started</div><div>11/07 11:11:31 MStatUpdateActiveJobUsage(31)</div><div>11/07 11:11:31 MResJCreate(31,MNodeList,00:00:00,ActiveJob,Res)</div><div>11/07 11:11:31 INFO: starting job '31'</div>
<div>11/07 11:11:31 INFO: 1 jobs started on iteration 77</div><div>Active Jobs------</div><div>------------------</div><div>11/07 11:11:31 INFO: resources available after scheduling: N: 0 P: 0</div><div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)</div>
<div>11/07 11:11:31 INFO: total jobs selected in partition DEFAULT: 0/1 [State: 1]</div><div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)</div><div>11/07 11:11:31 INFO: total jobs selected in partition ALL: 0/1 [State: 1]</div>
<div>11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)</div><div>11/07 11:11:31 INFO: total jobs selected in partition ALL: 0/1 [State: 1]</div><div>11/07 11:11:31 MSchedUpdateStats()</div>
<div>11/07 11:11:31 INFO: iteration: 77 scheduling time: 0.008 seconds</div><div>11/07 11:11:31 MResUpdateStats()</div><div>11/07 11:11:31 INFO: current util[77]: 2/2 (100.00%) PH: 0.88% active jobs: 1 of 2 (completed: 29)</div>
<div>11/07 11:11:31 MQueueCheckStatus()</div><div>11/07 11:11:31 MNodeCheckStatus()</div><div>11/07 11:11:31 MUClearChild(PID)</div><div>11/07 11:11:31 INFO: scheduling complete. sleeping 30 seconds</div><div><br></div>
<div><br></div><div><br></div></div><div><br></div><div>The checkjob output shows the allocated nodes correctly (4 tasks each on node0001 and node0002), so Maui seems to be working correctly.</div></div><div><div>However, exec_host and $PBS_NODEFILE show only 4 CPUs, all on the same node.</div>
<div>Is this a Torque problem?</div><div><br></div><div>I've tried adding "JOBNODEMATCHPOLICY EXACTNODE" and "ENABLEMULTIREQJOBS TRUE" to maui.cfg, but it did not help.</div><div><br></div><div>
<div>Does anyone know how to solve this? Any suggestions are appreciated.</div><div><br></div></div><div>Thanks.</div><div><br></div></div><div><br></div></div><br>-- <br>Best Regards,<br>PN Lai<br><br>