<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi,<br>
<div class="moz-forward-container"> <br>
I have installed a mini test cluster with torque and maui. We have
used maui/torque for years on our grid cluster and now we are
upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately, with this
new combination maui doesn't seem to work correctly: when I submit
jobs, it behaves as if there weren't any free resources. Even
when I installed only torque and maui with a bare minimum
configuration, I got the same behaviour, i.e.<br>
<br>
1) When I submit jobs, they just remain queued<br>
<br>
<small><i><small><i>[root@</i></small><small><i><server>
maui]# </i></small>qstat -an1</i><i><br>
</i><i><br>
</i><i><server>: </i><i><br>
</i><i>
Req'd Req'd Elap</i><i><br>
</i><i>Job ID Username Queue Jobname
SessID NDS TSK Memory Time S Time</i><i><br>
</i><i>-------------------- -------- -------- ----------------
------ ----- --- ------ ----- - -----</i><i><br>
</i><i>10.<server> aforti long
pbs-vm3.sh -- -- -- -- -- Q -- -- </i><i><br>
</i><i>11.s<server> aforti long
pbs-vm3.sh -- -- -- -- -- Q -- -- </i></small><br>
<br>
2) If I run qrun <jobid> the job runs, so I assume the
problem is not between the torque server and the torque mom.<br>
3) On the old versions, showq displayed the WCLimit of
the default queue; now it displays 0 at first and then changes
it by itself to 100 days<br>
<br>
<small><i>showq<br>
ACTIVE JOBS--------------------<br>
JOBNAME USERNAME STATE PROC
REMAINING STARTTIME<br>
<br>
<br>
0 Active Jobs 0 of 16 Processors Active (0.00%)<br>
0 of 1 Nodes Active (0.00%)<br>
<br>
IDLE JOBS----------------------<br>
JOBNAME USERNAME STATE PROC
WCLIMIT QUEUETIME<br>
<br>
2 aforti Idle 1 99:23:59:59 Wed
Oct 10 13:36:34<br>
3 aforti Idle 1 99:23:59:59 Wed
Oct 10 14:01:43<br>
4 aforti Idle 1 99:23:59:59 Wed
Oct 10 18:50:14<br>
5 aforti Idle 1 00:00:00 Wed
Oct 10 20:29:27<br>
<br>
4 Idle Jobs<br>
<br>
BLOCKED JOBS----------------<br>
JOBNAME USERNAME STATE PROC
WCLIMIT QUEUETIME<br>
<br>
<br>
Total Jobs: 4 Active Jobs: 0 Idle Jobs: 4 Blocked Jobs:
0<br>
</i></small><small><i><br>
</i><i><br>
</i><i>Total Jobs: 2 Active Jobs: 0 Idle Jobs: 2 Blocked
Jobs: 0</i><i><br>
</i></small><br>
4) checkjob <jobid> just tells me the job cannot be selected for
the default partition, without giving any particular reason<br>
<br>
<small><i>[.....]<br>
PE: 1.00 StartPriority: 120</i><i><br>
</i><i>cannot select job 10 for partition DEFAULT (Class)</i></small><br>
<br>
5) checknode sees the node as free, in case that wasn't clear from the other
commands<br>
<br>
<small><i>[root@</i></small><small><i><server> maui]#
!checkno</i><i><br>
</i><i>checknode <node></i><i><br>
</i><i><br>
</i><i>checking node <node></i><i><br>
</i><i><br>
</i><i>State: Idle (in current state for 00:55:10)</i><i><br>
</i><i>Configured Resources: PROCS: 16 MEM: 23G SWAP: 31G
DISK: 1M</i><i><br>
</i><i>Utilized Resources: SWAP: 202M</i><i><br>
</i><i>Dedicated Resources: [NONE]</i><i><br>
</i><i>Opsys: linux Arch: [NONE]</i><i><br>
</i><i>Speed: 1.00 Load: 0.000</i><i><br>
</i><i>Network: [DEFAULT]</i><i><br>
</i><i>Features: [lcgpro]</i><i><br>
</i><i>Attributes: [Batch]</i><i><br>
</i><i>Classes: [DEFAULT 1:1]</i><i><br>
</i><i><br>
</i><i>Total Time: 3:06:35 Up: 3:06:24 (99.90%) Active:
00:00:10 (0.09%)</i><i><br>
</i><i><br>
</i><i>Reservations:</i><i><br>
</i><i>NOTE: no reservations on node</i></small><br>
<br>
6) When I use showbf -v, though, it says my nodes are blocked by
reservations, despite checknode clearly telling me there are no
reservations on that node. In our local maui.cfg there is a
reservation for 1 proc; I'm not sure why it blocks the whole node. <br>
<br>
<small><i>[root@</i></small><small><i><server2>
server_logs]# showbf -v</i><i><br>
</i><i>backfill window (user: 'root' group: 'root' partition:
ALL) Tue Oct 9 17:08:59</i><i><br>
</i><i><br>
</i><i> 3 procs available with no timelimit</i><i><br>
</i><i><br>
</i><i>node <node2> is blocked by reservation sft.0.0 in
INFINITY</i><i><br>
</i><big><br>
But to be sure, I removed it. Even with the reservation removed
and maui.cfg reduced to the default version
with nothing in it, it tells me the node is blocked by
"reservation NONE in INFINITY"<br>
<br>
<small><i>[root@</i></small></big></small><small><big><small><i><server>
maui]# showbf -v</i><i><br>
</i><i>backfill window (user: 'root' group: 'root'
partition: ALL) Tue Oct 9 17:37:58</i><i><br>
</i><i><br>
</i><i> 16 procs available with no timelimit</i><i><br>
</i><i><br>
</i><i>node <node> is blocked by reservation NONE in
INFINITY</i><i><br>
</i><big></big></small></big></small><br>
<small><big><small><big>If I increase the maui loglevel to 9 I
get hundreds of these messages<br>
<br>
<small><i>10/10 13:37:39 MRMCheckEvents()</i><i><br>
</i><i>10/10 13:37:39 INFO: no PBS sched socket
connections ready</i><i><br>
</i><i>10/10 13:37:39
MSUAcceptClient(6,ClientSD,HostName,TCP)</i><i><br>
</i><i>10/10 13:37:39 INFO: accept call failed,
errno: 11 (Resource temporarily unavailable)</i><i><br>
</i><i>10/10 13:37:39 INFO: all clients connected.
servicing requests</i><i><br>
</i></small> <br>
which leaves me perplexed, since elsewhere, with a
different log level, it sees the jobs waiting on the server,
so somehow some communication happens and some doesn't<br>
<br>
<small><i>10/10 20:27:24 INFO: job '2' Priority:
410</i><i><br>
</i><i>10/10 20:27:24 INFO: Cred: 0(00.0)
FS: 0(00.0) Attr: 0(00.0) Serv:
410(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)</i><i><br>
</i><i>10/10 20:27:24 INFO: job '2' priority:
410.30</i><i><br>
</i><i>10/10 20:27:24
MJobGetStartPriority(3,0,Priority,NULL)</i><i><br>
</i><i>10/10 20:27:24 INFO: job '3' Priority:
385</i><i><br>
</i><i>10/10 20:27:24 INFO: Cred: 0(00.0)
FS: 0(00.0) Attr: 0(00.0) Serv:
385(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)</i><i><br>
</i><i>10/10 20:27:24 INFO: job '3' priority:
385.30</i><i><br>
</i><i>10/10 20:27:24
MJobGetStartPriority(4,0,Priority,NULL)</i><i><br>
</i><i>10/10 20:27:24 INFO: job '4' Priority:
97</i><i><br>
</i><i>10/10 20:27:24 INFO: Cred: 0(00.0)
FS: 0(00.0) Attr: 0(00.0) Serv:
97(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)</i><i><br>
</i><i>10/10 20:27:24 INFO: job '4' priority:
97.17</i><i><br>
</i></small><br>
Thanks for any help. Here are the RPMs I used:<br>
<br>
<small><i>maui-3.3-4.el5</i><i><br>
</i><i>maui-client-3.3-4.el5</i><i><br>
</i><i>maui-server-3.3-4.el5</i><i><br>
</i><i>torque-2.5.7-7.el5</i><i><br>
</i><i>torque-client-2.5.7-7.el5</i><i><br>
</i><i>torque-server-2.5.7-7.el5</i></small></big></small></big><i><br>
</i><i><small><big><small><big>libtorque-2.5.7-7.el5</big></small></big></small></i></small><big><small><i><br>
</i></small><br>
The maui.cfg:<br>
<br>
<i><small><small># <br>
# MAUI configuration example<br>
# @(#)maui.cfg David Groep 20031015.1<br>
# for MAUI version 3.2.5<br>
# <br>
SERVERHOST <server></small></small></i></big><br>
<big><i><small><small>ADMIN1 root <br>
ADMINHOST <server></small></small></i></big><br>
<big><i><small><small>RMTYPE[0] PBS<br>
RMHOST[0] <server></small></small></i></big><br>
<big><i><small><small>RMSERVER[0] <server></small></small></i></big><br>
<big><i><small><small><br>
SERVERPORT 40559<br>
SERVERMODE NORMAL<br>
<br>
# Set PBS server polling interval. Since we have many
short jobs<br>
# and want fast turn-around, set this to 10 seconds
(default: 2 minutes)<br>
RMPOLLINTERVAL 00:00:10<br>
<br>
# a max. 10 MByte log file in a logical location<br>
LOGFILE /var/log/maui.log<br>
LOGFILEMAXSIZE 10000000<br>
LOGLEVEL 3</small></small></i><br>
<br>
</big>and the Torque config:<br>
<br>
<small><i>create queue long</i><i><br>
</i><i>set queue long queue_type = Execution</i><i><br>
</i><i>set queue long acl_hosts = localhost</i><i><br>
</i><i>set queue long acl_hosts += <server></i><i><br>
</i><i>set queue long resources_max.cput = 48:00:00</i><i><br>
</i><i>set queue long resources_max.walltime = 72:00:00</i><i><br>
</i><i>set queue long acl_group_enable = True</i><i><br>
</i><i>set queue long acl_groups = aforti</i><i><br>
</i><i>set queue long enabled = True</i><i><br>
</i><i>set queue long started = True</i><i><br>
</i><i>#</i><i><br>
</i><i># Set server attributes.</i><i><br>
</i><i>#</i><i><br>
</i><i>set server scheduling = True</i><i><br>
</i><i>set server acl_host_enable = False</i><i><br>
</i><i>set server acl_hosts = <server></i><i><br>
</i><i>set server acl_hosts += localhost</i><i><br>
</i><i>set server default_queue = long</i><i><br>
</i><i>set server log_events = 511</i><i><br>
</i><i>set server mail_from = adm</i><i><br>
</i><i>set server next_job_number = 12</i></small><br>
<br>
<br>
</div>
<br>
</body>
</html>