<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi,<br>
<br>
I have installed a mini test cluster with torque and maui. We have
used maui/torque for years on our grid cluster and are now
upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately, with this
new combination maui doesn't seem to work correctly: when I submit
jobs, it behaves as if there weren't any free resources. Even
when I installed only torque and maui with a bare minimum
configuration I got the same behaviour, i.e.<br>
<br>
1) When I submit jobs, they just remain queued:<br>
<br>
<small><i><small><i>[root@</i></small><small><i><server>
maui]# </i></small>qstat -an1</i><i><br>
</i><i><br>
</i><i><server>: </i><i><br>
</i><i>
Req'd Req'd Elap</i><i><br>
</i><i>Job ID Username Queue Jobname
SessID NDS TSK Memory Time S Time</i><i><br>
</i><i>-------------------- -------- -------- ----------------
------ ----- --- ------ ----- - -----</i><i><br>
</i><i>10.<server> aforti long pbs-vm3.sh
-- -- -- -- -- Q -- -- </i><i><br>
</i><i>11.s<server> aforti long pbs-vm3.sh
-- -- -- -- -- Q -- -- </i></small><br>
<br>
2) If I run qrun <jobid> the job runs, so I assume the problem
is not between the torque server and the torque mom.<br>
3) With the old versions, showq displayed the WCLimit of the
default queue. Now it displays 0 at first and then changes by
itself to 100 days:<br>
<br>
<small><i>[root@</i></small><small><i><server> maui]# showq</i><i><br>
</i><i>ACTIVE JOBS--------------------</i><i><br>
</i><i>JOBNAME USERNAME STATE PROC
REMAINING STARTTIME</i><i><br>
</i><i><br>
</i><i><br>
</i><i> 0 Active Jobs 0 of 16 Processors Active
(0.00%)</i><i><br>
</i><i> 0 of 1 Nodes Active
(0.00%)</i><i><br>
</i><i><br>
</i><i>IDLE JOBS----------------------</i><i><br>
</i><i>JOBNAME USERNAME STATE PROC
WCLIMIT QUEUETIME</i><i><br>
</i><i><br>
</i><i>10 aforti Idle 1 99:23:59:59
Tue Oct 9 15:32:13</i><i><br>
</i><i>11 aforti Idle 1 99:23:59:59
Tue Oct 9 16:39:09</i><i><br>
</i><i><br>
</i><i>2 Idle Jobs</i><i><br>
</i><i><br>
</i><i>BLOCKED JOBS----------------</i><i><br>
</i><i>JOBNAME USERNAME STATE PROC
WCLIMIT QUEUETIME</i><i><br>
</i><i><br>
</i><i><br>
</i><i>Total Jobs: 2 Active Jobs: 0 Idle Jobs: 2 Blocked
Jobs: 0</i><i><br>
</i></small><br>
4) checkjob <jobid> only tells me that the job cannot be run in the
default partition, without giving any particular reason:<br>
<br>
<small><i>[.....]<br>
PE: 1.00 StartPriority: 120</i><i><br>
</i><i>cannot select job 10 for partition DEFAULT (Class)</i></small><br>
<br>
5) checknode shows the node as free, in case that wasn't clear from
the other commands:<br>
<br>
<small><i>[root@</i></small><small><i><server> maui]# !checkno</i><i><br>
</i><i>checknode <node></i><i><br>
</i><i><br>
</i><i>checking node <node></i><i><br>
</i><i><br>
</i><i>State: Idle (in current state for 00:55:10)</i><i><br>
</i><i>Configured Resources: PROCS: 16 MEM: 23G SWAP: 31G DISK:
1M</i><i><br>
</i><i>Utilized Resources: SWAP: 202M</i><i><br>
</i><i>Dedicated Resources: [NONE]</i><i><br>
</i><i>Opsys: linux Arch: [NONE]</i><i><br>
</i><i>Speed: 1.00 Load: 0.000</i><i><br>
</i><i>Network: [DEFAULT]</i><i><br>
</i><i>Features: [lcgpro]</i><i><br>
</i><i>Attributes: [Batch]</i><i><br>
</i><i>Classes: [DEFAULT 1:1]</i><i><br>
</i><i><br>
</i><i>Total Time: 3:06:35 Up: 3:06:24 (99.90%) Active: 00:00:10
(0.09%)</i><i><br>
</i><i><br>
</i><i>Reservations:</i><i><br>
</i><i>NOTE: no reservations on node</i></small><br>
<br>
6) showbf -v, though, says my nodes are blocked by reservations,
even though checknode clearly tells me there are no reservations
on that node. Our local maui.cfg does contain a reservation for
1 proc, but I'm not sure why it would block the whole node:<br>
<br>
<small><i>[root@</i></small><small><i><server2> server_logs]#
showbf -v</i><i><br>
</i><i>backfill window (user: 'root' group: 'root' partition: ALL)
Tue Oct 9 17:08:59</i><i><br>
</i><i><br>
</i><i> 3 procs available with no timelimit</i><i><br>
</i><i><br>
</i><i>node <node2> is blocked by reservation sft.0.0 in
INFINITY</i><i><br>
</i><big><br>
To be sure, I removed the reservation. Even with it removed and
maui.cfg reduced to the default version with nothing else in it,
showbf tells me the node is blocked by
"reservation NONE in INFINITY":<br>
<br>
<small><i>[root@</i></small></big></small><small><big><small><i><server>
maui]# showbf -v</i><i><br>
</i><i>backfill window (user: 'root' group: 'root' partition:
ALL) Tue Oct 9 17:37:58</i><i><br>
</i><i><br>
</i><i> 16 procs available with no timelimit</i><i><br>
</i><i><br>
</i><i>node <node> is blocked by reservation NONE in
INFINITY</i><i><br>
</i><big><br>
I'm not sure how to proceed, because the log files don't tell
me anything and all the references I have found to similar
problems have remained unanswered. <br>
<br>
Thanks for any help. Here are the rpms I used:<br>
<br>
<small><i>maui-3.3-4.el5</i><i><br>
</i><i>maui-client-3.3-4.el5</i><i><br>
</i><i>maui-server-3.3-4.el5</i><i><br>
</i><i>torque-2.5.7-7.el5</i><i><br>
</i><i>torque-client-2.5.7-7.el5</i><i><br>
</i><i>torque-server-2.5.7-7.el5</i></small></big></small></big><i><br>
</i><i><small><big><small><big>libtorque-2.5.7-7.el5</big></small></big></small></i></small><big><small><i><br>
</i></small><br>
The maui.cfg:<br>
<br>
<i><small><small># <br>
# MAUI configuration example<br>
# @(#)maui.cfg David Groep 20031015.1<br>
# for MAUI version 3.2.5<br>
# <br>
SERVERHOST <server></small></small></i></big><br>
<big><i><small><small>ADMIN1 root <br>
ADMINHOST <server></small></small></i></big><br>
<big><i><small><small>RMTYPE[0] PBS<br>
RMHOST[0] <server></small></small></i></big><br>
<big><i><small><small>RMSERVER[0] <server></small></small></i></big><br>
<big><i><small><small><br>
SERVERPORT 40559<br>
SERVERMODE NORMAL<br>
<br>
# Set PBS server polling interval. Since we have many short
jobs<br>
# and want fast turn-around, set this to 10 seconds
(default: 2 minutes)<br>
RMPOLLINTERVAL 00:00:10<br>
<br>
# a max. 10 MByte log file in a logical location<br>
LOGFILE /var/log/maui.log<br>
LOGFILEMAXSIZE 10000000<br>
LOGLEVEL 3</small></small></i><br>
<br>
</big>and the Torque config:<br>
<br>
<small><i>create queue long</i><i><br>
</i><i>set queue long queue_type = Execution</i><i><br>
</i><i>set queue long acl_hosts = localhost</i><i><br>
</i><i>set queue long acl_hosts += <server></i><i><br>
</i><i>set queue long resources_max.cput = 48:00:00</i><i><br>
</i><i>set queue long resources_max.walltime = 72:00:00</i><i><br>
</i><i>set queue long acl_group_enable = True</i><i><br>
</i><i>set queue long acl_groups = aforti</i><i><br>
</i><i>set queue long enabled = True</i><i><br>
</i><i>set queue long started = True</i><i><br>
</i><i>#</i><i><br>
</i><i># Set server attributes.</i><i><br>
</i><i>#</i><i><br>
</i><i>set server scheduling = True</i><i><br>
</i><i>set server acl_host_enable = False</i><i><br>
</i><i>set server acl_hosts = <server></i><i><br>
</i><i>set server acl_hosts += localhost</i><i><br>
</i><i>set server default_queue = long</i><i><br>
</i><i>set server log_events = 511</i><i><br>
</i><i>set server mail_from = adm</i><i><br>
</i><i>set server next_job_number = 12</i></small><br>
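<br>
One thing that can be checked mechanically from the output above is
whether the queue name (long) appears on the node's "Classes:" line,
given that the checkjob message ends in "(Class)". Below is a minimal
Python sketch of that comparison. It is not part of maui or torque:
the function names are made up for illustration, the input strings are
shortened copies of the output pasted above, and whether a class
mismatch is actually the cause here is only an assumption.<br>
<pre>
```python
import re


def node_classes(checknode_output):
    """Return the class names on the 'Classes:' line of checknode output."""
    match = re.search(r"^Classes:\s*(.*)$", checknode_output, re.MULTILINE)
    if match is None:
        return []
    # Entries look like "[DEFAULT 1:1]"; keep only the name before the space.
    return re.findall(r"\[([^\s\]]+)[^\]]*\]", match.group(1))


def reject_reason(checkjob_output):
    """Return the word in parentheses after 'cannot select job ...', if any."""
    match = re.search(
        r"cannot select job \S+ for partition \S+ \((\w+)\)", checkjob_output
    )
    return match.group(1) if match else None


def queue_listed_on_node(queue, checknode_output):
    """True if the queue name appears among the node's classes."""
    return queue in node_classes(checknode_output)


# Shortened copies of the output quoted earlier in this mail.
CHECKNODE = "State: Idle\nClasses: [DEFAULT 1:1]\n"
CHECKJOB = "cannot select job 10 for partition DEFAULT (Class)\n"

if __name__ == "__main__":
    print("node classes:", node_classes(CHECKNODE))
    print("reject reason:", reject_reason(CHECKJOB))
    print("queue 'long' listed:", queue_listed_on_node("long", CHECKNODE))
```
</pre>
With the output above, the node lists only DEFAULT, so the queue
"long" does not appear among its classes.<br>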
<pre class="moz-signature" cols="72">--
Facts aren't facts if they come from the wrong people. (Paul Krugman)
</pre>
</body>
</html>