Found this post online: <a href="http://www.supercluster.org/pipermail/mauiusers/2010-February/004116.html">http://www.supercluster.org/pipermail/mauiusers/2010-February/004116.html</a><br><br>I also have JOBNODEMATCHPOLICY EXACTNODE and NODEACCESSPOLICY SINGLEJOB set in my configuration. Could the bug described there still be present in Maui?<br>
<br>I tested with a smaller cluster size and let me explain the scenario again :<br><br>This time I have a 6 node cluster with Torque-3.0.3 and Maui running. Additional configuration in my Maui configuration file :<br><br>
----------<br>BACKFILLPOLICY FIRSTFIT<br>RESERVATIONPOLICY CURRENTHIGHEST<br><br>ENABLEMULTIREQJOBS TRUE<br>JOBNODEMATCHPOLICY EXACTNODE<br>NODEACCESSPOLICY SINGLEJOB<br><br>----------<br><br>Now I submit a 2-node job with the following resource requirement :<br>
<br>----------<br>#PBS -l nodes=2,walltime=0:10:00<br>---------<br><br>This job starts on node1/0 + node2/0<br><br>Now, I submit another 4 node job with the following resource requirement :<br><br>---------<br>#PBS -l nodes=1:ppn=2+3,walltime=0:05:00<br>
--------<br><br>This job is also started but with following resources : <b>node3/0 + node3/1</b> + <b>node4/0 + node4/1</b> + node5/0<br><br>I would expect this job to use the resources as follows : node3/0 + node3/1 + node4/0 + node5/0 + node6/0<br>
But it did not use node6 at all. Instead, it put 2 procs each on node3 and node4 and 1 proc on node5, and node6 remained idle.<br><br>Is this a bug, or is some other configuration setting required?<br>
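To make the mismatch concrete, here is a minimal sketch in Python (not part of Torque or Maui) that compares a nodes request with the slots a job actually received. The function names are hypothetical, and the parser assumes the request only uses numeric node counts with an optional ppn= property; host names, features, and other properties are not handled.

```python
from collections import Counter


def parse_nodes_spec(spec):
    """Expand a nodes spec such as '1:ppn=2+3' into per-node proc counts.

    Assumption: each '+'-separated chunk is '<count>[:ppn=<n>]'.
    '1:ppn=2+3' becomes [2, 1, 1, 1].
    """
    per_node = []
    for chunk in spec.split("+"):
        fields = chunk.split(":")
        count = int(fields[0])
        ppn = 1  # a chunk without ppn= asks for one proc per node
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        per_node.extend([ppn] * count)
    return per_node


def parse_allocation(alloc):
    """Count procs per host in a string like 'node3/0 + node3/1 + node4/0'."""
    hosts = [slot.strip().split("/")[0] for slot in alloc.split("+")]
    return Counter(hosts)


def matches_request(spec, alloc):
    """True if the allocation has the same per-node proc counts as the spec."""
    wanted = sorted(parse_nodes_spec(spec))
    got = sorted(parse_allocation(alloc).values())
    return wanted == got


if __name__ == "__main__":
    spec = "1:ppn=2+3"
    actual = "node3/0 + node3/1 + node4/0 + node4/1 + node5/0"
    expected = "node3/0 + node3/1 + node4/0 + node5/0 + node6/0"
    print(matches_request(spec, actual))
    print(matches_request(spec, expected))
```

With the two allocations from this message, the check fails for the placement the job actually got (3 nodes) and passes for the expected one (4 nodes), which is exactly the discrepancy in question.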
<br>Thanks,<br>Kunal<br><br><div class="gmail_quote">On Fri, Jun 1, 2012 at 3:57 PM, Kunal Rao <span dir="ltr"><<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I removed NODEALLOCATIONPOLICY and tried again. This time the job started, but the node allocation was not as expected.<br><br>The job needs 1 node with 2 procs and 3 nodes with 1 proc each. Instead, it was allocated only 3 nodes: 2 with 2 procs each and 1 with 1 proc. I am not sure whether this is a bug or a conflict in the configuration.<br>
<br>My current additional configurations are :<div class="im"><br><br>BACKFILLPOLICY FIRSTFIT<br>RESERVATIONPOLICY CURRENTHIGHEST<br><br>ENABLEMULTIREQJOBS TRUE<br></div>JOBNODEMATCHPOLICY EXACTNODE<br>NODEACCESSPOLICY SINGLEJOB<br>
<br>I also tried with this, but still the same :<div class="im"><br><br>BACKFILLPOLICY FIRSTFIT<br>RESERVATIONPOLICY CURRENTHIGHEST<br><br>ENABLEMULTIREQJOBS TRUE<br></div>
NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF='APROCS'<br>
JOBNODEMATCHPOLICY EXACTNODE<br>
NODEACCESSPOLICY SINGLEJOB<br><br>Any suggestions ?<br><br>Thanks,<br>Kunal<div class="HOEnZb"><div class="h5"><br><br><br><div class="gmail_quote">On Thu, May 31, 2012 at 10:26 PM, Kunal Rao <span dir="ltr"><<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I need NODEACCESSPOLICY, maybe I'll remove NODEALLOCATIONPOLICY and check tomorrow.<div><br></div>
<div>Thanks,</div><div>Kunal<div><div><br><br><div class="gmail_quote">On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia <span dir="ltr"><<a href="mailto:jujj603@gmail.com" target="_blank">jujj603@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>That all seems OK. You could try deleting the additional settings in maui.cfg, such as <span>NODEALLOCATIONPOLICY, </span><span>NODEACCESSPOLICY, or use the defaults or other options.</span></div>
<div><div><div>
<span><br></span></div><br><div class="gmail_quote">On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao <span dir="ltr"><<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has this entry for each of the 10 nodes :<div><br></div><div><node_name> np=16 gpus=1</div><div><br></div><div>Thanks,</div><div>Kunal<div><div><br><div><br>
<div class="gmail_quote">
On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia <span dir="ltr"><<a href="mailto:jujj603@gmail.com" target="_blank">jujj603@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
How many cores are on each of the 10 nodes? I ask because you are trying to allocate 2 processors on one node. And how did you configure TORQUE_HOME/server_priv/nodes ?<div><div><br><br><div class="gmail_quote">
On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao <span dir="ltr"><<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Queue / Server configuration :<br>
<br>
---------------<br>
<br>
qmgr -c 'p s'<br>
#<br>
# Create queues and set their attributes.<br>
#<br>
#<br>
# Create and define queue batch<br>
#<br>
create queue batch<br>
set queue batch queue_type = Execution<br>
set queue batch resources_default.nodes = 1<br>
set queue batch resources_default.walltime = 01:00:00<br>
set queue batch enabled = True<br>
set queue batch started = True<br>
#<br>
# Set server attributes.<br>
#<br>
set server scheduling = True<br>
set server acl_hosts = fire16<br>
set server acl_roots = root@fire16.csa.local<br>
set server managers = root@fire16.csa.local<br>
set server operators = root@fire16.csa.local<br>
set server default_queue = batch<br>
set server log_events = 511<br>
set server mail_from = adm<br>
set server scheduler_iteration = 20<br>
set server node_check_rate = 150<br>
set server tcp_timeout = 6<br>
set server mom_job_sync = True<br>
set server keep_completed = 300<br>
set server allow_node_submit = True<br>
set server next_job_number = 6331<br>
<br>
---------------<br>
<br>
Job resource requirement :<br>
<br>
---------<br>
<br>
#PBS -l nodes=1:ppn=2+3,walltime=0:05:00<br>
<br>
---------<br>
<br>
"pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all<br>
accessible.<br>
<br>
Thanks,<br>
Kunal<br>
<div><div><br>
<br>
On 5/31/12, Ju JiaJia <<a href="mailto:jujj603@gmail.com" target="_blank">jujj603@gmail.com</a>> wrote:<br>
> Please give your queue/server configuration and your job's resource needs,<br>
> cpu/memory etc. Also, are all 10 nodes accessible? You can use pbsnodes<br>
> to check this.<br>
><br>
> On Thu, May 31, 2012 at 10:53 PM, Kunal Rao <<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>> wrote:<br>
><br>
>> Hello,<br>
>><br>
>> Please see the below message. I had posted it on maui users mailing list,<br>
>> but did not get any response, so thought of posting it here on torque<br>
>> users<br>
>> mailing list (in case someone knows). Kindly let me know if you have<br>
>> any comments / ideas / suggestions.<br>
>><br>
>> Thanks,<br>
>> Kunal<br>
>><br>
>> ---------- Forwarded message ----------<br>
>> From: Kunal Rao <<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>><br>
>> Date: Wed, May 23, 2012 at 2:30 PM<br>
>> Subject: Re: Multi-req job not starting<br>
>> To: <a href="mailto:mauiusers@supercluster.org" target="_blank">mauiusers@supercluster.org</a><br>
>><br>
>><br>
>> There was a similar post earlier :<br>
>> <a href="http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html" target="_blank">http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html</a><br>
>><br>
>> But I did not find any response to it. Can anyone please provide some ideas<br>
>> / suggestions on this issue?<br>
>><br>
>> Thanks,<br>
>> Kunal<br>
>><br>
>><br>
>> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao <<a href="mailto:kunalgrao@gmail.com" target="_blank">kunalgrao@gmail.com</a>> wrote:<br>
>><br>
>>> Hello,<br>
>>><br>
>>> I have a 10 node cluster. There are 3 jobs: one which needs 2 nodes (with<br>
>>> 1 task per node), another which needs 4 nodes (with 1 task per node), and<br>
>>> a third which needs 4 nodes (with 2 tasks on 1 node and 1 task each<br>
>>> on the other 3 nodes).<br>
>>><br>
>>> Additional configuration in maui.cfg is :<br>
>>><br>
>>> BACKFILLPOLICY FIRSTFIT<br>
>>> RESERVATIONPOLICY CURRENTHIGHEST<br>
>>><br>
>>> ENABLEMULTIREQJOBS TRUE<br>
>>> NODEALLOCATIONPOLICY MINRESOURCE<br>
>>> NODEACCESSPOLICY SINGLEJOB<br>
>>> JOBNODEMATCHPOLICY EXACTNODE<br>
>>><br>
>>> I am observing that if the first 2 jobs are running, the third one does<br>
>>> not start (even though 4 nodes are available) until one of the jobs<br>
>>> completes. With checkjob -v <job_id> it shows the following output :<br>
>>><br>
>>> ------------------<br>
>>><br>
>>> checking job 5791 (RM job '5791.fire16.csa.local')<br>
>>><br>
>>> State: Idle<br>
>>> Creds: user:kunal group:kunal class:batch qos:DEFAULT<br>
>>> WallTime: 00:00:00 of 00:04:51<br>
>>> SubmitTime: Wed May 23 11:52:04<br>
>>> (Time Queued Total: 00:48:52 Eligible: 00:48:52)<br>
>>><br>
>>> StartDate: 00:00:01 Wed May 23 12:40:57<br>
>>> Total Tasks: 2<br>
>>><br>
>>> Req[0] TaskCount: 2 Partition: ALL<br>
>>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0<br>
>>> Opsys: [NONE] Arch: [NONE] Features: [NONE]<br>
>>> Exec: '' ExecSize: 0 ImageSize: 0<br>
>>> Dedicated Resources Per Task: PROCS: 1<br>
>>> NodeAccess: SINGLEJOB<br>
>>> TasksPerNode: 2 NodeCount: 1<br>
>>><br>
>>> Req[1] TaskCount: 3 Partition: ALL<br>
>>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0<br>
>>> Opsys: [NONE] Arch: [NONE] Features: [NONE]<br>
>>> Exec: '' ExecSize: 0 ImageSize: 0<br>
>>> Dedicated Resources Per Task: PROCS: 1<br>
>>> NodeAccess: SINGLEJOB<br>
>>> NodeCount: 3<br>
>>><br>
>>><br>
>>> IWD: [NONE] Executable: [NONE]<br>
>>> Bypass: 5 StartCount: 0<br>
>>> PartitionMask: [ALL]<br>
>>> Flags: RESTARTABLE<br>
>>><br>
>>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51)<br>
>>> PE: 5.00 StartPriority: 48<br>
>>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')<br>
>>><br>
>>> ------------<br>
>>><br>
>>> What could be the reason for not starting this job ? How do I resolve<br>
>>> this ?<br>
>>><br>
>>> Thanks,<br>
>>> Kunal<br>
>>><br>
>><br>
>><br>
>><br>
>> _______________________________________________<br>
>> torqueusers mailing list<br>
>> <a href="mailto:torqueusers@supercluster.org" target="_blank">torqueusers@supercluster.org</a><br>
>> <a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
>><br>
>><br>
><br>
</div></div></blockquote></div><br>
</div></div><br>
<br></blockquote></div><br></div></div></div></div>
<br></blockquote></div><br>
</div></div><br>
<br></blockquote></div><br></div></div></div>
</blockquote></div><br>
</div></div></blockquote></div><br>