<font size=2 face="sans-serif">When I did what you recommended,</font>
<br>
<br><font size=2 face="sans-serif">qsub -I -l procs=48</font>
<br>
<br><font size=2 face="sans-serif">my node file had only one entry in it:</font>
<br>
<br><font size=2 face="sans-serif">eos { ~ }$ cat $PBS_NODEFILE</font>
<br><font size=2 face="sans-serif">eos</font>
<br>
<br><font size=2 face="sans-serif">I need a node file with one entry for
each processor. I also want to be able to request chunks of resources (i.e.
nodes=6:ppn=4), since I have a mix of 4-core and 8-core machines and I don't want
to get fewer than four processors on any one machine.</font>
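<br><font size=2 face="sans-serif">A quick way to see what the server actually handed out is to tally the node file. The sketch below assumes only the usual layout: one host name per line, with a host repeated once per processor assigned on it. The function names tally_nodefile and count_slots are hypothetical helpers, not TORQUE commands. If a nodes=6:ppn=4 request is honoured as chunks, count_slots should report 24 and every host should appear a multiple of 4 times.</font>

```shell
#!/bin/sh
# Hypothetical helpers (not part of TORQUE) for inspecting a PBS node file.
# Assumes one host name per line, repeated once per assigned processor.

# Print "host count" for each distinct host in the node file.
# Usage: tally_nodefile [path]   (defaults to $PBS_NODEFILE)
tally_nodefile() {
  nodefile="${1:-$PBS_NODEFILE}"
  # sort groups identical host names so that uniq -c can count them
  sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# Print the total number of entries, i.e. the processor count the job got.
count_slots() {
  nodefile="${1:-$PBS_NODEFILE}"
  # tr strips the leading padding some wc implementations emit
  wc -l < "$nodefile" | tr -d ' '
}
```

<br><font size=2 face="sans-serif">Inside an interactive job, calling tally_nodefile with no argument reads $PBS_NODEFILE. For the procs=48 case above it would print a single line, "eos 1".</font>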
<br>
<br><font size=2 face="sans-serif">Section 10.1.7 of the admin documentation,
quoted below, suggests that I s</font>
<br><font size=2 face="sans-serif">qsub will not allow the submission of
jobs requesting many processors</font>
<br>
<br><font size=2 face="sans-serif">TORQUE's definition of a node is context
sensitive and can appear inconsistent. The qsub '-l nodes=<X>' expression
can at times indicate a request for X processors and at other times be
interpreted as a request for X nodes. While qsub allows multiple interpretations
of the keyword nodes, aspects of the TORQUE server's logic are not so flexible.
Consequently, if a job is using '-l nodes' to specify processor count and the
requested number of processors exceeds the available number of physical nodes,
the server daemon will reject the job.</font>
<br>
<br><font size=2 face="sans-serif">To get around this issue, the server
can be told it has an inflated number of nodes using the resources_available
attribute. To take effect, this attribute should be set on both the server and
the associated queue as in the example below. See resources_available for more
information.</font>
<br>
<br><font size=2 face="sans-serif">> qmgr</font>
<br><font size=2 face="sans-serif">Qmgr: set server resources_available.nodect=2048</font>
<br><font size=2 face="sans-serif">Qmgr: set queue batch resources_available.nodect=2048</font>
<br>
<br><font size=2 face="sans-serif">NOTE: The pbs_server daemon will need
to be restarted before these changes take effect.</font>
<br>
<br><font size=2 face="sans-serif">Any ideas?</font>
<br>
<br><font size=2 face="sans-serif">Thanks,<br>
<br>
Jon Shelley<br>
HPC Software Consultant<br>
Idaho National Lab<br>
Phone (208) 526-9834<br>
Fax (208) 526-0122<br>
</font>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td width=40%><font size=1 face="sans-serif"><b>Roman Baranowski <roman@chem.ubc.ca></b>
</font>
<br><font size=1 face="sans-serif">Sent by: torqueusers-bounces@supercluster.org</font>
<p><font size=1 face="sans-serif">03/02/2010 06:49 PM</font>
<td width=59%>
<table width=100%>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">To</font></div>
<td><font size=1 face="sans-serif">Jonathan K Shelley <Jonathan.Shelley@inl.gov></font>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">cc</font></div>
<td><font size=1 face="sans-serif">torqueusers <torqueusers@supercluster.org></font>
<tr valign=top>
<td>
<div align=right><font size=1 face="sans-serif">Subject</font></div>
<td><font size=1 face="sans-serif">Re: [torqueusers] Job with high proc
count will not schedule</font></table>
<br>
<br></table>
<br>
<br>
<br><tt><font size=2><br>
Dear Jonathan,<br>
<br>
You have only 5 nodes, so bumping up resources_available.nodect with
qmgr will never work. Have you tried<br>
qsub -I -l procs=112<br>
<br>
All the best<br>
Roman<br>
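<br>
Roman's point is that a nodes= count is checked against physical hosts (5 here, as the qrun message reports), while the cluster has 112 cores in total. The sketch below pulls both numbers out of pbsnodes -a. It assumes TORQUE's plain-text pbsnodes format, where each node block starts with the unindented host name and contains an indented "np = &lt;slots&gt;" line. summarize_pbsnodes is a hypothetical helper name.<br>

```shell
#!/bin/sh
# Hypothetical helper: summarize `pbsnodes -a` output as node and slot totals.
# Assumes TORQUE's text format: a node block begins with the host name in
# column 0, followed by indented "attribute = value" lines, one of which is
# "np = <slots>". Blank lines between blocks are ignored.
summarize_pbsnodes() {
  awk '
    /^[^ \t]/      { nodes++ }                            # unindented line: new node
    /^[ \t]+np = / { split($0, kv, " = "); slots += kv[2] } # add this node'"'"'s np
    END            { printf "%d nodes, %d slots\n", nodes, slots }
  '
}
```

Usage on the head node: pbsnodes -a | summarize_pbsnodes. If np in the server's nodes file matches the core count, the cluster in this thread should report 5 nodes and 112 slots.<br>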
<br>
<br>
On Tue, 2 Mar 2010, Jonathan K Shelley wrote:<br>
<br>
> I have a 5-node cluster with 112 cores. I just installed TORQUE 2.4.6.
It seems to be working, but when<br>
> I submit the following:<br>
> <br>
> qsub -I -l nodes=32<br>
> qsub: waiting for job 551.eos.inel.gov to start<br>
> <br>
> When I try a qrun, I get the following:<br>
> <br>
> eos:/opt/torque/sbin # qrun 551<br>
> qrun: Resource temporarily unavailable MSG=job allocation request
exceeds currently available cluster<br>
> nodes, 32 requested, 5 available 551.eos.inel.gov<br>
> <br>
> But it never schedules. I saw in the documentation that I needed to
set resources_available.nodect<br>
> to a high number, so I did.<br>
> <br>
> When I run printserverdb, I get:<br>
> <br>
> eos:/opt/torque/sbin # printserverdb<br>
> ---------------------------------------------------<br>
> numjobs: 0<br>
> numque: 1<br>
> jobidnumber: 552<br>
> sametm: 1267574146<br>
> --attributes--<br>
> total_jobs = 1<br>
> state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0<br>
> default_queue = all<br>
> log_events = 511<br>
> mail_from = adm<br>
> query_other_jobs = True<br>
> resources_available.nodect = 2048<br>
> scheduler_iteration = 600<br>
> node_check_rate = 150<br>
> tcp_timeout = 6<br>
> pbs_version = 2.4.6<br>
> next_job_number = 551<br>
> net_counter = 3 0 0<br>
> <br>
> eos:/opt/torque/sbin # qmgr -c "p s"<br>
> #<br>
> # Create queues and set their attributes.<br>
> #<br>
> #<br>
> # Create and define queue all<br>
> #<br>
> create queue all<br>
> set queue all queue_type = Execution<br>
> set queue all resources_max.walltime = 672:00:00<br>
> set queue all resources_available.nodect = 2048<br>
> set queue all enabled = True<br>
> set queue all started = True<br>
> #<br>
> # Set server attributes.<br>
> #<br>
> set server acl_hosts = eos<br>
> set server managers = awm@eos.inel.gov<br>
> set server managers += lucads2@eos.inel.gov<br>
> set server managers += poolrl@eos.inel.gov<br>
> set server managers += ''@eos.inel.gov<br>
> set server default_queue = all<br>
> set server log_events = 511<br>
> set server mail_from = adm<br>
> set server query_other_jobs = True<br>
> set server resources_available.nodect = 2048<br>
> set server scheduler_iteration = 600<br>
> set server node_check_rate = 150<br>
> set server tcp_timeout = 6<br>
> set server next_job_number = 552<br>
> <br>
> Any ideas what I need to do to get this working?<br>
> <br>
> Thanks,<br>
> <br>
> Jon Shelley<br>
> <br>
>_______________________________________________<br>
torqueusers mailing list<br>
torqueusers@supercluster.org<br>
</font></tt><a href=http://www.supercluster.org/mailman/listinfo/torqueusers><tt><font size=2>http://www.supercluster.org/mailman/listinfo/torqueusers</font></tt></a><tt><font size=2><br>
</font></tt>
<br>