<div dir="ltr">Brian,<div><br></div><div>What are your qmgr settings? Do you have a slot limit set there?</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Sep 18, 2013 at 3:34 PM, Andrus, Brian Contractor <span dir="ltr">&lt;<a href="mailto:bdandrus@nps.edu" target="_blank">bdandrus@nps.edu</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">





<div lang="EN-US" link="blue" vlink="purple">
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">That didn’t clear it up.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">I did find is that on one of my nodes it showed the job id as 20139590[]<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">(note the missing arrayid)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">There were only 4 jobs from the array on that node, along with some other jobs. I tagged the node offline, let the jobs drain (although it still showed the
 entire array job) and the ran pbs_mom purge.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">After that, I restarted pbs_server and it cleared up.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">Of course, now I cannot run any of the jobs that were blocked because “qrun: Execution server rejected request MSG=connection to mom timed out 20139590[1561].hamming.hamming.cluster“<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">It seems that those jobs want to run on that particular node and nowhere else, but the node is up and happy. It runs other jobs just fine.<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">I do tend to have difficulties with array jobs and torque. Lots of idiosyncrasies there.<u></u><u></u></span></p>
<div class="im">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">Brian Andrus<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">ITACS/Research Computing<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">Naval Postgraduate School<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">Monterey, California<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">voice: <a href="tel:831-656-6238" value="+18316566238" target="_blank">831-656-6238</a><u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u> <u></u></span></p>
</div><div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;"> <a href="mailto:torqueusers-bounces@supercluster.org" target="_blank">torqueusers-bounces@supercluster.org</a> [mailto:<a href="mailto:torqueusers-bounces@supercluster.org" target="_blank">torqueusers-bounces@supercluster.org</a>]
<b>On Behalf Of </b>Ken Nielson<br>
<b>Sent:</b> Wednesday, September 18, 2013 9:42 AM<br>
<b>To:</b> Torque Users Mailing List<br>
<b>Subject:</b> Re: [torqueusers] Slot limit unmatched<u></u><u></u></span></p>
</div>
</div><div><div class="h5">
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">Brian,<u></u><u></u></p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt">That is a problem. I wonder if you restart pbs_server if the slot limit problem clears up. If so it sounds like we have a counting problem in TORQUE.<u></u><u></u></p>
</div>
<p class="MsoNormal">Regards<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><u></u> <u></u></p>
<div>
<p class="MsoNormal">On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor &lt;<a href="mailto:bdandrus@nps.edu" target="_blank">bdandrus@nps.edu</a>&gt; wrote:<u></u><u></u></p>
<p class="MsoNormal">All,<br>
<br>
I am running torque 4.2.5<br>
I have a user who submitted an array job of ~2500 jobs<br>
I have &#39;set server max_slot_limit = 512&#39;<br>
<br>
But...<br>
There are only 8 of his jobs running, the others are blocked because they sat so long.<br>
Yet if I try to qrun one of them, I get:<br>
        qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running<br>
<br>
Why does torque think there are 512 slots currently in use when there are only 8?<br>
<br>
<br>
Brian Andrus<br>
ITACS/Research Computing<br>
Naval Postgraduate School<br>
Monterey, California<br>
voice: <a href="tel:831-656-6238" target="_blank">831-656-6238</a><br>
<br>
<br>
_______________________________________________<br>
torqueusers mailing list<br>
<a href="mailto:torqueusers@supercluster.org" target="_blank">torqueusers@supercluster.org</a><br>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><u></u><u></u></p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><br>
<br clear="all">
<br>
-- <br>
Ken Nielson<br>
<a href="tel:%2B1%20801.717.3700" value="+18017173700" target="_blank">+1 801.717.3700</a> office <a href="tel:%2B1%20801.717.3738" value="+18017173738" target="_blank">+1 801.717.3738</a> fax<br>
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606<br>
<a href="http://www.adaptivecomputing.com" target="_blank">www.adaptivecomputing.com</a><u></u><u></u></p>
</div>
</div></div></div>
</div>
</div>

<br>_______________________________________________<br>
torqueusers mailing list<br>
<a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div>David Beer | Senior Software Engineer</div><div>Adaptive Computing</div>
</div>