<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
span.EmailStyle18
        {mso-style-type:personal-reply;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body bgcolor=white lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Hello,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>The diagnose -p output was truncated. I was hoping to confirm that jobs 33-35 (Queued) did not have a large QTime, which could raise their priority above that of your job 38. That could cause them to make job 38 wait even though they are not running. It sounds doubtful in your scenario, but I’ve seen it cause issues before.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>If you delete the Q state jobs 33-35, does your job 38 start?<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>We use the same preemption concept you’re trying to achieve, but I’m having a hard time narrowing down the cause of your error. Two small differences in our configuration are the backfill policy and the reservation policy. 
You might try these settings and then restart Maui:<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>BACKFILLPOLICY BESTFIT<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>RESERVATIONPOLICY CURRENTHIGHEST<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'> Joseph Farran [mailto:jfarran@uci.edu] <br><b>Sent:</b> Thursday, February 16, 2012 11:48 AM<br><b>To:</b> Edsall, William (WJ); mauiusers@supercluster.org<br><b>Subject:</b> Re: [Mauiusers] Backfill Jobs prevent PREEMPTOR jobs from running?<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:13.5pt'>Hi Edsall.<br><br>Thank you for responding. I have a few more nodes now, but the same configuration. I am including the diagnose -p output along with other details:<br><br>We have 13 64-core nodes. All nodes have the 'free' feature and a queue named 'free' as PREEMPTEE so that we can harvest idle cycles when the nodes are not in use by their owners.<br><br>As user "juser", I load up the 'free' queue (PREEMPTEE) as follows:<br><br>1.hpc.cluster. juser free test 24904 1 63 -- 72:00 R 00:01<br>2.hpc.cluster. juser free test 29346 1 63 -- 72:00 R 00:01<br>3.hpc.cluster. juser free test 42900 1 63 -- 72:00 R 00:01<br>4.hpc.cluster. juser free test 30291 1 63 -- 72:00 R 00:01<br>5.hpc.cluster. 
juser free test 26417 1 63 -- 72:00 R 00:01<br>6.hpc.cluster. juser free test 40206 1 63 -- 72:00 R 00:01<br>7.hpc.cluster. juser free test 1786 1 63 -- 72:00 R 00:01<br>8.hpc.cluster. juser free test 62436 1 63 -- 72:00 R 00:01<br>9.hpc.cluster. juser free test 49087 1 63 -- 72:00 R 00:01<br>10.hpc.cluster juser free test 45691 1 63 -- 72:00 R 00:01<br>11.hpc.cluster juser free test 41386 1 63 -- 72:00 R 00:01<br>12.hpc.cluster juser free test 35204 1 63 -- 72:00 R 00:01<br>13.hpc.cluster juser free test 51043 1 63 -- 72:00 R 00:01<br>14.hpc.cluster juser free test 24948 1 1 -- 72:00 R 00:01<br>15.hpc.cluster juser free test 29390 1 1 -- 72:00 R 00:01<br>16.hpc.cluster juser free test 42944 1 1 -- 72:00 R 00:01<br>17.hpc.cluster juser free test 30335 1 1 -- 72:00 R 00:01<br>18.hpc.cluster juser free test 26461 1 1 -- 72:00 R 00:01<br>19.hpc.cluster juser free test 40250 1 1 -- 72:00 R 00:01<br>20.hpc.cluster juser free test 1830 1 1 -- 72:00 R 00:01<br>21.hpc.cluster juser free test 62480 1 1 -- 72:00 R 00:01<br>22.hpc.cluster juser free test 49131 1 1 -- 72:00 R 00:01<br>23.hpc.cluster juser free test 45735 1 1 -- 72:00 R 00:01<br>24.hpc.cluster juser free test 41430 1 1 -- 72:00 R 00:01<br>25.hpc.cluster juser free test 35248 1 1 -- 72:00 R 00:01<br>26.hpc.cluster juser free test 51087 1 1 -- 72:00 R 00:01<br>27.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>28.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>29.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>30.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>31.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>32.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>33.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>34.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>35.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br><br>As user "tw", who owns the 'tw' nodes, I run:<br><br> qsub -I -q tw -l nodes=6:ppn=64<br><br>And preemption works as expected:<br><br>1.hpc.cluster. 
juser free test 24904 1 63 -- 72:00 R 00:02<br>2.hpc.cluster. juser free test 29346 1 63 -- 72:00 S 00:01<br>3.hpc.cluster. juser free test 42900 1 63 -- 72:00 S 00:01<br>4.hpc.cluster. juser free test 30291 1 63 -- 72:00 S 00:01<br>5.hpc.cluster. juser free test 26417 1 63 -- 72:00 S 00:01<br>6.hpc.cluster. juser free test 40206 1 63 -- 72:00 S 00:01<br>7.hpc.cluster. juser free test 1786 1 63 -- 72:00 S 00:01<br>8.hpc.cluster. juser free test 62436 1 63 -- 72:00 R 00:01<br>9.hpc.cluster. juser free test 49087 1 63 -- 72:00 R 00:02<br>10.hpc.cluster juser free test 45691 1 63 -- 72:00 R 00:02<br>11.hpc.cluster juser free test 41386 1 63 -- 72:00 R 00:02<br>12.hpc.cluster juser free test 35204 1 63 -- 72:00 R 00:02<br>13.hpc.cluster juser free test 51043 1 63 -- 72:00 R 00:02<br>14.hpc.cluster juser free test 24948 1 1 -- 72:00 R 00:02<br>15.hpc.cluster juser free test 29390 1 1 -- 72:00 S 00:02<br>16.hpc.cluster juser free test 42944 1 1 -- 72:00 S 00:01<br>17.hpc.cluster juser free test 30335 1 1 -- 72:00 S 00:01<br>18.hpc.cluster juser free test 26461 1 1 -- 72:00 S 00:01<br>19.hpc.cluster juser free test 40250 1 1 -- 72:00 S 00:01<br>20.hpc.cluster juser free test 1830 1 1 -- 72:00 S 00:01<br>21.hpc.cluster juser free test 62480 1 1 -- 72:00 R 00:02<br>22.hpc.cluster juser free test 49131 1 1 -- 72:00 R 00:02<br>23.hpc.cluster juser free test 45735 1 1 -- 72:00 R 00:02<br>24.hpc.cluster juser free test 41430 1 1 -- 72:00 R 00:02<br>25.hpc.cluster juser free test 35248 1 1 -- 72:00 R 00:02<br>26.hpc.cluster juser free test 51087 1 1 -- 72:00 R 00:02<br>27.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>28.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>29.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>30.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>31.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>32.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>33.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>34.hpc.cluster juser free 
test -- 1 1 -- 72:00 Q -- <br>35.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>36.hpc.cluster tw tw STDIN 30505 6 384 -- 99:00 R -- <br><br>As user 'tw', I exit and run the command:<br><br>qsub -I -q tw -l nodes=6:ppn=62<br><br>Everything works again as expected, and Maui also starts 6 new 1-core jobs (jobs 27 through 32):<br><br>1.hpc.cluster. juser free test 24904 1 63 -- 72:00 R 00:03<br>2.hpc.cluster. juser free test 29346 1 63 -- 72:00 S 00:01<br>3.hpc.cluster. juser free test 42900 1 63 -- 72:00 S 00:01<br>4.hpc.cluster. juser free test 30291 1 63 -- 72:00 S 00:01<br>5.hpc.cluster. juser free test 26417 1 63 -- 72:00 S 00:01<br>6.hpc.cluster. juser free test 40206 1 63 -- 72:00 S 00:02<br>7.hpc.cluster. juser free test 1786 1 63 -- 72:00 S 00:02<br>8.hpc.cluster. juser free test 62436 1 63 -- 72:00 R 00:03<br>9.hpc.cluster. juser free test 49087 1 63 -- 72:00 R 00:03<br>10.hpc.cluster juser free test 45691 1 63 -- 72:00 R 00:03<br>11.hpc.cluster juser free test 41386 1 63 -- 72:00 R 00:03<br>12.hpc.cluster juser free test 35204 1 63 -- 72:00 R 00:03<br>13.hpc.cluster juser free test 51043 1 63 -- 72:00 R 00:03<br>14.hpc.cluster juser free test 24948 1 1 -- 72:00 R 00:03<br>15.hpc.cluster juser free test 29390 1 1 -- 72:00 R 00:02<br>16.hpc.cluster juser free test 42944 1 1 -- 72:00 R 00:02<br>17.hpc.cluster juser free test 30335 1 1 -- 72:00 R 00:02<br>18.hpc.cluster juser free test 26461 1 1 -- 72:00 R 00:02<br>19.hpc.cluster juser free test 40250 1 1 -- 72:00 R 00:02<br>20.hpc.cluster juser free test 1830 1 1 -- 72:00 R 00:02<br>21.hpc.cluster juser free test 62480 1 1 -- 72:00 R 00:03<br>22.hpc.cluster juser free test 49131 1 1 -- 72:00 R 00:03<br>23.hpc.cluster juser free test 45735 1 1 -- 72:00 R 00:03<br>24.hpc.cluster juser free test 41430 1 1 -- 72:00 R 00:03<br>25.hpc.cluster juser free test 35248 1 1 -- 72:00 R 00:03<br>26.hpc.cluster juser free test 51087 1 1 -- 72:00 R 00:03<br>27.hpc.cluster juser free test 30749 1 1 -- 72:00 R -- 
<br>28.hpc.cluster juser free test 44220 1 1 -- 72:00 R -- <br>29.hpc.cluster juser free test 31513 1 1 -- 72:00 R -- <br>30.hpc.cluster juser free test 27736 1 1 -- 72:00 R -- <br>31.hpc.cluster juser free test 41429 1 1 -- 72:00 R -- <br>32.hpc.cluster juser free test 3130 1 1 -- 72:00 R -- <br>33.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>34.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>35.hpc.cluster juser free test -- 1 1 -- 72:00 Q -- <br>37.hpc.cluster tw tw STDIN 30708 6 372 -- 99:00 R -- <br><br>However, if I now exit and go back and try to get 6 of the 64-core nodes (which worked before) I cannot. Maui will not preempt the new jobs it started.<br><br>My new job 38 below just sits in the queue:<br><br>$ qsub -I -q tw -l nodes=6:ppn=64<br>qsub: waiting for job 38.hpc.cluster to start<br><br># diagnose -p <br>diagnosing job priority information (partition: ALL)<br><br>Job PRIORITY* Cred( QOS) Serv(QTime)<br> Weights -------- 100( 1000) 1( 1)<br><br>38 100000109 100.0(1000.) 
0.0(109.4)<br>2 5 0.0( 0.0) 100.0( 5.3)<br>3 5 0.0( 0.0) 100.0( 5.3)<br>4 5 0.0( 0.0) 100.0( 5.3)<br>5 5 0.0( 0.0) 100.0( 5.3)<br>6 5 0.0( 0.0) 100.0( 5.3)<br>7 5 0.0( 0.0) 100.0( 5.3)<br><br>Percent Contribution -------- 100.0(100.0) 0.0( 0.0)<br><br>[root@mpc-x maui]# checkjob -v 38<br><br><br>checking job 38 (RM job '38.hpc.cluster')<br><br>State: Idle<br>Creds: user:tw group:tw class:tw qos:high<br>WallTime: 00:00:00 of 4:03:00:00<br>SubmitTime: Thu Feb 16 08:26:31<br> (Time Queued Total: 00:01:37 Eligible: 00:01:37)<br><br>Total Tasks: 384<br><br>Req[0] TaskCount: 384 Partition: ALL<br>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0<br>Opsys: [NONE] Arch: [NONE] Features: [tw]<br>Exec: '' ExecSize: 0 ImageSize: 0<br>Dedicated Resources Per Task: PROCS: 1<br>NodeAccess: SHARED<br>TasksPerNode: 64 NodeCount: 6<br><br><br>IWD: [NONE] Executable: [NONE]<br>Bypass: 0 StartCount: 0<br>PartitionMask: [ALL]<br>Flags: PREEMPTOR<br><br>Reservation '38' (2:23:58:22 -> 7:02:58:22 Duration: 4:03:00:00)<br>PE: 384.00 StartPriority: 100000163<br>job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 384 procs found)<br>idle procs: 384 feasible procs: 0<br><br>Rejection Reasons: [Features : 7][CPU : 6]<br><br>Detailed Node Availability Information:<br><br>compute-1-1 rejected : Features<br>compute-1-2 rejected : Features<br>compute-1-3 rejected : Features<br>compute-1-4 rejected : Features<br>compute-1-5 rejected : Features<br>compute-1-6 rejected : Features<br>compute-1-7 rejected : CPU<br>compute-1-8 rejected : CPU<br>compute-1-9 rejected : CPU<br>compute-1-10 rejected : CPU<br>compute-1-11 rejected : CPU<br>compute-1-12 rejected : CPU<br>compute-1-13 rejected : Features<br><br>-------------------------------------------------------<br>Here is my PBS nodes file:<br><br># cat /opt/torque/server_priv/nodes <br>compute-1-1 np=64 sf free<br>compute-1-2 np=64 sf free<br>compute-1-3 np=64 sf free<br>compute-1-4 np=64 chem free<br>compute-1-5 np=64 chem 
free<br>compute-1-6 np=64 chem free<br>compute-1-7 np=64 tw free<br>compute-1-8 np=64 tw free<br>compute-1-9 np=64 tw free<br>compute-1-10 np=64 tw free<br>compute-1-11 np=64 tw free<br>compute-1-12 np=64 tw free<br>compute-1-13 np=64 bio free<br><br>------------------------------------<br><br></span><br>Edsall, William (WJ) wrote: <o:p></o:p></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Hi,</span><o:p></o:p></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>What does diagnose -p say about the priority of the jobs you expect to be preempted? Priority may take precedence over preemptability. </span><o:p></o:p></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'> </span><o:p></o:p></p><div><div style='border:none;border-top:solid windowtext 1.0pt;padding:3.0pt 0in 0in 0in;border-color:-moz-use-text-color -moz-use-text-color'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif";color:windowtext'> <a href="mailto:mauiusers-bounces@supercluster.org">mauiusers-bounces@supercluster.org</a> [<a href="mailto:mauiusers-bounces@supercluster.org">mailto:mauiusers-bounces@supercluster.org</a>] <b>On Behalf Of </b>Joseph Farran<br><b>Sent:</b> Monday, February 13, 2012 3:19 PM<br><b>To:</b> <a href="mailto:mauiusers@supercluster.org">mauiusers@supercluster.org</a><br><b>Subject:</b> [Mauiusers] Backfill Jobs prevent PREEMPTOR jobs from running?</span><o:p></o:p></p></div></div><p class=MsoNormal style='margin-bottom:12.0pt'> <span style='font-size:13.5pt'><br><br></span><o:p></o:p></p></div></body></html>