[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free

Paul Raines raines at nmr.mgh.harvard.edu
Thu Jul 12 08:48:47 MDT 2012


And another followup: newly submitted jobs are now getting deferred because
they are being assigned to the nodes that the jobs I 'qrun'ed are running on:

===========================================================
[root@launchpad ~]# checkjob 1850


checking job 1850

State: Idle  EState: Deferred
Creds:  user:lzollei  group:lzollei  class:default  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Thu Jul 12 10:37:49
   (Time Queued  Total: 00:08:40  Eligible: 00:00:01)

StartDate: -00:08:38  Thu Jul 12 10:37:51
Total Tasks: 5

Req[0]  TaskCount: 5  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 
15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-16 MSG=cannot 
allocate node 'compute-0-16' to job - node not currently available (nps 
needed/free: 5/4, gpus needed/free: 0/0, joblist: 
1713.launchpad.nmr.mgh.harvard.edu:0,1713.launchpad.nmr.mgh.harvard.edu:1,1713.launchpad.nmr.mgh.harvard.edu:2,1713.launchpad.nmr.mgh.harvard.edu:3)')
Holds:    Defer  (hold reason:  RMFailure)
PE:  5.00  StartPriority:  101000
cannot select job 1850 for partition DEFAULT (job hold active)

===================================================

It is like maui is not getting the memo about where jobs are actually
running, and therefore which nodes are free.
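
If anyone wants to compare the two daemons' views, these are the obvious
places to look (standard torque and maui client commands; the node name is
just the one from the error above):

  pbsnodes compute-0-16      # what pbs_server/pbs_mom report for the node
  checknode compute-0-16     # what maui thinks is allocated on that node
  diagnose -n                # maui's summary of all nodes and their usage

Presumably, once the real cause is found, the deferred jobs can be retried
with something like 'releasehold -a 1850' rather than resubmitting them.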

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Thu, 12 Jul 2012 10:45am, Paul Raines wrote:

> As a followup: after running qrun on a job to get it to run on another node,
> maui still seems confused and thinks the job is still allocated to
> compute-0-6, as this output shows:
>
> [root@launchpad ~]# checkjob 1713
>
>
> checking job 1713
>
> State: Running
> Creds:  user:award  group:award  class:p30  qos:DEFAULT
> WallTime: 00:06:16 of 4:00:00:00
> SubmitTime: Thu Jul 12 09:38:19
>  (Time Queued  Total: 00:57:50  Eligible: 00:00:00)
>
> StartTime: Thu Jul 12 10:36:09
> StartDate: -1:03:59  Thu Jul 12 09:38:20
> Total Tasks: 4
>
> Req[0]  TaskCount: 5  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]
> NodeCount: 2
> Allocated Nodes:
> [compute-0-6:4][compute-0-16:1]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Reservation '1713' (-00:06:10 -> 3:23:53:50  Duration: 4:00:00:00)
> Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource 
> temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node 
> 'compute-0-6' to job - node not currently available (nps needed/free: 4/3, 
> gpus needed/free: 0/0, joblist: 
> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)'
> PE:  5.00  StartPriority:  103003
>
> [root@launchpad ~]# qstat -n 1713
>
> launchpad.nmr.mgh.harvard.edu:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 1713.launchpad.n     award    p30      pbsjob_1420       10808     1   4    --  96:00 R 00:05
>    compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0
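>
> To see the disagreement side by side, the quickest check is probably just:
>
>   qstat -f 1713 | grep exec_host        # pbs_server's record of where it runs
>   checkjob 1713 | grep -A1 Allocated    # maui's list of allocated nodes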
>
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Thu, 12 Jul 2012 10:39am, Paul Raines wrote:
>
>> 
>> I just did a total reinstall of our batch cluster, upgrading all nodes
>> to CentOS6 and updating to torque-2.5.11 and maui-3.3.1.
>> 
>> I have over 100 nodes and only a few jobs submitted so far, but somehow
>> jobs are getting Deferred after being assigned to nodes that already have
>> jobs running on them, even though plenty of empty free nodes exist.
>> 
>> ==========================================================
>> checking job 1710
>> 
>> State: Idle  EState: Deferred
>> Creds:  user:award  group:award  class:p30  qos:DEFAULT
>> WallTime: 00:00:00 of 4:00:00:00
>> SubmitTime: Thu Jul 12 09:38:18
>>  (Time Queued  Total: 00:50:31  Eligible: 00:00:00)
>> 
>> StartDate: -00:50:30  Thu Jul 12 09:38:19
>> Total Tasks: 4
>> 
>> Req[0]  TaskCount: 4  Partition: ALL
>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]
>> 
>> 
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 1
>> PartitionMask: [ALL]
>> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 
>> 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 
>> MSG=cannot allocate node 'compute-0-6' to job - node not currently 
>> available (nps needed/free: 4/3, gpus needed/free: 0/0, joblist: 
>> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
>> Holds:    Defer  (hold reason:  RMFailure)
>> PE:  4.00  StartPriority:  103050
>> cannot select job 1710 for partition DEFAULT (job hold active)
>> ==========================================================
>> 
>> [root@launchpad ~]# pbsnodes -a compute-0-6
>> compute-0-6
>>     state = job-exclusive
>>     np = 8
>>     properties = nonGPU
>>     ntype = cluster
>>     jobs = 0/1021.launchpad.nmr.mgh.harvard.edu, 
>> 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, 
>> 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, 
>> 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, 
>> 7/1806.launchpad.nmr.mgh.harvard.edu
>>     status = 
>> rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu 
>> 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu 
>> 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122 
>> 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 
>> #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
>>     gpus = 0
>> 
>> ==========================================================
>> 
>> All these Deferred jobs are trying to run on compute-0-6:
>> 
>> ====================================================
>> BLOCKED JOBS----------------
>> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>> 
>> 1710                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:18
>> 1714                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:21
>> 1715                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:22
>> 1716                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:24
>> 1717                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:25
>> 1718                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:27
>> 1726                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 09:40:46
>> 1761                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 09:57:36
>> 1764                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:58:54
>> 1777                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:04:18
>> 1779                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 10:04:36
>> 1784                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:07:39
>> 1791                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:11:00
>> 1803                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:17:43
>> 1814                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:21:04
>> ====================================================
>> 
>> Some jobs we submit still get run on other nodes just fine.  It seems
>> random which jobs get assigned to compute-0-6 and then deferred.
>> 
>> There are lots of identically configured nodes free.  I can force these
>> jobs to run on other nodes by hand with qrun, but what is going on?
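>> 
>> (By forcing with qrun I mean roughly this, naming a free node explicitly,
>> e.g. for job 1710:
>> 
>>   qrun -H <some-free-node> 1710
>> 
>> and the job then starts on that node without complaint.)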
>> 
>> Here is my maui config, which worked fine in my older setup:
>> ==========================================================
>> RMPOLLINTERVAL		00:00:30
>> SERVERHOST		launchpad.nmr.mgh.harvard.edu
>> SERVERPORT		40559
>> SERVERMODE		NORMAL
>> ADMINHOST		launchpad.nmr.mgh.harvard.edu
>> RMCFG[base]		TYPE=PBS
>> ADMIN1                maui root
>> ADMIN3                ALL
>> LOGFILE               /var/spool/maui/log/maui.log
>> LOGFILEMAXSIZE        1000000000
>> LOGLEVEL              3
>> QUEUETIMEWEIGHT       1
>> CLASSWEIGHT           10
>> USERCFG[DEFAULT] MAXIPROC=8
>> CLASSCFG[default] MAXPROCPERUSER=150
>> CLASSCFG[matlab] MAXPROCPERUSER=60
>> CLASSCFG[max10] MAXPROCPERUSER=10
>> CLASSCFG[max20] MAXPROCPERUSER=20
>> CLASSCFG[max50] MAXPROCPERUSER=50
>> CLASSCFG[max75] MAXPROCPERUSER=75
>> CLASSCFG[max100] MAXPROCPERUSER=100
>> CLASSCFG[max200] MAXPROCPERUSER=200
>> CLASSCFG[p5] MAXPROCPERUSER=5000
>> CLASSCFG[p10] MAXPROCPERUSER=5000
>> CLASSCFG[p20] MAXPROCPERUSER=5000
>> CLASSCFG[p30] MAXPROCPERUSER=5000
>> CLASSCFG[p40] MAXPROCPERUSER=5000
>> CLASSCFG[p50] MAXPROCPERUSER=30
>> CLASSCFG[p60] MAXPROCPERUSER=20
>> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
>> CLASSCFG[GPU] MAXPROCPERUSER=5000
>> BACKFILLPOLICY        FIRSTFIT
>> RESERVATIONPOLICY     CURRENTHIGHEST
>> NODEALLOCATIONPOLICY  PRIORITY
>> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>> ENFORCERESOURCELIMITS   OFF
>> ENABLEMULTIREQJOBS TRUE
>> ====================================================
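>> 
>> (For what it's worth, if I read my own NODECFG line right, the allocation
>> priority is computed per node as PRIORITY + 3 * JOBCOUNT, so a node already
>> running 4 jobs scores 1000 + 3*4 = 1012 versus 1000 for an idle node, i.e.
>> the policy intentionally packs jobs onto busy nodes.  Spreading them out
>> instead would be something like
>> 
>>   NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY - 3 * JOBCOUNT'
>> 
>> but the packing is intended; the real problem is that maui seems to think
>> the packed node still has free slots.)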
>> 
>> There is nothing in the queue configs that would favor any nodes over
>> any other.
>> 
>> ---------------------------------------------------------------
>> Paul Raines                     http://help.nmr.mgh.harvard.edu
>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>> 149 (2301) 13th Street     Charlestown, MA 02129	    USA
>> 
>> 
>> 
>> 
>




