[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Jul 12 08:48:47 MDT 2012
Another follow-up: newly submitted jobs are now being deferred because
Maui assigns them to the nodes where the jobs I started with 'qrun' are running.
===========================================================
[root at launchpad ~]# checkjob 1850
checking job 1850
State: Idle EState: Deferred
Creds: user:lzollei group:lzollei class:default qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Thu Jul 12 10:37:49
(Time Queued Total: 00:08:40 Eligible: 00:00:01)
StartDate: -00:08:38 Thu Jul 12 10:37:51
Total Tasks: 5
Req[0] TaskCount: 5 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-16 MSG=cannot
allocate node 'compute-0-16' to job - node not currently available (nps
needed/free: 5/4, gpus needed/free: 0/0, joblist:
1713.launchpad.nmr.mgh.harvard.edu:0,1713.launchpad.nmr.mgh.harvard.edu:1,1713.launchpad.nmr.mgh.harvard.edu:2,1713.launchpad.nmr.mgh.harvard.edu:3)')
Holds: Defer (hold reason: RMFailure)
PE: 5.00 StartPriority: 101000
cannot select job 1850 for partition DEFAULT (job hold active)
===================================================
It is as if Maui is not getting the memo about where jobs are actually
running, and therefore about which nodes are free.
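
One way to make that disagreement concrete is to compare, for a single job,
the "Allocated Nodes" field that checkjob prints against the exec host line
that "qstat -n" prints. What follows is only a minimal Python sketch, assuming
the two strings have already been copied out of those commands' output. The
function names are hypothetical helpers, not part of Torque or Maui.

```python
import re
from collections import Counter


def parse_maui_alloc(text):
    """Parse checkjob's 'Allocated Nodes' field, e.g. '[compute-0-6:4][compute-0-16:1]'.

    Returns a Counter mapping host name -> number of tasks Maui believes are there.
    """
    pairs = re.findall(r"\[([^\]:]+):(\d+)\]", text)
    return Counter({host: int(count) for host, count in pairs})


def parse_torque_exec_host(text):
    """Parse the exec host line of 'qstat -n', e.g. 'compute-0-16/3+compute-0-16/2'.

    Each 'host/slot' entry is one processor slot, so counting entries per host
    gives the number of slots Torque actually assigned on that host.
    """
    slots = [s for s in text.strip().split("+") if s]
    return Counter(slot.split("/")[0] for slot in slots)


def allocation_mismatch(maui_text, torque_text):
    """Return {host: (maui_count, torque_count)} for hosts where the two views differ."""
    maui = parse_maui_alloc(maui_text)
    torque = parse_torque_exec_host(torque_text)
    hosts = sorted(set(maui) | set(torque))
    # Counter returns 0 for hosts that appear in only one of the two views.
    return {h: (maui[h], torque[h]) for h in hosts if maui[h] != torque[h]}


if __name__ == "__main__":
    # Values copied from the checkjob and qstat -n output for job 1713 in this thread.
    maui_view = "[compute-0-6:4][compute-0-16:1]"
    torque_view = "compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0"
    for host, (m, t) in allocation_mismatch(maui_view, torque_view).items():
        print(f"{host}: maui={m} torque={t}")
```

With the job 1713 values quoted in this thread, the sketch reports that Maui
counts 4 tasks on compute-0-6 and 1 on compute-0-16, while Torque has all 4
slots on compute-0-16.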
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Thu, 12 Jul 2012 10:45am, Paul Raines wrote:
> As a follow-up: after using qrun to start a job on another node, Maui still
> seems confused and thinks the job is allocated to compute-0-6, as this
> output shows:
>
> [root at launchpad ~]# checkjob 1713
>
>
> checking job 1713
>
> State: Running
> Creds: user:award group:award class:p30 qos:DEFAULT
> WallTime: 00:06:16 of 4:00:00:00
> SubmitTime: Thu Jul 12 09:38:19
> (Time Queued Total: 00:57:50 Eligible: 00:00:00)
>
> StartTime: Thu Jul 12 10:36:09
> StartDate: -1:03:59 Thu Jul 12 09:38:20
> Total Tasks: 4
>
> Req[0] TaskCount: 5 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
> NodeCount: 2
> Allocated Nodes:
> [compute-0-6:4][compute-0-16:1]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Reservation '1713' (-00:06:10 -> 3:23:53:50 Duration: 4:00:00:00)
> Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource
> temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node
> 'compute-0-6' to job - node not currently available (nps needed/free: 4/3,
> gpus needed/free: 0/0, joblist:
> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)'
> PE: 5.00 StartPriority: 103003
>
> [root at launchpad ~]# qstat -n 1713
>
> launchpad.nmr.mgh.harvard.edu:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 1713.launchpad.n     award    p30      pbsjob_1420      10808  1     4   --     96:00 R 00:05
>    compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0
>
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Thu, 12 Jul 2012 10:39am, Paul Raines wrote:
>
>>
>> I just did a total reinstall on our batch cluster, upgrading all nodes
>> to CentOS6 and updating to torque-2.5.11 and maui-3.3.1.
>>
>> I have over 100 nodes and only a few jobs submitted so far, but
>> somehow jobs are being Deferred after being assigned to nodes that
>> already have jobs running on them, even though plenty of empty
>> free nodes exist.
>>
>> ==========================================================
>> checking job 1710
>>
>> State: Idle EState: Deferred
>> Creds: user:award group:award class:p30 qos:DEFAULT
>> WallTime: 00:00:00 of 4:00:00:00
>> SubmitTime: Thu Jul 12 09:38:18
>> (Time Queued Total: 00:50:31 Eligible: 00:00:00)
>>
>> StartDate: -00:50:30 Thu Jul 12 09:38:19
>> Total Tasks: 4
>>
>> Req[0] TaskCount: 4 Partition: ALL
>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>> Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
>>
>>
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 1
>> PartitionMask: [ALL]
>> job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
>> 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6
>> MSG=cannot allocate node 'compute-0-6' to job - node not currently
>> available (nps needed/free: 4/3, gpus needed/free: 0/0, joblist:
>> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
>> Holds: Defer (hold reason: RMFailure)
>> PE: 4.00 StartPriority: 103050
>> cannot select job 1710 for partition DEFAULT (job hold active)
>> ==========================================================
>>
>> [root at launchpad ~]# pbsnodes -a compute-0-6
>> compute-0-6
>> state = job-exclusive
>> np = 8
>> properties = nonGPU
>> ntype = cluster
>> jobs = 0/1021.launchpad.nmr.mgh.harvard.edu,
>> 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu,
>> 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu,
>> 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu,
>> 7/1806.launchpad.nmr.mgh.harvard.edu
>> status =
>> rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu
>> 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu
>> 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122
>> 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64
>> #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
>> gpus = 0
>>
>> ==========================================================
>>
>> All these Deferred jobs are trying to run on compute-0-6
>>
>> ====================================================
>> BLOCKED JOBS----------------
>> JOBNAME  USERNAME  STATE     PROC  WCLIMIT     QUEUETIME
>>
>> 1710     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:18
>> 1714     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:21
>> 1715     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:22
>> 1716     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:24
>> 1717     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:25
>> 1718     award     Deferred  4     4:00:00:00  Thu Jul 12 09:38:27
>> 1726     tyler     Deferred  1     4:00:00:00  Thu Jul 12 09:40:46
>> 1761     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 09:57:36
>> 1764     award     Deferred  4     4:00:00:00  Thu Jul 12 09:58:54
>> 1777     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 10:04:18
>> 1779     tyler     Deferred  1     4:00:00:00  Thu Jul 12 10:04:36
>> 1784     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 10:07:39
>> 1791     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 10:11:00
>> 1803     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 10:17:43
>> 1814     lzollei   Deferred  5     4:00:00:00  Thu Jul 12 10:21:04
>> ====================================================
>>
>> Some jobs we submit still get run on other nodes just fine. It seems
>> random what is getting assigned to compute-0-6 and then deferred.
>>
>> There are lots of identically configured nodes free. I can force these
>> jobs to run on other nodes by hand with qrun, but what is going on?
>>
>> Here is my maui config, which worked fine in my older setup:
>> ==========================================================
>> RMPOLLINTERVAL 00:00:30
>> SERVERHOST launchpad.nmr.mgh.harvard.edu
>> SERVERPORT 40559
>> SERVERMODE NORMAL
>> ADMINHOST launchpad.nmr.mgh.harvard.edu
>> RMCFG[base] TYPE=PBS
>> ADMIN1 maui root
>> ADMIN3 ALL
>> LOGFILE /var/spool/maui/log/maui.log
>> LOGFILEMAXSIZE 1000000000
>> LOGLEVEL 3
>> QUEUETIMEWEIGHT 1
>> CLASSWEIGHT 10
>> USERCFG[DEFAULT] MAXIPROC=8
>> CLASSCFG[default] MAXPROCPERUSER=150
>> CLASSCFG[matlab] MAXPROCPERUSER=60
>> CLASSCFG[max10] MAXPROCPERUSER=10
>> CLASSCFG[max20] MAXPROCPERUSER=20
>> CLASSCFG[max50] MAXPROCPERUSER=50
>> CLASSCFG[max75] MAXPROCPERUSER=75
>> CLASSCFG[max100] MAXPROCPERUSER=100
>> CLASSCFG[max200] MAXPROCPERUSER=200
>> CLASSCFG[p5] MAXPROCPERUSER=5000
>> CLASSCFG[p10] MAXPROCPERUSER=5000
>> CLASSCFG[p20] MAXPROCPERUSER=5000
>> CLASSCFG[p30] MAXPROCPERUSER=5000
>> CLASSCFG[p40] MAXPROCPERUSER=5000
>> CLASSCFG[p50] MAXPROCPERUSER=30
>> CLASSCFG[p60] MAXPROCPERUSER=20
>> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
>> CLASSCFG[GPU] MAXPROCPERUSER=5000
>> BACKFILLPOLICY FIRSTFIT
>> RESERVATIONPOLICY CURRENTHIGHEST
>> NODEALLOCATIONPOLICY PRIORITY
>> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>> ENFORCERESOURCELIMITS OFF
>> ENABLEMULTIREQJOBS TRUE
>> ====================================================
>>
>> There is nothing in the queue configs that would favor any nodes over
>> any other.
>>
>> ---------------------------------------------------------------
>> Paul Raines http://help.nmr.mgh.harvard.edu
>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>> 149 (2301) 13th Street Charlestown, MA 02129 USA
>>
>>
>>
>>
>