[Mauiusers] problems with SLURM --cpus-per-task

Josh England josh at tgsmc.com
Fri Mar 14 12:27:05 MDT 2008


Hi,

I'm running slurm-1.2.22 along with maui-3.2.6p19 on a small test
cluster.  I'm able to run jobs normally through srun and sbatch using
slurm's default 'backfill' scheduler.  I'm also able to run jobs using
maui as the scheduler.  However, when I try to run 'srun
--cpus-per-task 2 hostname' with maui as the scheduler, the node
allocation fails and the job never runs.  It looks like some strange
interaction is going on where maui is requesting invalid resources from
slurm when --cpus-per-task is used (the same command works fine without
maui).  Has anyone seen this before, or does anyone know of a way to
fix it?


slurm logs show:
Mar 14 11:13:57 ladmin1 slurmctld[27145]: _slurm_rpc_allocate_resources
JobId=179 NodeList=(null) usec=26
Mar 14 11:14:03 ladmin1 slurmctld[27145]: _pick_best_nodes 179 : job
never runnable
Mar 14 11:14:03 ladmin1 slurmctld[27145]: schedule: JobId=179
non-runnable: Requested node configuration is not available
Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: Could not start
job 179(lx10): Invalid request, job aborted
Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: start_job(179)
job missing



maui logs show:
03/14 11:14:03 MJobSetCreds(179,root,root,)
03/14 11:14:03 INFO:     default QOS for job 179 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
03/14 11:14:03 INFO:     default QOS for job 179 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
03/14 11:14:03 INFO:     job '179' loaded:   1     root     root
31536000       Idle   0 1205518437   [NONE] [NONE] [NONE] >=      1 >=
1 [NONE] 1205518437
03/14 11:14:03 INFO:     5 WIKI jobs detected on RM ladmin1
03/14 11:14:03 INFO:     jobs detected: 5
03/14 11:14:03 MStatClearUsage(node,Active)
03/14 11:14:03 MClusterUpdateNodeState()
03/14 11:14:03 INFO:     requeue value 126104855.00 found for immediate
action (T: 00:00:00)
03/14 11:14:03 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
03/14 11:14:03 INFO:     job '153' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '154' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '155' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '156' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '179' Priority:        1
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 MStatClearUsage([NONE],Active)
03/14 11:14:03 INFO:     total jobs selected (ALL): 1/5 [State: 4]
03/14 11:14:03 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
03/14 11:14:03 INFO:     job '153' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '154' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '155' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '156' Priority:        7
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 INFO:     job '179' Priority:        1
03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
03/14 11:14:03 MStatClearUsage([NONE],Idle)
03/14 11:14:03 INFO:     total jobs selected (ALL): 1/5 [State: 4]
03/14 11:14:03
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
03/14 11:14:03 INFO:     total jobs selected in partition ALL: 1/1 
03/14 11:14:03 MQueueScheduleRJobs(Q)
03/14 11:14:03
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
03/14 11:14:03 INFO:     total jobs selected in partition ALL: 1/1 
03/14 11:14:03
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,ALL,FReason,TRUE)
03/14 11:14:03 INFO:     job 179 not considered for spanning
03/14 11:14:03 INFO:     total jobs selected in partition ALL: 0/1
[PartitionAccess: 1]
03/14 11:14:03
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,test,FReason,TRUE)
03/14 11:14:03 INFO:     total jobs selected in partition test: 1/1 
03/14 11:14:03 MQueueScheduleIJobs(Q,test)
03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
partition test (1 Needed)
03/14 11:14:03 INFO:     tasks located for job 179:  1 of 1 required (18
feasible)
03/14 11:14:03 MJobStart(179)
03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
03/14 11:14:03 MAMAllocJReserve(179,RIndex,ErrMsg)
03/14 11:14:03 MRMJobStart(179,Msg,SC)
03/14 11:14:03 MWikiJobStart(179,ladmin1,Msg,SC)
03/14 11:14:03 MWikiDoCommand(ladmin1,7321,9000000,NONE,CMD=STARTJOB
ARG=179 TASKLIST=lx10,Data,DataSize,SC)
03/14 11:14:03 MSUSendData(S,9000000,FALSE,FALSE)
03/14 11:14:03 INFO:     packet sent (43 bytes of 43)
03/14 11:14:03 INFO:     command sent to server
03/14 11:14:03 INFO:     message sent: 'CMD=STARTJOB ARG=179
TASKLIST=lx10'
03/14 11:14:03 MSURecvData(S,9000000,FALSE,SC,EMsg)
03/14 11:14:03 MSURecvPacket(8,BufP,9,NULL,9000000,SC)
03/14 11:14:03 MSURecvPacket(8,BufP,176,NULL,9000000,SC)
03/14 11:14:03 MSUDisconnect(S)
03/14 11:14:03 ERROR:    command 'CMD=STARTJOB ARG=179 TASKLIST=lx10'
SC: -910  response: 'NONE'
03/14 11:14:03 ALERT:    cannot start job '179' on WIKI RM on 1 procs
(command failure)
03/14 11:14:03 ALERT:    cannot start job 179 (RM 'ladmin1' failed in
function 'jobstart')
03/14 11:14:03 WARNING:  cannot start job '179' through resource manager
03/14 11:14:03 ALERT:    job '179' deferred after 1 failed start
attempts (API failure on last attempt)
03/14 11:14:03 MJobSetHold(179,16,1:00:00,RMFailure,)
03/14 11:14:03 ALERT:    job '179' cannot run (deferring job for 3600
seconds)
03/14 11:14:03 MSysRegEvent(JOBDEFER:  defer hold placed on job '179'.
reason: 'RMFailure',0,0,1)
03/14 11:14:03 MSysLaunchAction(ASList,1)
03/14 11:14:03 ERROR:    cannot start job '179' in partition test
03/14 11:14:03 MJobPReserve(179,test,ResCount,ResCountRej)
03/14 11:14:03 MJobReserve(179,Priority)
03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
partition test (1 Needed)
03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
partition test (1 Needed)
03/14 11:14:03 INFO:     located resources for 1 tasks (18) in best
partition test for job 179 at time 00:00:01
03/14 11:14:03 INFO:     tasks located for job 179:  1 of 1 required (18
feasible)
03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
03/14 11:14:03 MResJCreate(179,MNodeList,00:00:01,Priority,Res)
03/14 11:14:03 INFO:     job '179' reserved 1 tasks (partition test) to
start in 00:00:01 on Fri Mar 14 11:14:04
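
To make it easier to see what maui actually asked slurm for, here is a
minimal sketch that pulls the job id and host list out of the STARTJOB
message shown in the maui log above.  The function name parse_startjob
is hypothetical, and the sketch assumes that TASKLIST holds one host
name per task, separated by ':'.  That format is an assumption, not
something confirmed by the logs.

```python
import re


def parse_startjob(message):
    """Split a 'CMD=STARTJOB ARG=<id> TASKLIST=<hosts>' string into its fields."""
    # Collect every KEY=VALUE pair found in the message.
    fields = dict(re.findall(r"(\w+)=(\S+)", message))
    if fields.get("CMD") != "STARTJOB":
        raise ValueError("not a STARTJOB message")
    # Assumption: TASKLIST is a ':'-separated list with one host per task.
    hosts = [h for h in fields.get("TASKLIST", "").split(":") if h]
    return {"job_id": fields["ARG"], "hosts": hosts}


# Message text copied from the maui log above.
msg = "CMD=STARTJOB ARG=179 TASKLIST=lx10"
info = parse_startjob(msg)
print(info["job_id"], len(info["hosts"]), info["hosts"])
```

For the message in the log this reports a single entry (lx10) for job
179, which matches the "on 1 procs" wording in the maui ALERT line.
Whether that single entry is the reason slurm rejects the request when
--cpus-per-task is 2 is still only a guess.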


-JE



