[Mauiusers] problems with SLURM --cpus-per-task

Josh England josh at tgsmc.com
Fri Mar 14 14:44:13 MDT 2008


Answering my own post.  This problem is resolved with slurm-1.2.24.
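
For anyone checking whether they are hitting the same symptom, here is a
minimal sketch (not something I ran at the time) that pulls failed wiki
job starts out of slurmctld log lines like the ones quoted below.  The
function names and both regular expressions are my own assumptions,
inferred only from the log excerpt in this thread; other slurm versions
may word these messages differently.

```python
import re

# Syslog-style timestamp at the start of a record, e.g. "Mar 14 11:14:03 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")

# Message format assumed from the excerpt in this thread:
#   error: wiki: Could not start job 179(lx10): Invalid request, job aborted
FAILURE = re.compile(r"error: wiki: Could not start job (\d+)\(([^)]*)\): (.*)")


def unwrap(lines):
    """Rejoin log records that a mail client wrapped onto a second line."""
    records = []
    for line in lines:
        # Drop any "> " quoting prefix and surrounding whitespace.
        line = line.lstrip("> ").strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)          # a new record starts here
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records


def wiki_start_failures(lines):
    """Return (job_id, task_list, reason) for each failed wiki job start."""
    found = []
    for record in unwrap(lines):
        match = FAILURE.search(record)
        if match:
            found.append((int(match.group(1)), match.group(2), match.group(3)))
    return found


sample = [
    "> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: Could not start",
    "> job 179(lx10): Invalid request, job aborted",
    "> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: start_job(179)",
    "> job missing",
]
print(wiki_start_failures(sample))
```

Records are unwrapped first because the mail client split long log lines
in two, so a line-by-line match would miss the job ID.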

-JE

On Fri, 2008-03-14 at 11:27 -0700, Josh England wrote:
> Hi,
> 
> I'm running slurm-1.2.22 along with maui-3.2.6p19 on a small test
> cluster.  I'm able to run jobs normally through srun and sbatch using
> slurm's default 'backfill' scheduler.  I'm also able to run jobs using
> maui as the scheduler.  However, when I run 'srun --cpus-per-task 2
> hostname', the node allocation fails and the job never runs.  It looks
> like maui is requesting invalid resources from slurm when
> --cpus-per-task is used (the same command works fine without maui).
> Has anyone seen this before, or does anyone know of a way to fix it?
> 
> 
> slurm logs show:
> Mar 14 11:13:57 ladmin1 slurmctld[27145]: _slurm_rpc_allocate_resources
> JobId=179 NodeList=(null) usec=26
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: _pick_best_nodes 179 : job
> never runnable
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: schedule: JobId=179
> non-runnable: Requested node configuration is not available
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: Could not start
> job 179(lx10): Invalid request, job aborted
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: start_job(179)
> job missing
> 
> 
> 
> maui logs show:
> 03/14 11:14:03 MJobSetCreds(179,root,root,)
> 03/14 11:14:03 INFO:     default QOS for job 179 set to DEFAULT(0)
> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> 03/14 11:14:03 INFO:     default QOS for job 179 set to DEFAULT(0)
> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> 03/14 11:14:03 INFO:     job '179' loaded:   1     root     root
> 31536000       Idle   0 1205518437   [NONE] [NONE] [NONE] >=      1 >=
> 1 [NONE] 1205518437
> 03/14 11:14:03 INFO:     5 WIKI jobs detected on RM ladmin1
> 03/14 11:14:03 INFO:     jobs detected: 5
> 03/14 11:14:03 MStatClearUsage(node,Active)
> 03/14 11:14:03 MClusterUpdateNodeState()
> 03/14 11:14:03 INFO:     requeue value 126104855.00 found for immediate
> action (T: 00:00:00)
> 03/14 11:14:03 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
> 03/14 11:14:03 INFO:     job '153' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '154' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '155' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '156' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '179' Priority:        1
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 MStatClearUsage([NONE],Active)
> 03/14 11:14:03 INFO:     total jobs selected (ALL): 1/5 [State: 4]
> 03/14 11:14:03 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
> 03/14 11:14:03 INFO:     job '153' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '154' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '155' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '156' Priority:        7
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      7(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 INFO:     job '179' Priority:        1
> 03/14 11:14:03 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
> 0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
> 0(00.0)
> 03/14 11:14:03 MStatClearUsage([NONE],Idle)
> 03/14 11:14:03 INFO:     total jobs selected (ALL): 1/5 [State: 4]
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
> 03/14 11:14:03 INFO:     total jobs selected in partition ALL: 1/1 
> 03/14 11:14:03 MQueueScheduleRJobs(Q)
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
> 03/14 11:14:03 INFO:     total jobs selected in partition ALL: 1/1 
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,ALL,FReason,TRUE)
> 03/14 11:14:03 INFO:     job 179 not considered for spanning
> 03/14 11:14:03 INFO:     total jobs selected in partition ALL: 0/1
> [PartitionAccess: 1]
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,test,FReason,TRUE)
> 03/14 11:14:03 INFO:     total jobs selected in partition test: 1/1 
> 03/14 11:14:03 MQueueScheduleIJobs(Q,test)
> 03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO:     tasks located for job 179:  1 of 1 required (18
> feasible)
> 03/14 11:14:03 MJobStart(179)
> 03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
> 03/14 11:14:03 MAMAllocJReserve(179,RIndex,ErrMsg)
> 03/14 11:14:03 MRMJobStart(179,Msg,SC)
> 03/14 11:14:03 MWikiJobStart(179,ladmin1,Msg,SC)
> 03/14 11:14:03 MWikiDoCommand(ladmin1,7321,9000000,NONE,CMD=STARTJOB
> ARG=179 TASKLIST=lx10,Data,DataSize,SC)
> 03/14 11:14:03 MSUSendData(S,9000000,FALSE,FALSE)
> 03/14 11:14:03 INFO:     packet sent (43 bytes of 43)
> 03/14 11:14:03 INFO:     command sent to server
> 03/14 11:14:03 INFO:     message sent: 'CMD=STARTJOB ARG=179
> TASKLIST=lx10'
> 03/14 11:14:03 MSURecvData(S,9000000,FALSE,SC,EMsg)
> 03/14 11:14:03 MSURecvPacket(8,BufP,9,NULL,9000000,SC)
> 03/14 11:14:03 MSURecvPacket(8,BufP,176,NULL,9000000,SC)
> 03/14 11:14:03 MSUDisconnect(S)
> 03/14 11:14:03 ERROR:    command 'CMD=STARTJOB ARG=179 TASKLIST=lx10'
> SC: -910  response: 'NONE'
> 03/14 11:14:03 ALERT:    cannot start job '179' on WIKI RM on 1 procs
> (command failure)
> 03/14 11:14:03 ALERT:    cannot start job 179 (RM 'ladmin1' failed in
> function 'jobstart')
> 03/14 11:14:03 WARNING:  cannot start job '179' through resource manager
> 03/14 11:14:03 ALERT:    job '179' deferred after 1 failed start
> attempts (API failure on last attempt)
> 03/14 11:14:03 MJobSetHold(179,16,1:00:00,RMFailure,)
> 03/14 11:14:03 ALERT:    job '179' cannot run (deferring job for 3600
> seconds)
> 03/14 11:14:03 MSysRegEvent(JOBDEFER:  defer hold placed on job '179'.
> reason: 'RMFailure',0,0,1)
> 03/14 11:14:03 MSysLaunchAction(ASList,1)
> 03/14 11:14:03 ERROR:    cannot start job '179' in partition test
> 03/14 11:14:03 MJobPReserve(179,test,ResCount,ResCountRej)
> 03/14 11:14:03 MJobReserve(179,Priority)
> 03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO:     20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO:     located resources for 1 tasks (18) in best
> partition test for job 179 at time 00:00:01
> 03/14 11:14:03 INFO:     tasks located for job 179:  1 of 1 required (18
> feasible)
> 03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
> 03/14 11:14:03 MResJCreate(179,MNodeList,00:00:01,Priority,Res)
> 03/14 11:14:03 INFO:     job '179' reserved 1 tasks (partition test) to
> start in 00:00:01 on Fri Mar 14 11:14:04
> 
> 
> -JE
> 
> 


