[Mauiusers] problems with SLURM --cpus-per-task
Josh England
josh at tgsmc.com
Fri Mar 14 14:44:13 MDT 2008
Answering my own post. This problem is resolved with slurm-1.2.24.
-JE
On Fri, 2008-03-14 at 11:27 -0700, Josh England wrote:
> Hi,
>
> I'm running slurm-1.2.22 along with maui-3.2.6p19 on a small test
> cluster. I'm able to run jobs normally through srun and sbatch using
> slurm's default 'backfill' scheduler. I'm also able to run jobs using
> maui as the scheduler. However, when I try to run 'srun
> --cpus-per-task 2 hostname' the node allocation fails and the job never
> runs. It looks like some strange interaction is going on where maui is
> requesting invalid resources from slurm when --cpus-per-task is used
> (which works fine without maui). Has anyone seen this before, or does
> anyone know of a way to fix it?
>
>
> slurm logs show:
> Mar 14 11:13:57 ladmin1 slurmctld[27145]: _slurm_rpc_allocate_resources
> JobId=179 NodeList=(null) usec=26
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: _pick_best_nodes 179 : job
> never runnable
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: schedule: JobId=179
> non-runnable: Requested node configuration is not available
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: Could not start
> job 179(lx10): Invalid request, job aborted
> Mar 14 11:14:03 ladmin1 slurmctld[27145]: error: wiki: start_job(179)
> job missing
>
>
>
> maui logs show:
> 03/14 11:14:03 MJobSetCreds(179,root,root,)
> 03/14 11:14:03 INFO: default QOS for job 179 set to DEFAULT(0)
> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> 03/14 11:14:03 INFO: default QOS for job 179 set to DEFAULT(0)
> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> 03/14 11:14:03 INFO: job '179' loaded: 1 root root
> 31536000 Idle 0 1205518437 [NONE] [NONE] [NONE] >= 1 >=
> 1 [NONE] 1205518437
> 03/14 11:14:03 INFO: 5 WIKI jobs detected on RM ladmin1
> 03/14 11:14:03 INFO: jobs detected: 5
> 03/14 11:14:03 MStatClearUsage(node,Active)
> 03/14 11:14:03 MClusterUpdateNodeState()
> 03/14 11:14:03 INFO: requeue value 126104855.00 found for immediate
> action (T: 00:00:00)
> 03/14 11:14:03 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
> 03/14 11:14:03 INFO: job '153' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '154' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '155' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '156' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '179' Priority: 1
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 MStatClearUsage([NONE],Active)
> 03/14 11:14:03 INFO: total jobs selected (ALL): 1/5 [State: 4]
> 03/14 11:14:03 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
> 03/14 11:14:03 INFO: job '153' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '154' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '155' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '156' Priority: 7
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 7(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 INFO: job '179' Priority: 1
> 03/14 11:14:03 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us:
> 0(00.0)
> 03/14 11:14:03 MStatClearUsage([NONE],Idle)
> 03/14 11:14:03 INFO: total jobs selected (ALL): 1/5 [State: 4]
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
> 03/14 11:14:03 INFO: total jobs selected in partition ALL: 1/1
> 03/14 11:14:03 MQueueScheduleRJobs(Q)
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
> 03/14 11:14:03 INFO: total jobs selected in partition ALL: 1/1
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,ALL,FReason,TRUE)
> 03/14 11:14:03 INFO: job 179 not considered for spanning
> 03/14 11:14:03 INFO: total jobs selected in partition ALL: 0/1
> [PartitionAccess: 1]
> 03/14 11:14:03
> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,test,FReason,TRUE)
> 03/14 11:14:03 INFO: total jobs selected in partition test: 1/1
> 03/14 11:14:03 MQueueScheduleIJobs(Q,test)
> 03/14 11:14:03 INFO: 20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO: tasks located for job 179: 1 of 1 required (18
> feasible)
> 03/14 11:14:03 MJobStart(179)
> 03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
> 03/14 11:14:03 MAMAllocJReserve(179,RIndex,ErrMsg)
> 03/14 11:14:03 MRMJobStart(179,Msg,SC)
> 03/14 11:14:03 MWikiJobStart(179,ladmin1,Msg,SC)
> 03/14 11:14:03 MWikiDoCommand(ladmin1,7321,9000000,NONE,CMD=STARTJOB
> ARG=179 TASKLIST=lx10,Data,DataSize,SC)
> 03/14 11:14:03 MSUSendData(S,9000000,FALSE,FALSE)
> 03/14 11:14:03 INFO: packet sent (43 bytes of 43)
> 03/14 11:14:03 INFO: command sent to server
> 03/14 11:14:03 INFO: message sent: 'CMD=STARTJOB ARG=179
> TASKLIST=lx10'
> 03/14 11:14:03 MSURecvData(S,9000000,FALSE,SC,EMsg)
> 03/14 11:14:03 MSURecvPacket(8,BufP,9,NULL,9000000,SC)
> 03/14 11:14:03 MSURecvPacket(8,BufP,176,NULL,9000000,SC)
> 03/14 11:14:03 MSUDisconnect(S)
> 03/14 11:14:03 ERROR: command 'CMD=STARTJOB ARG=179 TASKLIST=lx10'
> SC: -910 response: 'NONE'
> 03/14 11:14:03 ALERT: cannot start job '179' on WIKI RM on 1 procs
> (command failure)
> 03/14 11:14:03 ALERT: cannot start job 179 (RM 'ladmin1' failed in
> function 'jobstart')
> 03/14 11:14:03 WARNING: cannot start job '179' through resource manager
> 03/14 11:14:03 ALERT: job '179' deferred after 1 failed start
> attempts (API failure on last attempt)
> 03/14 11:14:03 MJobSetHold(179,16,1:00:00,RMFailure,)
> 03/14 11:14:03 ALERT: job '179' cannot run (deferring job for 3600
> seconds)
> 03/14 11:14:03 MSysRegEvent(JOBDEFER: defer hold placed on job '179'.
> reason: 'RMFailure',0,0,1)
> 03/14 11:14:03 MSysLaunchAction(ASList,1)
> 03/14 11:14:03 ERROR: cannot start job '179' in partition test
> 03/14 11:14:03 MJobPReserve(179,test,ResCount,ResCountRej)
> 03/14 11:14:03 MJobReserve(179,Priority)
> 03/14 11:14:03 INFO: 20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO: 20 feasible tasks found for job 179:0 in
> partition test (1 Needed)
> 03/14 11:14:03 INFO: located resources for 1 tasks (18) in best
> partition test for job 179 at time 00:00:01
> 03/14 11:14:03 INFO: tasks located for job 179: 1 of 1 required (18
> feasible)
> 03/14 11:14:03 MJobDistributeTasks(179,ladmin1,NodeList,TaskMap)
> 03/14 11:14:03 MResJCreate(179,MNodeList,00:00:01,Priority,Res)
> 03/14 11:14:03 INFO: job '179' reserved 1 tasks (partition test) to
> start in 00:00:01 on Fri Mar 14 11:14:04
>
>
> -JE
>
>
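For anyone debugging a similar failure, it can help to pull apart the Wiki command that maui sent to slurm and compare it with what the job asked for. The maui log above shows the message 'CMD=STARTJOB ARG=179 TASKLIST=lx10', and maui reports trying to start the job "on 1 procs". The sketch below is only an illustration: parse_wiki_command is a hypothetical helper, not part of maui or slurm, and it assumes the message is a space-separated list of KEY=VALUE tokens and that TASKLIST hosts are separated by ':'.

```python
def parse_wiki_command(message: str) -> dict:
    """Split a space-separated 'KEY=VALUE' message into a dict.

    Hypothetical helper for reading log output; the token layout is
    assumed from the single message shown in the maui log.
    """
    fields = {}
    for token in message.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"token without '=': {token!r}")
        fields[key] = value
    return fields


# Message copied from the maui log in this thread.
msg = "CMD=STARTJOB ARG=179 TASKLIST=lx10"
cmd = parse_wiki_command(msg)

# Assumption: multiple hosts in TASKLIST would be ':'-separated.
hosts = cmd["TASKLIST"].split(":")

print(cmd["CMD"], cmd["ARG"], hosts)
```

Here the task list names a single host entry (lx10) for job 179, which is the request that slurmctld rejected with "Requested node configuration is not available".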