[Mauiusers] API failure with slurm
Josh England
josh at tgsmc.com
Fri May 23 10:57:53 MDT 2008
I'm testing maui-3.2.6p19 with slurm-1.3.2 and found a bug in our
specific use case.
I'm using slurm's cons_res plugin with CR_Core and sched/wiki, and I want
each job to use 3 cores per task, so I'm submitting with 'sbatch -c3
job.sh'. On an 8-core box, the first 2 jobs land on 1 node, but the 3rd
job ends up spanning 2 nodes (2 cores on one and 1 on another). Fine. So
I add a '-N' parameter to specify a maximum of 1 node: 'sbatch -c3 -N 1-1'.
This works fine with slurm alone, but maui does not seem to respect that
parameter at all. Relevant parts of the maui logs show:
...
05/23 09:20:43 INFO: job 1121 not considered for spanning
...
05/23 09:20:43 MWikiDoCommand(ladmin1,7321,9000000,NONE,CMD=STARTJOB
ARG=1121 TASKLIST=dn37:dn37:dn36,Data,DataSize,SC)
05/23 09:20:43 INFO: message sent: 'CMD=STARTJOB ARG=1121
TASKLIST=dn37:dn37:dn36'
05/23 09:20:43 ERROR: command 'CMD=STARTJOB ARG=1121
TASKLIST=dn37:dn37:dn36' SC: -914 response: 'NONE'
05/23 09:20:43 ALERT: cannot start job '1121' on WIKI RM on 3 procs
(command failure)
05/23 09:20:43 ALERT: cannot start job 1121 (RM 'ladmin1' failed in
function 'jobstart')
05/23 09:20:43 WARNING: cannot start job '1121' through resource
manager
05/23 09:20:43 ALERT: job '1121' deferred after 1 failed start
attempts (API failure on last attempt)
Maui is trying to allocate two nodes for the job even though I specified
only one, which is probably what leads to that API failure. I seem to
remember this working right on previous versions of slurm. Any ideas?
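
To make the mismatch concrete, here is a minimal Python sketch that reads the
TASKLIST from the STARTJOB command in the log above and counts how many
distinct nodes it names. It assumes each colon-separated TASKLIST entry stands
for one allocated processor, which matches the "3 procs" in the log. The
function names parse_tasklist and violates_node_limit are hypothetical helpers
for illustration, not part of maui or slurm.

```python
from collections import Counter


def parse_tasklist(command: str) -> Counter:
    """Return a tasks-per-node count from a 'CMD=STARTJOB ... TASKLIST=a:b:c' string.

    Assumption: fields are whitespace-separated and each TASKLIST entry
    is one processor on the named node.
    """
    prefix = "TASKLIST="
    for field in command.split():
        if field.startswith(prefix):
            hosts = field[len(prefix):].split(":")
            return Counter(h for h in hosts if h)
    raise ValueError("no TASKLIST field in command")


def violates_node_limit(command: str, max_nodes: int) -> bool:
    """True if the task list names more distinct nodes than max_nodes allows."""
    return len(parse_tasklist(command)) > max_nodes


# The command maui sent for job 1121, copied from the log.
cmd = "CMD=STARTJOB ARG=1121 TASKLIST=dn37:dn37:dn36"

tasks = parse_tasklist(cmd)
print(dict(tasks))                            # {'dn37': 2, 'dn36': 1}
print(violates_node_limit(cmd, max_nodes=1))  # True
```

With '-N 1-1' the job may use at most one node, so a task list naming both
dn37 and dn36 cannot satisfy the request.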
-JE