[Mauiusers] maui not scheduling when no resources available
Arnau Bria
arnaubria at pic.es
Thu Dec 13 05:05:06 MST 2007
Hi all,
we're using:
maui-3.2.6p19_20.snap.1182974819-4.slc3
maui-server-3.2.6p19_20.snap.1182974819-4.slc3
maui-client-3.2.6p19_20.snap.1182974819-4.slc3
torque-client-2.1.9-4cri.slc3
torque-server-2.1.9-4cri.slc3
torque-2.1.9-4cri.slc3
Last week we noticed strange behaviour in Maui, and now we are able to
reproduce it:
1.-) We submit a job requesting a special resource, for example a node
with Scientific Linux 3:
$ qsub -q slc3 job.sh
3445312.pbs01.pic.es
Our queue slc3 asks for that resource:
# qmgr -c "p s"|grep slc3
[...]
set queue slc3 resources_default.neednodes = slc3
[...]
2.-) We set the only node with that resource offline, so no WN fits the
job:
# pbsnodes td248.pic.es
td248.pic.es
state = offline
np = 10
properties = slc3
ntype = cluster
[...]
3.-) Our job goes to the first position of our queue and Maui sees that
it cannot find a WN for it:
[...]
3440629 nsidro Running 1 3:00:00:00 Thu Dec 13 12:54:10
196 Active Jobs 196 of 249 Processors Active (78.71%)
57 of 62 Nodes Active (91.94%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
3445312 arnaubria Idle 1 3:00:00:00 Thu Dec 13 12:30:43
3440631 nsidro Idle 1 3:00:00:00 Wed Dec 12 20:58:15
[...]
# checkjob 3445312
checking job 3445312
State: Idle
Creds: user:arnaubria group:grid class:slc3 qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Dec 13 12:30:43
(Time Queued Total: 00:25:29 Eligible: 00:24:28)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc3]
IWD: [NONE] Executable: [NONE]
Bypass: 10 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
PE: 1.00 StartPriority: 1000001000 SystemPriority: 1000
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 151 feasible procs: 0
Rejection Reasons: [Features : 64][State : 1]
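
The last lines of the checkjob output above are what identify the job as unrunnable: 151 idle procs, 0 feasible procs, and a per-reason rejection tally. A minimal sketch of how those fields could be pulled out of the text follows. It assumes the output has the same wording as the excerpt above; the function names are hypothetical and are not part of Maui.

```python
import re

def feasibility(checkjob_text):
    """Return (idle_procs, feasible_procs) from checkjob text, or None if absent."""
    m = re.search(r"idle procs:\s*(\d+)\s+feasible procs:\s*(\d+)", checkjob_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def rejection_reasons(checkjob_text):
    """Return the 'Rejection Reasons' tally as a dict, e.g. {'Features': 64}."""
    line = re.search(r"Rejection Reasons:(.*)", checkjob_text)
    if line is None:
        return {}
    pairs = re.findall(r"\[([^:\]]+):\s*(\d+)\]", line.group(1))
    return {name.strip(): int(count) for name, count in pairs}

# Text copied from the checkjob output of job 3445312 shown above.
sample = """\
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 151 feasible procs: 0
Rejection Reasons: [Features : 64][State : 1]
"""

print(feasibility(sample))
print(rejection_reasons(sample))
```

A job whose feasible count is 0 while idle procs are available, as here, is one that no currently usable node can satisfy.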
4.-) Maui does not schedule any other job, so the farm empties.
checkjob output for the second job in the queue:
# checkjob 3440631
checking job 3440631
State: Idle
Creds: user:nsidro group:magic class:long qos:lhmagic
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Wed Dec 12 20:58:15
(Time Queued Total: 15:57:57 Eligible: 15:51:32)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc4]
IWD: [NONE] Executable: [NONE]
Bypass: 10 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
PE: 1.00 StartPriority: 89
job can run in partition DEFAULT (32 procs available. 1 procs required)
Progress of jobs running/queued in our farm:
224 2358
222 2359
215 2355
213 2365
194 2371
...
5.-) We bring the WN back online, and everything works fine again.
# pbsnodes -c td248.pic.es
Immediately after that, jobs start again:
222 2337
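
The running/queued samples above show the symptom numerically: the running count keeps falling while thousands of jobs stay queued. A small sketch of that check, assuming each sample is a (running, queued) pair in time order; `is_draining` is a hypothetical helper, not a Maui or TORQUE tool.

```python
def is_draining(samples):
    """True if running jobs never rise, drop overall, and a backlog is always queued."""
    if len(samples) < 2:
        return False
    running = [r for r, _ in samples]
    never_rises = all(b <= a for a, b in zip(running, running[1:]))
    dropped = running[-1] < running[0]
    backlog = all(q > 0 for _, q in samples)
    return never_rises and dropped and backlog

# Samples copied from the running/queued figures listed above.
observed = [(224, 2358), (222, 2359), (215, 2355), (213, 2365), (194, 2371)]

print(is_draining(observed))                   # farm emptying while the node is offline
print(is_draining(observed + [(222, 2337)]))   # after the node is brought back
```

With the figures from this report, the first call flags the drain and the second does not, because the running count rises again once the node is back.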
Something similar happened when requesting hosts with "slc3 && slc4":
no nodes fit that condition and Maui hung.
So, is it a bug? Is anyone else having the same problem? Any workaround?
Cheers,
Arnau