[Moabusers] Jobs getting held with odd (probably wrong) messages
Chris Samuel
csamuel at vpac.org
Sun Feb 25 16:39:33 MST 2007
Hi folks,
We've been seeing some odd things with both the 5.0 production releases
and the 5.1 betas on our IA32 Linux cluster.
We see jobs occasionally going into BatchHold with complaints that a valid job
has exceeded its walltime on various nodes, when in fact it still has days
(sometimes weeks) to go.
If we stop Moab, delete the .moab.ck and .moab.ck.1 files and restart Moab,
the job will run quite happily.
Here's a current example:
[root@brecca-m root]# checkjob -v 779373
job 779373 (RM job '779373.brecca-m.vpac.org')
AName: j09
State: Idle
Creds: user:xxxxx group:xxxxx class:dque
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Fri Feb 23 14:35:47
(Time Queued Total: 2:19:54:09 Eligible: 1:01:30:49)
Total Requested Tasks: 20
Total Requested Nodes: 1
Req[0] TaskCount: 20 Partition: ALL
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
OutputFile: - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.o779373)
ErrorFile: - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.e779373)
BypassCount: 36
User Specified Partition Mask: [SHARED][base]
System Available Partition Mask: [SHARED][base]
Partition Mask: [ALL]
SrcRM: base DstRM: base DstRMJID: 779373.brecca-m.vpac.org
Flags: RESTARTABLE,FSVIOLATION
Attr: FSVIOLATION,checkpoint
StartPriority: 1
PE: 20.00
Holds: Batch:NoResources
NOTE: job cannot run (job has hold in place)
Node Availability for Partition base --------
NOTE: job cannot run (job has hold in place)
node001 available: 2 tasks supported
node002 available: 1 tasks supported
node003 available: 1 tasks supported
node004 available: 2 tasks supported
node005 rejected: State (Busy)
node006 available: 1 tasks supported
node007 available: 2 tasks supported
node008 available: 2 tasks supported
node009 available: 2 tasks supported
node010 available: 2 tasks supported
node011 available: 2 tasks supported
node012 available: 2 tasks supported
node013 rejected: CPU
node014 available: 1 tasks supported
node015 available: 2 tasks supported
node016 rejected: State (Busy)
node017 rejected: State (Busy)
node018 available: 1 tasks supported
node019 rejected: State (Busy)
node020 rejected: State (Busy)
node021 available: 1 tasks supported
node022 available: 2 tasks supported
node023 rejected: State (Busy)
node025 rejected: State (Busy)
node026 rejected: State (Busy)
node027 rejected: State (Busy)
node029 rejected: State (Busy)
node030 rejected: CPU
node031 rejected: State (Busy)
node032 available: 1 tasks supported
node033 rejected: State (Busy)
node034 rejected: State (Busy)
node035 rejected: State (Busy)
node036 available: 1 tasks supported
node037 rejected: State (Busy)
node038 available: 1 tasks supported
node039 rejected: State (Busy)
node040 rejected: State (Busy)
node041 available: 1 tasks supported
node042 rejected: CPU
node043 available: 2 tasks supported
node044 available: 1 tasks supported
node045 rejected: State (Busy)
node046 available: 1 tasks supported
node047 available: 1 tasks supported
node048 rejected: CPU
node049 available: 2 tasks supported
node050 available: 1 tasks supported
node051 available: 1 tasks supported
node052 available: 1 tasks supported
node053 available: 2 tasks supported
node054 available: 2 tasks supported
node055 available: 1 tasks supported
node056 rejected: State (Busy)
node057 available: 2 tasks supported
node058 available: 2 tasks supported
node059 available: 1 tasks supported
node060 rejected: State (Busy)
node061 available: 1 tasks supported
node062 rejected: State (Busy)
node063 available: 1 tasks supported
node064 rejected: CPU
node065 available: 1 tasks supported
node066 rejected: State (Busy)
node067 rejected: State (Busy)
node068 available: 1 tasks supported
node069 rejected: State (Busy)
node070 rejected: State (Busy)
node072 rejected: State (Busy)
node073 available: 1 tasks supported
node074 rejected: State (Busy)
node075 available: 1 tasks supported
node076 available: 2 tasks supported
node077 rejected: CPU
node078 available: 1 tasks supported
node079 rejected: State (Busy)
node080 rejected: Reserved (sque.4)
node081 available: 1 tasks supported
node082 available: 2 tasks supported
node083 available: 1 tasks supported
node084 available: 2 tasks supported
node085 rejected: State (Down)
node086 rejected: State (Busy)
node087 rejected: CPU
node088 rejected: State (Busy)
node089 rejected: CPU
node090 available: 1 tasks supported
NOTE: job hold active - Batch
Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[2] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[3] 20 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[4] 15 nodes unavailable to start reserved job after 27 seconds (job 779372 has exceeded wallclock limit on node node059 - check job)
Message[5] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
Message[7] 20 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[8] 18 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[9] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[10] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[11] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[12] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[13] 18 nodes unavailable to start reserved job after 29 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[14] 18 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[15] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[16] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[17] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
Message[18] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
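To see at a glance which jobs the hold messages blame, the message lines can be tallied by job ID. The following is only a minimal sketch: the regular expression is derived from the wording of the messages in this one listing and may not match other Moab versions, and the names `MSG_RE` and `blamed_jobs` are made up for this illustration.

```python
import re
from collections import Counter

# Matches the "(job N has exceeded wallclock limit on node X" part of the
# hold messages. Assumption: the wording is as in the listing above.
MSG_RE = re.compile(r"job (\d+) has exceeded wallclock limit on node (\S+)")


def blamed_jobs(checkjob_output: str) -> Counter:
    """Count how many hold messages name each job ID."""
    return Counter(m.group(1) for m in MSG_RE.finditer(checkjob_output))


# Four of the message lines from the listing, copied verbatim.
sample = """\
Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[4] 15 nodes unavailable to start reserved job after 27 seconds (job 779372 has exceeded wallclock limit on node node059 - check job)
Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
"""

print(sorted(blamed_jobs(sample)))
# -> ['777116', '778469', '778818', '779372']
```

Fed the full `checkjob -v` output, the same function would give the per-job counts for all 19 messages.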
Jobs 778469 and 779372 are no longer in the system.
Looking at jobs 777116 and 778818 we see:
[root@brecca-m root]# checkjob 777116
job 777116
AName: ZeusMP2Mountain
State: Running
Creds: user:yyyy group:yyyy class:run_2_months
WallTime: 19:10:02:53 of 48:00:00:00
SubmitTime: Tue Feb 6 09:18:17
(Time Queued Total: 15:11:06 Eligible: 20:01:13:17)
StartTime: Wed Feb 7 00:29:23
Total Requested Tasks: 32
Req[0] TaskCount: 32 Partition: base
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
NodesRequested: 20
Allocated Nodes:
[node077:1][node074:2][node069:2][node037:2]
[node031:2][node030:1][node029:2][node089:1]
[node087:1][node066:2][node064:1][node048:1]
[node042:1][node025:2][node023:2][node020:2]
[node019:2][node017:2][node016:2][node013:1]
Flags: RESTARTABLE,FSVIOLATION
Attr: FSVIOLATION,checkpoint
StartPriority: 336
Reservation '777116' ( -19days -> 28:13:57:49 Duration: 48:00:00:00)
[root@brecca-m root]# checkjob 778818
job 778818
AName: j00
State: Running
Creds: user:xxxx group:xxxx class:dque
WallTime: 1:10:40:18 of 4:00:00:00
SubmitTime: Tue Feb 20 12:23:25
(Time Queued Total: 4:11:33:54 Eligible: 00:00:00)
StartTime: Sat Feb 24 23:57:19
Total Requested Tasks: 20
Req[0] TaskCount: 20 Partition: base
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
NodesRequested: 16
Allocated Nodes:
[node086:1][node083:1][node081:1][node075:1]
[node073:1][node068:1][node067:2][node036:1]
[node035:2][node090:1][node060:1][node059:1]
[node056:2][node050:1][node045:1][node040:2]
StartCount: 1
BypassCount: 33
Flags: RESTARTABLE,FSVIOLATION
Attr: FSVIOLATION,checkpoint
StartPriority: 1
Reservation '778818' (-1:10:40:16 -> 2:13:19:44 Duration: 4:00:00:00)
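As a sanity check on the two WallTime lines above, the remaining time can be worked out directly. This is a rough sketch that assumes the fields are `D:HH:MM:SS`, or `HH:MM:SS` when there is no day part, which is how they appear in this output. The helper names `to_seconds`, `fmt` and `remaining` are hypothetical.

```python
def to_seconds(stamp: str) -> int:
    """Convert 'D:HH:MM:SS' or 'HH:MM:SS' to seconds."""
    parts = [int(p) for p in stamp.strip().split(":")]
    # Pad on the left so a value without a day field counts as 0 days.
    parts = [0] * (4 - len(parts)) + parts
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fmt(seconds: int) -> str:
    """Format seconds back into 'D:HH:MM:SS'."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{secs:02d}"


def remaining(walltime_line: str) -> str:
    """Time left for a line such as 'WallTime: USED of LIMIT'."""
    used, limit = walltime_line.split(" of ")
    return fmt(to_seconds(limit) - to_seconds(used.split()[-1]))


print(remaining("WallTime: 19:10:02:53 of 48:00:00:00"))  # job 777116
# -> 28:13:57:07
print(remaining("WallTime: 1:10:40:18 of 4:00:00:00"))    # job 778818
# -> 2:13:19:42
```

Both results are within a minute of the end times on the Reservation lines, so by Moab's own figures neither job is anywhere near its wallclock limit.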
Very odd..
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia