[Moabusers] Jobs getting held with odd (probably wrong) messages

Chris Samuel csamuel at vpac.org
Sun Feb 25 16:39:33 MST 2007


Hi folks,

We've been seeing some odd behaviour with both the 5.0 production releases
and the 5.1 betas on our IA32 Linux cluster.

We see jobs occasionally going into BatchHold with complaints that a valid job
has exceeded its walltime on various nodes, when in fact it still has days
(sometimes weeks) to go.

If we stop Moab, delete the .moab.ck and .moab.ck.1 files and restart Moab,
the job will run quite happily.
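
For anyone wanting to try the same workaround while keeping the checkpoint
files for later inspection, here is a minimal sketch in Python that moves them
aside rather than deleting them (a deliberate variation on what we do). It is
not part of Moab. The function name and both directory arguments are
hypothetical and supplied by the caller; only the two file names come from
the description above. It assumes Moab has already been stopped and does not
stop or start it itself.

```python
import shutil
import time
from pathlib import Path

# File names taken from the workaround described above.
CHECKPOINT_NAMES = (".moab.ck", ".moab.ck.1")


def set_aside_checkpoints(moab_home, backup_root):
    """Move Moab checkpoint files into a timestamped backup directory.

    moab_home and backup_root are hypothetical paths chosen by the caller.
    Moab must already be stopped; this function does not stop or start it.
    Returns the list of new file locations (empty if nothing was found).
    """
    moab_home = Path(moab_home)
    dest = Path(backup_root) / time.strftime("moab-ck-%Y%m%d-%H%M%S")
    moved = []
    for name in CHECKPOINT_NAMES:
        src = moab_home / name
        if src.exists():
            # Create the backup directory only when there is something to move.
            dest.mkdir(parents=True, exist_ok=True)
            target = dest / name
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved
```

Keeping the old files around should make it possible to compare them with
the fresh ones Moab writes after the restart.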

Here's a current example:

[root@brecca-m root]# checkjob -v 779373
job 779373 (RM job '779373.brecca-m.vpac.org')

AName: j09
State: Idle
Creds:  user:xxxxx  group:xxxxx  class:dque
WallTime:   00:00:00 of 4:00:00:00
SubmitTime: Fri Feb 23 14:35:47
  (Time Queued  Total: 2:19:54:09  Eligible: 1:01:30:49)

Total Requested Tasks: 20
Total Requested Nodes: 1

Req[0]  TaskCount: 20  Partition: ALL
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED



OutputFile:     - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.o779373)
ErrorFile:      - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.e779373)
BypassCount:    36
User Specified Partition Mask:   [SHARED][base]
System Available Partition Mask: [SHARED][base]
Partition Mask: [ALL]
SrcRM:          base  DstRM: base  DstRMJID: 779373.brecca-m.vpac.org
Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  1
PE:             20.00
Holds:          Batch:NoResources
NOTE:  job cannot run  (job has hold in place)
Node Availability for Partition base --------

NOTE:  job cannot run  (job has hold in place)
node001                  available: 2 tasks supported
node002                  available: 1 tasks supported
node003                  available: 1 tasks supported
node004                  available: 2 tasks supported
node005                  rejected: State (Busy)
node006                  available: 1 tasks supported
node007                  available: 2 tasks supported
node008                  available: 2 tasks supported
node009                  available: 2 tasks supported
node010                  available: 2 tasks supported
node011                  available: 2 tasks supported
node012                  available: 2 tasks supported
node013                  rejected: CPU
node014                  available: 1 tasks supported
node015                  available: 2 tasks supported
node016                  rejected: State (Busy)
node017                  rejected: State (Busy)
node018                  available: 1 tasks supported
node019                  rejected: State (Busy)
node020                  rejected: State (Busy)
node021                  available: 1 tasks supported
node022                  available: 2 tasks supported
node023                  rejected: State (Busy)
node025                  rejected: State (Busy)
node026                  rejected: State (Busy)
node027                  rejected: State (Busy)
node029                  rejected: State (Busy)
node030                  rejected: CPU
node031                  rejected: State (Busy)
node032                  available: 1 tasks supported
node033                  rejected: State (Busy)
node034                  rejected: State (Busy)
node035                  rejected: State (Busy)
node036                  available: 1 tasks supported
node037                  rejected: State (Busy)
node038                  available: 1 tasks supported
node039                  rejected: State (Busy)
node040                  rejected: State (Busy)
node041                  available: 1 tasks supported
node042                  rejected: CPU
node043                  available: 2 tasks supported
node044                  available: 1 tasks supported
node045                  rejected: State (Busy)
node046                  available: 1 tasks supported
node047                  available: 1 tasks supported
node048                  rejected: CPU
node049                  available: 2 tasks supported
node050                  available: 1 tasks supported
node051                  available: 1 tasks supported
node052                  available: 1 tasks supported
node053                  available: 2 tasks supported
node054                  available: 2 tasks supported
node055                  available: 1 tasks supported
node056                  rejected: State (Busy)
node057                  available: 2 tasks supported
node058                  available: 2 tasks supported
node059                  available: 1 tasks supported
node060                  rejected: State (Busy)
node061                  available: 1 tasks supported
node062                  rejected: State (Busy)
node063                  available: 1 tasks supported
node064                  rejected: CPU
node065                  available: 1 tasks supported
node066                  rejected: State (Busy)
node067                  rejected: State (Busy)
node068                  available: 1 tasks supported
node069                  rejected: State (Busy)
node070                  rejected: State (Busy)
node072                  rejected: State (Busy)
node073                  available: 1 tasks supported
node074                  rejected: State (Busy)
node075                  available: 1 tasks supported
node076                  available: 2 tasks supported
node077                  rejected: CPU
node078                  available: 1 tasks supported
node079                  rejected: State (Busy)
node080                  rejected: Reserved (sque.4)
node081                  available: 1 tasks supported
node082                  available: 2 tasks supported
node083                  available: 1 tasks supported
node084                  available: 2 tasks supported
node085                  rejected: State (Down)
node086                  rejected: State (Busy)
node087                  rejected: CPU
node088                  rejected: State (Busy)
node089                  rejected: CPU
node090                  available: 1 tasks supported
NOTE:  job hold active - Batch
Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[2] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[3] 20 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[4] 15 nodes unavailable to start reserved job after 27 seconds (job 779372 has exceeded wallclock limit on node node059 - check job)
Message[5] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
Message[7] 20 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[8] 18 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[9] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[10] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
Message[11] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[12] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[13] 18 nodes unavailable to start reserved job after 29 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[14] 18 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
Message[15] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[16] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[17] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
Message[18] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)


Jobs 778469 and 779372 are no longer in the system.
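
To see at a glance which jobs are being blamed, the hold messages can be
tallied by job ID. The following is a rough sketch in Python, assuming the
message wording is exactly as pasted above; the regular expression and the
function name are my own and not anything Moab provides. Only a few of the
messages are repeated here as sample input.

```python
import re
from collections import Counter

# Pattern matches the hold messages as they appear in the checkjob output.
MESSAGE_RE = re.compile(
    r"Message\[\d+\] (?P<nodes>\d+) nodes unavailable to start reserved job "
    r"after (?P<secs>\d+) seconds "
    r"\(job (?P<job>\d+) has exceeded wallclock limit "
    r"on node (?P<node>\S+) - check job\)"
)


def blamed_jobs(checkjob_text):
    """Count how often each job ID is blamed in the hold messages."""
    return Counter(m.group("job") for m in MESSAGE_RE.finditer(checkjob_text))


sample = """\
Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
Message[2] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
"""

# Print the blamed jobs, most frequently named first.
for job, count in blamed_jobs(sample).most_common():
    print(job, count)
```

Run over the full set of messages, job 777116 accounts for 16 of the 19.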

Looking at jobs 777116 and 778818 we see:

[root@brecca-m root]# checkjob  777116
job 777116

AName: ZeusMP2Mountain
State: Running
Creds:  user:yyyy  group:yyyy  class:run_2_months
WallTime:   19:10:02:53 of 48:00:00:00
SubmitTime: Tue Feb  6 09:18:17
  (Time Queued  Total: 15:11:06  Eligible: 20:01:13:17)

StartTime: Wed Feb  7 00:29:23
Total Requested Tasks: 32

Req[0]  TaskCount: 32  Partition: base
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
NodesRequested:  20

Allocated Nodes:
[node077:1][node074:2][node069:2][node037:2]
[node031:2][node030:1][node029:2][node089:1]
[node087:1][node066:2][node064:1][node048:1]
[node042:1][node025:2][node023:2][node020:2]
[node019:2][node017:2][node016:2][node013:1]



Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  336
Reservation '777116' (   -19days -> 28:13:57:49  Duration: 48:00:00:00)


[root@brecca-m root]# checkjob 778818
job 778818

AName: j00
State: Running
Creds:  user:xxxx  group:xxxx  class:dque
WallTime:   1:10:40:18 of 4:00:00:00
SubmitTime: Tue Feb 20 12:23:25
  (Time Queued  Total: 4:11:33:54  Eligible: 00:00:00)

StartTime: Sat Feb 24 23:57:19
Total Requested Tasks: 20

Req[0]  TaskCount: 20  Partition: base
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
NodesRequested:  16

Allocated Nodes:
[node086:1][node083:1][node081:1][node075:1]
[node073:1][node068:1][node067:2][node036:1]
[node035:2][node090:1][node060:1][node059:1]
[node056:2][node050:1][node045:1][node040:2]



StartCount:     1
BypassCount:    33
Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  1
Reservation '778818' (-1:10:40:16 -> 2:13:19:44  Duration: 4:00:00:00)
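
As a sanity check on the claim that these jobs have exceeded their wallclock
limit, the WallTime lines above can be compared directly. This is only a
sketch in Python under the assumption that checkjob prints durations as
[[DD:]HH:]MM:SS, which is how they look in this output; the helper names are
mine.

```python
import re


def parse_duration(text):
    """Convert [[DD:]HH:]MM:SS, as printed by checkjob above, to seconds."""
    parts = [int(p) for p in text.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected duration: {text!r}")
    units = (1, 60, 3600, 86400)  # seconds, minutes, hours, days
    return sum(value * unit for value, unit in zip(reversed(parts), units))


WALLTIME_RE = re.compile(r"WallTime:\s+(\S+) of (\S+)")


def walltime_remaining(checkjob_text):
    """Return seconds left before the limit (negative if exceeded)."""
    m = WALLTIME_RE.search(checkjob_text)
    if m is None:
        raise ValueError("no WallTime line found")
    used, limit = (parse_duration(g) for g in m.groups())
    return limit - used


# WallTime lines copied from the two checkjob outputs above.
for job, line in [
    ("777116", "WallTime:   19:10:02:53 of 48:00:00:00"),
    ("778818", "WallTime:   1:10:40:18 of 4:00:00:00"),
]:
    left = walltime_remaining(line)
    print(job, "exceeded" if left < 0 else f"{left / 86400:.1f} days left")
```

Both jobs come out with time remaining, which agrees with the end times on
their Reservation lines.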


Very odd...

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia


