[Moabusers] Jobs getting held with odd (probably wrong) messages

Sikora, Josef S josef.s.sikora at boeing.com
Thu Mar 15 14:11:17 MDT 2007


Douglas,

This went to the wrong person.

Josef

-----Original Message-----
From: Douglas Wightman [mailto:wightman at clusterresources.com] 
Sent: Wednesday, March 14, 2007 7:47 PM
To: Chris Samuel
Cc: moabusers at supercluster.org
Subject: Re: [Moabusers] Jobs getting held with odd (probably wrong) messages

Is this still occurring frequently with the latest 5.1.0?  If so we'd
love to get some log files to track this down and get it fixed.

Thanks,

- Douglas

On Mon, 2007-02-26 at 10:39 +1100, Chris Samuel wrote:
> Hi folks,
> 
> We've been seeing some odd things with both the 5.0 production releases
> and the 5.1 betas on our IA32 Linux cluster.
> 
> We see jobs occasionally going into BatchHold with complaints that a valid job
> has exceeded its walltime on various nodes, when in fact it still has days
> (sometimes weeks) to go.
> 
> If we stop Moab, delete the .moab.ck and .moab.ck.1 files and restart Moab
> the job will run quite happily.
> 
> Here's a current example:
> 
> [root at brecca-m root]# checkjob -v 779373
> job 779373 (RM job '779373.brecca-m.vpac.org')
> 
> AName: j09
> State: Idle
> Creds:  user:xxxxx  group:xxxxx  class:dque
> WallTime:   00:00:00 of 4:00:00:00
> SubmitTime: Fri Feb 23 14:35:47
>   (Time Queued  Total: 2:19:54:09  Eligible: 1:01:30:49)
> 
> Total Requested Tasks: 20
> Total Requested Nodes: 1
> 
> Req[0]  TaskCount: 20  Partition: ALL
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys:   ---  Arch: ---  Features: ---
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> 
> 
> 
> OutputFile:     - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.o779373)
> ErrorFile:      - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.e779373)
> BypassCount:    36
> User Specified Partition Mask:   [SHARED][base]
> System Available Partition Mask: [SHARED][base]
> Partition Mask: [ALL]
> SrcRM:          base  DstRM: base  DstRMJID: 779373.brecca-m.vpac.org
> Flags:          RESTARTABLE,FSVIOLATION
> Attr:           FSVIOLATION,checkpoint
> StartPriority:  1
> PE:             20.00
> Holds:          Batch:NoResources
> NOTE:  job cannot run  (job has hold in place)
> Node Availability for Partition base --------
> 
> NOTE:  job cannot run  (job has hold in place)
> node001                  available: 2 tasks supported
> node002                  available: 1 tasks supported
> node003                  available: 1 tasks supported
> node004                  available: 2 tasks supported
> node005                  rejected: State (Busy)
> node006                  available: 1 tasks supported
> node007                  available: 2 tasks supported
> node008                  available: 2 tasks supported
> node009                  available: 2 tasks supported
> node010                  available: 2 tasks supported
> node011                  available: 2 tasks supported
> node012                  available: 2 tasks supported
> node013                  rejected: CPU
> node014                  available: 1 tasks supported
> node015                  available: 2 tasks supported
> node016                  rejected: State (Busy)
> node017                  rejected: State (Busy)
> node018                  available: 1 tasks supported
> node019                  rejected: State (Busy)
> node020                  rejected: State (Busy)
> node021                  available: 1 tasks supported
> node022                  available: 2 tasks supported
> node023                  rejected: State (Busy)
> node025                  rejected: State (Busy)
> node026                  rejected: State (Busy)
> node027                  rejected: State (Busy)
> node029                  rejected: State (Busy)
> node030                  rejected: CPU
> node031                  rejected: State (Busy)
> node032                  available: 1 tasks supported
> node033                  rejected: State (Busy)
> node034                  rejected: State (Busy)
> node035                  rejected: State (Busy)
> node036                  available: 1 tasks supported
> node037                  rejected: State (Busy)
> node038                  available: 1 tasks supported
> node039                  rejected: State (Busy)
> node040                  rejected: State (Busy)
> node041                  available: 1 tasks supported
> node042                  rejected: CPU
> node043                  available: 2 tasks supported
> node044                  available: 1 tasks supported
> node045                  rejected: State (Busy)
> node046                  available: 1 tasks supported
> node047                  available: 1 tasks supported
> node048                  rejected: CPU
> node049                  available: 2 tasks supported
> node050                  available: 1 tasks supported
> node051                  available: 1 tasks supported
> node052                  available: 1 tasks supported
> node053                  available: 2 tasks supported
> node054                  available: 2 tasks supported
> node055                  available: 1 tasks supported
> node056                  rejected: State (Busy)
> node057                  available: 2 tasks supported
> node058                  available: 2 tasks supported
> node059                  available: 1 tasks supported
> node060                  rejected: State (Busy)
> node061                  available: 1 tasks supported
> node062                  rejected: State (Busy)
> node063                  available: 1 tasks supported
> node064                  rejected: CPU
> node065                  available: 1 tasks supported
> node066                  rejected: State (Busy)
> node067                  rejected: State (Busy)
> node068                  available: 1 tasks supported
> node069                  rejected: State (Busy)
> node070                  rejected: State (Busy)
> node072                  rejected: State (Busy)
> node073                  available: 1 tasks supported
> node074                  rejected: State (Busy)
> node075                  available: 1 tasks supported
> node076                  available: 2 tasks supported
> node077                  rejected: CPU
> node078                  available: 1 tasks supported
> node079                  rejected: State (Busy)
> node080                  rejected: Reserved (sque.4)
> node081                  available: 1 tasks supported
> node082                  available: 2 tasks supported
> node083                  available: 1 tasks supported
> node084                  available: 2 tasks supported
> node085                  rejected: State (Down)
> node086                  rejected: State (Busy)
> node087                  rejected: CPU
> node088                  rejected: State (Busy)
> node089                  rejected: CPU
> node090                  available: 1 tasks supported
> NOTE:  job hold active - Batch
> Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
> Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
> Message[2] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[3] 20 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[4] 15 nodes unavailable to start reserved job after 27 seconds (job 779372 has exceeded wallclock limit on node node059 - check job)
> Message[5] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
> Message[7] 20 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[8] 18 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[9] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[10] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[11] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[12] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[13] 18 nodes unavailable to start reserved job after 29 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[14] 18 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[15] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[16] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
> Message[17] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
> Message[18] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
> 
> 
> Jobs 778469 and 779372 are no longer in the system.
> 
> Looking at jobs 777116 and 778818 we see:
> 
> [root at brecca-m root]# checkjob  777116
> job 777116
> 
> AName: ZeusMP2Mountain
> State: Running
> Creds:  user:yyyy  group:yyyy  class:run_2_months
> WallTime:   19:10:02:53 of 48:00:00:00
> SubmitTime: Tue Feb  6 09:18:17
>   (Time Queued  Total: 15:11:06  Eligible: 20:01:13:17)
> 
> StartTime: Wed Feb  7 00:29:23
> Total Requested Tasks: 32
> 
> Req[0]  TaskCount: 32  Partition: base
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys:   ---  Arch: ---  Features: ---
> NodesRequested:  20
> 
> Allocated Nodes:
> [node077:1][node074:2][node069:2][node037:2]
> [node031:2][node030:1][node029:2][node089:1]
> [node087:1][node066:2][node064:1][node048:1]
> [node042:1][node025:2][node023:2][node020:2]
> [node019:2][node017:2][node016:2][node013:1]
> 
> 
> 
> Flags:          RESTARTABLE,FSVIOLATION
> Attr:           FSVIOLATION,checkpoint
> StartPriority:  336
> Reservation '777116' (   -19days -> 28:13:57:49  Duration: 48:00:00:00)
> 
> 
> [root at brecca-m root]# checkjob 778818
> job 778818
> 
> AName: j00
> State: Running
> Creds:  user:xxxx  group:xxxx  class:dque
> WallTime:   1:10:40:18 of 4:00:00:00
> SubmitTime: Tue Feb 20 12:23:25
>   (Time Queued  Total: 4:11:33:54  Eligible: 00:00:00)
> 
> StartTime: Sat Feb 24 23:57:19
> Total Requested Tasks: 20
> 
> Req[0]  TaskCount: 20  Partition: base
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys:   ---  Arch: ---  Features: ---
> NodesRequested:  16
> 
> Allocated Nodes:
> [node086:1][node083:1][node081:1][node075:1]
> [node073:1][node068:1][node067:2][node036:1]
> [node035:2][node090:1][node060:1][node059:1]
> [node056:2][node050:1][node045:1][node040:2]
> 
> 
> 
> StartCount:     1
> BypassCount:    33
> Flags:          RESTARTABLE,FSVIOLATION
> Attr:           FSVIOLATION,checkpoint
> StartPriority:  1
> Reservation '778818' (-1:10:40:16 -> 2:13:19:44  Duration: 4:00:00:00)
> 
> 
> Very odd..
> 
> cheers!
> Chris
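
The "exceeded wallclock limit" complaints can be checked against the WallTime lines in the checkjob output quoted above. What follows is only a rough sketch, not anything Moab ships: the function names are hypothetical, and it assumes the durations are printed as [[DD:]HH:]MM:SS, which is inferred from the output shown here rather than taken from Moab documentation.

```python
import re

# Matches a line such as "WallTime:   19:10:02:53 of 48:00:00:00".
WALLTIME_RE = re.compile(r"WallTime:\s+(\S+)\s+of\s+(\S+)")


def moab_duration_to_seconds(text):
    """Convert a duration written as [[DD:]HH:]MM:SS to seconds.

    The field layout is an assumption based on the checkjob output
    in this thread.
    """
    parts = [int(p) for p in text.strip().split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected duration: {text!r}")
    # Pad on the left so the fields are always days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def walltime_remaining(checkjob_output):
    """Return seconds left before the walltime limit (negative if exceeded)."""
    match = WALLTIME_RE.search(checkjob_output)
    if match is None:
        raise ValueError("no WallTime line found")
    used, limit = (moab_duration_to_seconds(g) for g in match.groups())
    return limit - used


if __name__ == "__main__":
    # WallTime lines copied from the checkjob output for the two running jobs.
    samples = {
        "777116": "WallTime:   19:10:02:53 of 48:00:00:00",
        "778818": "WallTime:   1:10:40:18 of 4:00:00:00",
    }
    for job_id, line in samples.items():
        left = walltime_remaining(line)
        status = "exceeded" if left < 0 else "within limit"
        print(f"job {job_id}: {left} seconds remaining ({status})")
```

Fed the two WallTime lines above, this reports a positive remainder for both running jobs (roughly 28.5 days for 777116 and 2.5 days for 778818), which matches the observation that neither job has actually exceeded its limit.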

_______________________________________________
moabusers mailing list
moabusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/moabusers

