[Moabusers] Jobs getting held with odd (probably wrong) messages
Sikora, Josef S
josef.s.sikora at boeing.com
Thu Mar 15 14:11:17 MDT 2007
Douglas,
This went to the wrong person.
Josef
-----Original Message-----
From: Douglas Wightman [mailto:wightman at clusterresources.com]
Sent: Wednesday, March 14, 2007 7:47 PM
To: Chris Samuel
Cc: moabusers at supercluster.org
Subject: Re: [Moabusers] Jobs getting held with odd (probably wrong) messages
Is this still occurring frequently with the latest 5.1.0? If so we'd
love to get some log files to track this down and get it fixed.
Thanks,
- Douglas
On Mon, 2007-02-26 at 10:39 +1100, Chris Samuel wrote:
> Hi folks,
>
> We've been seeing some odd things with both the 5.0 production releases
> and the 5.1 betas on our IA32 Linux cluster.
>
> We see jobs occasionally going into BatchHold with complaints that a valid job
> has exceeded its walltime on various nodes, when in fact it still has days
> (sometimes weeks) to go.
>
> If we stop Moab, delete the .moab.ck and .moab.ck.1 files and restart Moab,
> the job will run quite happily.
>
> Here's a current example:
>
> [root@brecca-m root]# checkjob -v 779373
> job 779373 (RM job '779373.brecca-m.vpac.org')
>
> AName: j09
> State: Idle
> Creds: user:xxxxx group:xxxxx class:dque
> WallTime: 00:00:00 of 4:00:00:00
> SubmitTime: Fri Feb 23 14:35:47
> (Time Queued Total: 2:19:54:09 Eligible: 1:01:30:49)
>
> Total Requested Tasks: 20
> Total Requested Nodes: 1
>
> Req[0] TaskCount: 20 Partition: ALL
> Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: --- Arch: --- Features: ---
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
>
>
>
> OutputFile: - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.o779373)
> ErrorFile: - (head002:/home/san02/xxxx/Grant06/ParaCal/RunSplitt5n6/j09.e779373)
> BypassCount: 36
> User Specified Partition Mask: [SHARED][base]
> System Available Partition Mask: [SHARED][base]
> Partition Mask: [ALL]
> SrcRM: base DstRM: base DstRMJID: 779373.brecca-m.vpac.org
> Flags: RESTARTABLE,FSVIOLATION
> Attr: FSVIOLATION,checkpoint
> StartPriority: 1
> PE: 20.00
> Holds: Batch:NoResources
> NOTE: job cannot run (job has hold in place)
> Node Availability for Partition base --------
>
> NOTE: job cannot run (job has hold in place)
> node001 available: 2 tasks supported
> node002 available: 1 tasks supported
> node003 available: 1 tasks supported
> node004 available: 2 tasks supported
> node005 rejected: State (Busy)
> node006 available: 1 tasks supported
> node007 available: 2 tasks supported
> node008 available: 2 tasks supported
> node009 available: 2 tasks supported
> node010 available: 2 tasks supported
> node011 available: 2 tasks supported
> node012 available: 2 tasks supported
> node013 rejected: CPU
> node014 available: 1 tasks supported
> node015 available: 2 tasks supported
> node016 rejected: State (Busy)
> node017 rejected: State (Busy)
> node018 available: 1 tasks supported
> node019 rejected: State (Busy)
> node020 rejected: State (Busy)
> node021 available: 1 tasks supported
> node022 available: 2 tasks supported
> node023 rejected: State (Busy)
> node025 rejected: State (Busy)
> node026 rejected: State (Busy)
> node027 rejected: State (Busy)
> node029 rejected: State (Busy)
> node030 rejected: CPU
> node031 rejected: State (Busy)
> node032 available: 1 tasks supported
> node033 rejected: State (Busy)
> node034 rejected: State (Busy)
> node035 rejected: State (Busy)
> node036 available: 1 tasks supported
> node037 rejected: State (Busy)
> node038 available: 1 tasks supported
> node039 rejected: State (Busy)
> node040 rejected: State (Busy)
> node041 available: 1 tasks supported
> node042 rejected: CPU
> node043 available: 2 tasks supported
> node044 available: 1 tasks supported
> node045 rejected: State (Busy)
> node046 available: 1 tasks supported
> node047 available: 1 tasks supported
> node048 rejected: CPU
> node049 available: 2 tasks supported
> node050 available: 1 tasks supported
> node051 available: 1 tasks supported
> node052 available: 1 tasks supported
> node053 available: 2 tasks supported
> node054 available: 2 tasks supported
> node055 available: 1 tasks supported
> node056 rejected: State (Busy)
> node057 available: 2 tasks supported
> node058 available: 2 tasks supported
> node059 available: 1 tasks supported
> node060 rejected: State (Busy)
> node061 available: 1 tasks supported
> node062 rejected: State (Busy)
> node063 available: 1 tasks supported
> node064 rejected: CPU
> node065 available: 1 tasks supported
> node066 rejected: State (Busy)
> node067 rejected: State (Busy)
> node068 available: 1 tasks supported
> node069 rejected: State (Busy)
> node070 rejected: State (Busy)
> node072 rejected: State (Busy)
> node073 available: 1 tasks supported
> node074 rejected: State (Busy)
> node075 available: 1 tasks supported
> node076 available: 2 tasks supported
> node077 rejected: CPU
> node078 available: 1 tasks supported
> node079 rejected: State (Busy)
> node080 rejected: Reserved (sque.4)
> node081 available: 1 tasks supported
> node082 available: 2 tasks supported
> node083 available: 1 tasks supported
> node084 available: 2 tasks supported
> node085 rejected: State (Down)
> node086 rejected: State (Busy)
> node087 rejected: CPU
> node088 rejected: State (Busy)
> node089 rejected: CPU
> node090 available: 1 tasks supported
> NOTE: job hold active - Batch
> Message[0] 20 nodes unavailable to start reserved job after 0 seconds (job 778469 has exceeded wallclock limit on node node004 - check job)
> Message[1] 20 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
> Message[2] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[3] 20 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[4] 15 nodes unavailable to start reserved job after 27 seconds (job 779372 has exceeded wallclock limit on node node059 - check job)
> Message[5] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[6] 20 nodes unavailable to start reserved job after 30 seconds (job 778818 has exceeded wallclock limit on node node001 - check job)
> Message[7] 20 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[8] 18 nodes unavailable to start reserved job after 31 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[9] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[10] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node001 - check job)
> Message[11] 18 nodes unavailable to start reserved job after 30 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[12] 18 nodes unavailable to start reserved job after 26 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[13] 18 nodes unavailable to start reserved job after 29 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[14] 18 nodes unavailable to start reserved job after 28 seconds (job 777116 has exceeded wallclock limit on node node002 - check job)
> Message[15] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node003 - check job)
> Message[16] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node004 - check job)
> Message[17] 18 nodes unavailable to start reserved job after 0 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
> Message[18] 18 nodes unavailable to start reserved job after 1 seconds (job 777116 has exceeded wallclock limit on node node007 - check job)
>
>
> Jobs 778469 and 779372 are no longer in the system.
>
> Looking at jobs 777116 and 778818 we see:
>
> [root@brecca-m root]# checkjob 777116
> job 777116
>
> AName: ZeusMP2Mountain
> State: Running
> Creds: user:yyyy group:yyyy class:run_2_months
> WallTime: 19:10:02:53 of 48:00:00:00
> SubmitTime: Tue Feb 6 09:18:17
> (Time Queued Total: 15:11:06 Eligible: 20:01:13:17)
>
> StartTime: Wed Feb 7 00:29:23
> Total Requested Tasks: 32
>
> Req[0] TaskCount: 32 Partition: base
> Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: --- Arch: --- Features: ---
> NodesRequested: 20
>
> Allocated Nodes:
> [node077:1][node074:2][node069:2][node037:2]
> [node031:2][node030:1][node029:2][node089:1]
> [node087:1][node066:2][node064:1][node048:1]
> [node042:1][node025:2][node023:2][node020:2]
> [node019:2][node017:2][node016:2][node013:1]
>
>
>
> Flags: RESTARTABLE,FSVIOLATION
> Attr: FSVIOLATION,checkpoint
> StartPriority: 336
> Reservation '777116' ( -19days -> 28:13:57:49 Duration: 48:00:00:00)
>
>
> [root@brecca-m root]# checkjob 778818
> job 778818
>
> AName: j00
> State: Running
> Creds: user:xxxx group:xxxx class:dque
> WallTime: 1:10:40:18 of 4:00:00:00
> SubmitTime: Tue Feb 20 12:23:25
> (Time Queued Total: 4:11:33:54 Eligible: 00:00:00)
>
> StartTime: Sat Feb 24 23:57:19
> Total Requested Tasks: 20
>
> Req[0] TaskCount: 20 Partition: base
> Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: --- Arch: --- Features: ---
> NodesRequested: 16
>
> Allocated Nodes:
> [node086:1][node083:1][node081:1][node075:1]
> [node073:1][node068:1][node067:2][node036:1]
> [node035:2][node090:1][node060:1][node059:1]
> [node056:2][node050:1][node045:1][node040:2]
>
>
>
> StartCount: 1
> BypassCount: 33
> Flags: RESTARTABLE,FSVIOLATION
> Attr: FSVIOLATION,checkpoint
> StartPriority: 1
> Reservation '778818' (-1:10:40:16 -> 2:13:19:44 Duration: 4:00:00:00)
>
>
> Very odd..
>
> cheers!
> Chris
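
For anyone wanting to sanity-check the "exceeded wallclock limit" complaints against the WallTime lines quoted above, here is a rough sketch in Python. It assumes the durations are colon-separated [[DD:]HH:]MM:SS fields, as they appear in the checkjob output. The function names are made up for illustration and are not part of Moab.

```python
def moab_duration_to_seconds(text):
    """Convert a [[DD:]HH:]MM:SS duration string (assumed format) to seconds."""
    parts = [int(p) for p in text.strip().split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("unexpected duration format: %r" % text)
    # Right-align the fields against (days, hours, minutes, seconds).
    weights = [86400, 3600, 60, 1]
    return sum(v * w for v, w in zip(parts, weights[-len(parts):]))


def walltime_exceeded(used, limit):
    """Return True only if the used walltime is strictly past the limit."""
    return moab_duration_to_seconds(used) > moab_duration_to_seconds(limit)


# "WallTime: <used> of <limit>" values copied from the checkjob output above.
jobs = {
    "777116": ("19:10:02:53", "48:00:00:00"),
    "778818": ("1:10:40:18", "4:00:00:00"),
}

for job_id, (used, limit) in jobs.items():
    remaining = moab_duration_to_seconds(limit) - moab_duration_to_seconds(used)
    print(job_id, "exceeded:", walltime_exceeded(used, limit),
          "seconds remaining:", remaining)
```

Both running jobs come out well inside their limits by this arithmetic, which is consistent with the hold messages being wrong rather than the jobs actually overrunning.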