[Moabusers] Jobs getting held with odd (probably wrong) messages
Douglas Wightman
wightman at clusterresources.com
Tue Apr 3 09:13:18 MDT 2007
The fact that the job did go over its walltime leads me to believe that
these messages are real. Can you run checkjob -v -v on this job?
That can print out the time each of these messages was attached. They may
have been attached in quick succession.
- Douglas
On Tue, 2007-04-03 at 09:33 +1000, Chris Samuel wrote:
> On Thu, 15 Mar 2007, Douglas Wightman wrote:
>
> > Is this still occurring frequently with the latest 5.1.0? If so we'd
> > love to get some log files to track this down and get it fixed.
>
> OK - just confirmed that it's happened again and a couple of jobs went into
> BatchHold.
>
> We had set DEFERTIME 00:01:00 thinking that would mean the job would be
> retried after a minute but it looks like jobs stay in BatchHold indefinitely
> and we need to releasehold them now before they start (which is an improvement
> on nuking the checkpoint files).
>
> BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)
> Message[0] 19 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda001 - check job)
> Message[1] 15 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
> Message[2] 14 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
> Message[3] 15 nodes unavailable to start reserved job after 1 seconds (job 174887 has exceeded wallclock limit on node edda004 - check job)
> Message[4] 4 nodes unavailable to start reserved job after 1 seconds (job 175018 has exceeded wallclock limit on node edda039 - check job)
>
> None of those jobs are around at the moment. Could this be a race
> between cleaning up previous jobs and starting the new one?
>
> Looking at job 170860 it does appear to have exceeded its walltime by
> about 45 seconds.
>
> Resource_List.walltime=72:00:00
> resources_used.walltime=72:00:45
>
> On the mom I see a lot of messages after it tries to kill it saying:
>
> 04/02/2007 12:09:44;0001; pbs_mom;Job;170860.edda-m.vpac.org;cannot tm_reply to task 1
>
> Which could be related if that is delaying the termination of the job.
>
> cheers!
> Chris
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers
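
The 45-second overrun quoted above for job 170860 (Resource_List.walltime=72:00:00 against resources_used.walltime=72:00:45) can be checked with a small sketch. The helper names hms_to_seconds and walltime_overrun are hypothetical and are not part of Moab or TORQUE; the sketch assumes both values are plain HH:MM:SS strings, as they appear in the quoted job record.

```python
def hms_to_seconds(value: str) -> int:
    """Convert an HH:MM:SS walltime string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_overrun(requested: str, used: str) -> int:
    """Return how many seconds 'used' exceeds 'requested' (negative if under)."""
    return hms_to_seconds(used) - hms_to_seconds(requested)


# Values copied from the quoted job record for job 170860.
overrun = walltime_overrun("72:00:00", "72:00:45")
print(f"job 170860 exceeded its walltime by {overrun} seconds")
```

A positive result means the job ran past its requested limit, which is consistent with the "exceeded wallclock limit" text in the hold messages.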