[Moabusers] Jobs getting held with odd (probably wrong) messages
Chris Samuel
csamuel at vpac.org
Mon Apr 2 17:33:15 MDT 2007
On Thu, 15 Mar 2007, Douglas Wightman wrote:
> Is this still occurring frequently with the latest 5.1.0? If so we'd
> love to get some log files to track this down and get it fixed.
OK - just confirmed that it's happened again and a couple of jobs went into
BatchHold.
We had set DEFERTIME 00:01:00 thinking that would mean the job would be
retried after a minute, but it looks like jobs stay in BatchHold indefinitely
and we need to releasehold them now before they start (which is an improvement
on nuking the checkpoint files).
BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)
Message[0] 19 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda001 - check job)
Message[1] 15 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
Message[2] 14 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
Message[3] 15 nodes unavailable to start reserved job after 1 seconds (job 174887 has exceeded wallclock limit on node edda004 - check job)
Message[4] 4 nodes unavailable to start reserved job after 1 seconds (job 175018 has exceeded wallclock limit on node edda039 - check job)
None of those jobs are around at the moment. Could this be a race
between cleaning up previous jobs and starting the new one?
Looking at job 170860 it does appear to have exceeded its walltime by
about 45 seconds.
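For anyone wanting to see which finished jobs keep getting blamed, here is a minimal Python sketch that tallies the job IDs named in the hold messages above. The regular expression is only inferred from the five messages pasted here, and the names HOLD_MSG and blamed_jobs are made up for illustration; nothing in it comes from Moab itself.

```python
import re
from collections import Counter

# Pattern inferred from the pasted messages only (an assumption, not a
# documented Moab format): "job <id> has exceeded wallclock limit on node <name>"
HOLD_MSG = re.compile(r"job (\d+) has exceeded wallclock limit on node (\S+)")


def blamed_jobs(text):
    """Count how many hold messages name each job ID."""
    return Counter(match.group(1) for match in HOLD_MSG.finditer(text))


sample = """\
Message[0] 19 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda001 - check job)
Message[1] 15 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
Message[2] 14 nodes unavailable to start reserved job after 1 seconds (job 170860 has exceeded wallclock limit on node edda011 - check job)
Message[3] 15 nodes unavailable to start reserved job after 1 seconds (job 174887 has exceeded wallclock limit on node edda004 - check job)
Message[4] 4 nodes unavailable to start reserved job after 1 seconds (job 175018 has exceeded wallclock limit on node edda039 - check job)
"""

counts = blamed_jobs(sample)
for job_id, hits in counts.most_common():
    print(job_id, hits)
```

On the messages above, job 170860 accounts for three of the five entries.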
Resource_List.walltime=72:00:00
resources_used.walltime=72:00:45
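As a quick check on that 45 second figure, a small Python sketch that converts the two walltime values to seconds and subtracts them. The helper name hms_to_seconds is hypothetical, and it assumes the values are always in plain HH:MM:SS form as shown here.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string (as in the walltime fields) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


requested = hms_to_seconds("72:00:00")  # Resource_List.walltime
used = hms_to_seconds("72:00:45")       # resources_used.walltime
overrun = used - requested

print(overrun)  # 45
```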
On the mom I see a lot of messages after it tries to kill it saying:
04/02/2007 12:09:44;0001; pbs_mom;Job;170860.edda-m.vpac.org;cannot tm_reply to task 1
Which could be related if that is delaying the termination of the job.
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia