[Moabusers] Jobs getting held with odd (probably wrong) messages
Chris Samuel
csamuel at vpac.org
Tue Apr 3 19:00:08 MDT 2007
On Wed, 4 Apr 2007, Douglas Wightman wrote:
> The fact that the job did go over its walltime leads me to believe that
> these messages are real. Can you try a checkjob -v -v on this job?
It's already finished I'm afraid, and I can't see any other candidates for
having been affected in this way. :-(
I was also hoping that setting DEFERSTARTCOUNT to 10 would help, but it
doesn't appear to have.
> That can print out the time these messages were attached. They may have
> been attached in quick succession.
I do have a potential candidate for a root cause: it looks like our mpiexec is
out of step with PBS (built against 2.0.0p9, while we're running 2.0.0p10), so
I'm going to rebuild it and see whether that helps.
Of course, even if that does solve our immediate issue, it still means that a
job that gets stuck and doesn't die for a minute or so is going to cause this
problem, and it seems a little unfair on a queued job to hold it because a
previous job has overrun.
Shouldn't it just wait until either alternate resources become free or the
problem job goes?
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia