[Moabusers] Jobs getting held with odd (probably wrong) messages

Chris Samuel csamuel at vpac.org
Tue Apr 3 19:00:08 MDT 2007


On Wed, 4 Apr 2007, Douglas Wightman wrote:

> The fact that the job did go over its walltime leads me to believe that
> these messages are real.  Can you try a checkjob -v -v on this job?

It's already finished, I'm afraid, and I can't see any other jobs that look
as though they were affected in this way. :-(

I was also hoping that setting DEFERSTARTCOUNT to 10 was going to help, but it 
doesn't appear to.

> That can print out the time these messages were attached.  They may have
> been attached in quick succession.  

I do have a candidate for the root cause: it looks like our mpiexec is out
of step with PBS (built against 2.0.0p9, while we're running 2.0.0p10), so
I'm going to rebuild it and see whether that helps.

Of course, even if that does solve our immediate issue, it still means that
a job that gets stuck and doesn't die for a minute or so is going to cause
this problem, and it seems a little unfair on a queued job to hold it
because a previous job has overrun.

Shouldn't it just wait until either alternate resources become free or the
problem job goes?

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
