[Moabusers] Jobs getting held with odd (probably wrong) messages

Douglas Wightman wightman at clusterresources.com
Wed Apr 4 09:12:28 MDT 2007


You can try the parameters:

JOBRETRYTIME
http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#jobretrytime

and 

RESERVATIONRETRYTIME
http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#reservationretrytime

which tell Moab how to deal with these types of transient failures.
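
As a rough sketch, both parameters would go in moab.cfg along these lines.
The five-minute values are illustrative assumptions only, not
recommendations; check the parameter pages linked above for the exact time
format and the defaults before settling on anything.

```
# moab.cfg fragment (values are assumed for illustration)

# How long Moab keeps retrying a job that failed to start
# because of a transient failure
JOBRETRYTIME          00:05:00

# How long Moab keeps trying to allocate resources for a job
# after its reserved resources should have become available
RESERVATIONRETRYTIME  00:05:00
```

A retry window somewhat longer than the time an overrunning job takes to be
cleaned up should, in principle, let the queued job wait rather than be held.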

- Douglas


On Wed, 2007-04-04 at 11:00 +1000, Chris Samuel wrote:
> On Wed, 4 Apr 2007, Douglas Wightman wrote:
> 
> > The fact that the job did go over its walltime leads me to believe that
> > these messages are real.  Can you try a checkjob -v -v on this job?
> 
> It's already finished I'm afraid, and I can't see any other candidates for 
> having been affected in this way. :-(
> 
> I was also hoping that setting DEFERSTARTCOUNT to 10 was going to help, but it 
> doesn't appear to.
> 
> > That can print out the time these messages were attached.  They may have
> > been attached in quick succession.  
> 
> I do have a potential candidate for a root cause, it looks like our mpiexec is 
> out of step with PBS (built against 2.0.0p9, we're running 2.0.0p10) so I'm 
> going to rebuild that and see whether it helps.
> 
> Of course even if it does solve our immediate issue it does mean that a job 
> that gets stuck and doesn't die for a minute or so is going to cause this 
> problem and it seems a little unfair on a queued job to hold it if a previous 
> job has overrun.
> 
> Shouldn't it just wait until either alternate resources become free or the 
> problem job goes ?
> 
> cheers!
> Chris
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers


