[Moabusers] Jobs getting held with odd (probably wrong) messages
Douglas Wightman
wightman at clusterresources.com
Wed Apr 4 09:12:28 MDT 2007
You can try the parameters:
JOBRETRYTIME
http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#jobretrytime
and
RESERVATIONRETRYTIME
http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#reservationretrytime
which tell Moab how to deal with these types of transient failures.
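
As a rough sketch only, assuming both parameters are set in moab.cfg and
take a [[[DD:]HH:]MM:]SS duration, the entries might look something like
the following. The five-minute values are purely illustrative, not
recommendations; check the parameter pages linked above for the defaults
and exact semantics in your version.

```
# moab.cfg (illustrative values, not tuned recommendations)

# How long Moab keeps retrying a job that failed to start
# because of a transient failure.
JOBRETRYTIME          00:05:00

# How long Moab keeps trying to allocate the reserved resources
# after the reservation start time, e.g. while a previous job
# is still occupying the nodes past its walltime.
RESERVATIONRETRYTIME  00:05:00
```

Moab would need to re-read its configuration for a change like this to
take effect.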
- Douglas
On Wed, 2007-04-04 at 11:00 +1000, Chris Samuel wrote:
> On Wed, 4 Apr 2007, Douglas Wightman wrote:
>
> > The fact that the job did go over its walltime leads me to believe that
> > these messages are real. Can you try a checkjob -v -v on this job?
>
> It's already finished I'm afraid, and I can't see any other candidates for
> having been affected in this way. :-(
>
> I was also hoping that setting DEFERSTARTCOUNT to 10 was going to help, but it
> doesn't appear to.
>
> > That can print out the time these messages were attached. They may have
> > been attached in quick succession.
>
> I do have a potential candidate for a root cause, it looks like our mpiexec is
> out of step with PBS (built against 2.0.0p9, we're running 2.0.0p10) so I'm
> going to rebuild that and see whether it helps.
>
> Of course even if it does solve our immediate issue it does mean that a job
> that gets stuck and doesn't die for a minute or so is going to cause this
> problem and it seems a little unfair on a queued job to hold it if a previous
> job has overrun.
>
> Shouldn't it just wait until either alternate resources become free or the
> problem job goes?
>
> cheers!
> Chris