[Moabusers] Moab keeps on trying after pbs_mom rejects.

Justin Bronder jsbronder at gmail.com
Wed Nov 22 09:29:11 MST 2006


According to the documentation and mschedctl -l, we already have the
NODEFAILURERESERVETIME default of five minutes set.  These jobs are being
resubmitted to the prologue script every minute or less.  The prologue signals
success, but the nodes reject the job.  On the next Moab iteration, the job is
sent to the prologue again with the same hostlist.

So I would assume that Moab either doesn't know the job is getting rejected,
which seems strange as the pbs_moms are correctly reporting errors in their
logs, or is somehow failing to realize that we have a misbehaving node.
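
For reference, a minimal sketch of what setting this parameter explicitly in
moab.cfg might look like.  The HH:MM:SS time format is an assumption here, and
the value is simply the five-minute default mentioned above, not a recommended
setting:

```
# moab.cfg (sketch; time format assumed to be HH:MM:SS)
# Reserve a node for five minutes after Moab detects a failure on it,
# removing it from the pool of feasible nodes for that period.
NODEFAILURERESERVETIME  00:05:00
```

The effective value can be checked in the output of mschedctl -l, as noted
above.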

-Justin.

On 11/22/06, wightman <wightman at clusterresources.com> wrote:
>
> Have a look at:
>
>
> http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
>
> When Moab knows which node is causing problems this parameter will tell
> Moab to put a reservation on the node, thus taking it out of the pool of
> feasible nodes.
>
> - Douglas
>
>
> On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
> > We have a Moab prologue set up on the cluster and have alerted our users
> > that when they see a job queued in Torque and running in Moab, it means
> > their job is currently in the prologue (yes, they could use checkjob -v,
> > but that hasn't caught on yet).  Yesterday I was notified that a job was
> > continuously bouncing in and out of the prologue.
> >
> > The problem apparently was that one of the nodes was failing LDAP lookups,
> > so pbs_mom was rejecting the job.  This was easy enough to track back from
> > the mother superior to the failing node.  However, Moab continued to
> > schedule to that node despite the same failure each time.
> >
> > Is there any method to get Moab to watch for this sort of scenario and
> > re-calculate the hosts it should run the job on?  Even marking the
> > misbehaving node offline did not force Moab to change the node list; only
> > forcing a recycle from the command line got the job running on a new
> > hostlist.
> >
> > Thanks,
> >
> > Justin.
> > _______________________________________________
> > moabusers mailing list
> > moabusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/moabusers
>
>