[Moabusers] Moab keeps on trying after pbs_mom rejects.
wightman
wightman at clusterresources.com
Wed Nov 22 10:32:55 MST 2006
Can you tell from the pbs_server logs whether the server knows which node
is causing the problem?
- Douglas
On Wed, 2006-11-22 at 11:29 -0500, Justin Bronder wrote:
> According to the documentation and mschedctl -l, we have the default of
> five minutes already set. These jobs are being resubmitted to the
> prologue script every minute or less. The prologue signals success, but
> the nodes reject the job. Next iteration of Moab, the job is sent to the
> prologue again, with the same hostlist.
>
> So I would assume that Moab either doesn't know the job is getting
> rejected, which seems strange as the pbs_moms are correctly reporting
> errors in their logs, or somehow Moab is failing to realize that we have
> a misbehaving node.
>
> -Justin.
>
> On 11/22/06, wightman <wightman at clusterresources.com> wrote:
> Have a look at:
>
> http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
>
> When Moab knows which node is causing problems this parameter will tell
> Moab to put a reservation on the node, thus taking it out of the pool of
> feasible nodes.
>
> - Douglas
>
>
> On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
> > We have a Moab Prologue set up on the cluster and have alerted our
> > users that when they see the job queued in Torque and running in Moab,
> > it means their job is currently in the prologue (yes they could use
> > checkjob -v, but that hasn't caught on yet). Yesterday I was notified
> > that a job was continuously bouncing in and out of the prologue.
> >
> > The problem apparently was that one of the nodes was failing LDAP
> > lookups, so pbs_mom was rejecting the job. This was easy enough to track
> > back from the mother superior and then to the failing node. However,
> > Moab continued to schedule to that node despite the same failure each
> > time.
> >
> > Is there any method to get Moab to watch for this sort of scenario and
> > re-calculate the hosts it should run the job on? Even marking the
> > misbehaving node offline did not force Moab to change the node list;
> > only forcing a recycle from the command line got the job running on a
> > new hostlist.
> >
> > Thanks,
> >
> > Justin.