[Moabusers] Moab keeps on trying after pbs_mom rejects.
wightman
wightman at clusterresources.com
Wed Nov 22 09:09:02 MST 2006
Have a look at:
http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
When Moab knows which node is causing problems, this parameter tells
Moab to place a reservation on that node, taking it out of the pool of
feasible nodes.
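
As a rough sketch, the entry in moab.cfg might look like the fragment
below. The ten-minute value is only an assumption for illustration;
check the parameter documentation linked above for the accepted time
format and the default in your Moab version.

```
# moab.cfg (illustrative value, not a recommendation)
# Keep a failed node reserved, and so out of the feasible pool,
# for 10 minutes after Moab detects the failure.
NODEFAILURERESERVETIME  00:10:00
```

The reservation only applies once Moab has attributed the failure to a
specific node, so it may not help if the rejection is never reported
back against that node.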
- Douglas
On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
> We have a Moab prologue set up on the cluster and have told our users
> that when a job shows as queued in Torque but running in Moab, it is
> currently in the prologue (yes, they could use checkjob -v, but that
> hasn't caught on yet). Yesterday I was notified that a job was
> continuously bouncing in and out of the prologue.
>
> The problem apparently was that one of the nodes was failing LDAP
> lookups, so pbs_mom was rejecting the job. This was easy enough to
> track back from the mother superior and then to the failing node.
> However, Moab continued to schedule to that node despite the same
> failure each time.
>
> Is there any way to get Moab to watch for this sort of scenario and
> recalculate the hosts it should run the job on? Even marking the
> misbehaving node offline did not force Moab to change the node list;
> only forcing a recycle from the command line got the job running on a
> new hostlist.
>
> Thanks,
>
> Justin.