[Moabusers] Moab keeps on trying after pbs_mom rejects.

wightman wightman at clusterresources.com
Wed Nov 22 09:09:02 MST 2006


Have a look at:

http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime

When Moab knows which node is causing problems, this parameter tells
Moab to place a reservation on that node, taking it out of the pool of
feasible nodes for the duration of the reservation.
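
As a rough sketch only, the entry in moab.cfg would look something like
the lines below. The ten-minute value is an assumption picked for
illustration, not a recommendation; check the parameter page linked
above for the default and the exact time format on your version.

    # moab.cfg fragment (illustrative value, assumed format [[[DD:]HH:]MM:]SS)
    # Reserve a node for 10 minutes after Moab detects a failure on it.
    NODEFAILURERESERVETIME  00:10:00

Moab has to reread its configuration before a change like this takes
effect.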

- Douglas


On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
> We have a Moab prologue set up on the cluster and have told our users
> that when they see a job queued in Torque but running in Moab, it
> means the job is currently in the prologue (yes, they could use
> checkjob -v, but that hasn't caught on yet).  Yesterday I was notified
> that a job was continuously bouncing in and out of the prologue.
> 
> The problem apparently was that one of the nodes was failing LDAP
> lookups, so pbs_mom was rejecting the job.  This was easy enough to
> track back from the mother superior and then to the failing node.
> However, Moab continued to schedule to that node despite the same
> failure each time.
> 
> Is there any way to get Moab to watch for this sort of scenario and
> recalculate the hosts it should run the job on?  Even marking the
> misbehaving node offline did not force Moab to change the node list;
> only forcing a recycle from the command line got the job running on a
> new hostlist.
> 
> Thanks,
> 
> Justin.


