[Moabusers] Moab keeps on trying after pbs_mom rejects.

wightman wightman at clusterresources.com
Mon Dec 4 11:12:41 MST 2006


We can't reproduce this.  I believe you are referring to the
node_check_script:

http://www.clusterresources.com/wiki/doku.php?id=torque:10.2_compute_node_health_check

When the script no longer returns an ERROR message, Moab correctly places
the node back into the pool of schedulable nodes.
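
For reference, here is a minimal sketch of what such a health check could
look like. It assumes pbs_mom runs it via $node_check_script in the mom
config and treats output beginning with ERROR as a failure, as described on
the wiki page above. The free-space check, the 1 GiB threshold, and all
function names are hypothetical and only illustrate the shape of the
script; they are not taken from your setup.

```python
#!/usr/bin/env python3
"""Hypothetical node health check sketch for use as a node_check_script."""
import shutil

# Assumed threshold for illustration only: require 1 GiB free.
MIN_FREE_BYTES = 1024 ** 3


def check_free_space(path="/tmp", min_free=MIN_FREE_BYTES, usage=shutil.disk_usage):
    """Return a problem description if `path` is low on space, else None."""
    try:
        free = usage(path).free
    except OSError as exc:
        return f"cannot check {path}: {exc}"
    if free < min_free:
        return f"{path} has only {free} bytes free"
    return None


def health_message(checks):
    """Return 'ERROR <reason>' for the first failing check, or '' if healthy."""
    for check in checks:
        problem = check()
        if problem:
            return f"ERROR {problem}"
    return ""


def main():
    # Print nothing on a healthy node, so no ERROR message is reported.
    message = health_message([check_free_space])
    if message:
        print(message)


if __name__ == "__main__":
    main()
```

The point of the sketch is that the script stays silent once the fault is
fixed, so the next run reports no ERROR message for the node.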

What are you seeing on your cluster?

- Douglas

On Mon, 2006-12-04 at 16:09 +1100, Chris Samuel wrote:
> On Thursday 23 November 2006 03:09, wightman wrote:
> 
> > Have a look at:
> >
> > http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
> >
> > When Moab knows which node is causing problems this parameter will tell
> > Moab to put a reservation on the node, thus taking it out of the pool of
> > feasible nodes.
> 
> We've been using this successfully, but how do you tell it to mark a node back
> online again afterwards when the problems are fixed and the script no longer
> returns the error message?
> 
> We've tried clearing the messages out of the mom using momctl, but Moab seems 
> to be caching them somewhere & we can't bring the nodes back again. :-(
> 
> Help!
> 
> cheers,
> Chris
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers


