[Moabusers] Moab keeps on trying after pbs_mom rejects.
wightman
wightman at clusterresources.com
Mon Dec 4 11:12:41 MST 2006
We can't reproduce this. I believe you are referring to the
node_check_script:
http://www.clusterresources.com/wiki/doku.php?id=torque:10.2_compute_node_health_check
Once the script no longer returns an ERROR message, Moab correctly places
the node back into the scheduling queue.
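
For reference, a minimal sketch of what such a health check script could look
like. This is an illustration only, not the script from the thread: the
function name, the scratch path and the 1 GiB threshold are all assumptions.
The one convention it relies on is the one described above: the node is
flagged while the script's output begins with ERROR, and a healthy run prints
nothing.

```python
#!/usr/bin/env python3
"""Hypothetical node health check: print an ERROR line when scratch space is low."""
import shutil

SCRATCH_PATH = "/tmp"         # assumed path; adjust for your nodes
MIN_FREE_BYTES = 1024 ** 3    # assumed threshold: 1 GiB


def health_message(path, min_free_bytes, disk_usage=shutil.disk_usage):
    """Return an 'ERROR ...' line if free space is below the threshold, else None."""
    free = disk_usage(path).free
    if free < min_free_bytes:
        return f"ERROR low free space on {path}: {free} bytes"
    return None


def main():
    message = health_message(SCRATCH_PATH, MIN_FREE_BYTES)
    if message is not None:
        # Output beginning with ERROR is what marks the node as unhealthy.
        print(message)
    # Printing nothing means the node is considered healthy again.


if __name__ == "__main__":
    main()
```

Once the underlying problem is fixed, the next run prints nothing, which is
the condition under which the node should become schedulable again.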
What are you seeing on your cluster?
- Douglas
On Mon, 2006-12-04 at 16:09 +1100, Chris Samuel wrote:
> On Thursday 23 November 2006 03:09, wightman wrote:
>
> > Have a look at:
> >
> > http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
> >
> > When Moab knows which node is causing problems this parameter will tell
> > Moab to put a reservation on the node, thus taking it out of the pool of
> > feasible nodes.
>
> We've been using this successfully, but how do you tell it to mark a node back
> online again afterwards when the problems are fixed and the script no longer
> returns the error message?
>
> We've tried clearing the messages out of the mom using momctl, but Moab seems
> to be caching them somewhere & we can't bring the nodes back again. :-(
>
> Help!
>
> cheers,
> Chris
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers