[Moabusers] Force recalculation of job hostlist.
jsbronder at gmail.com
Thu Jul 27 12:10:20 MDT 2006
Excellent, that's exactly what I was looking for.
Thanks for your help,
On 7/27/06, Matthew Britt <msbritt at umich.edu> wrote:
> Have you looked at Torque's health check ability? We're using this
> to verify a variety of things on each node. The node can set itself
> offline before a job is ever assigned to it.
> - matt
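
For the archive, here is a minimal sketch of what such a health check
script could look like. As far as I understand it, pbs_mom runs the
script named by $node_check_script in mom_priv/config (every
$node_check_interval polls) and treats any output line beginning with
ERROR as a failed check. The probe path below is hypothetical; swap in
whatever actually verifies the Myrinet card at your site. The script
could also call "pbsnodes -o" itself, which is not shown here.

```python
#!/usr/bin/env python3
"""Sketch of a pbs_mom node health check (assumptions noted inline)."""
import subprocess
import sys

# Hypothetical probe: replace with a site-specific Myrinet check.
# It is assumed to exit non-zero when the card is unhealthy.
PROBE_CMD = ["/usr/local/sbin/check_myrinet"]


def health_line(probe_cmd, timeout=30):
    """Return an 'ERROR ...' line if the probe fails, or None if it passes."""
    try:
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung probe.
        return "ERROR probe could not be run: %s" % exc
    if result.returncode != 0:
        return "ERROR probe exited with status %d" % result.returncode
    return None


def main():
    line = health_line(PROBE_CMD)
    if line is not None:
        # Output starting with ERROR is what flags the node as unhealthy.
        print(line)
    return 0


if __name__ == "__main__":
    main()
```

How the scheduler reacts to the ERROR line depends on the site
configuration, so treat this as a starting point only.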
> On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:
> > At our site we have a large number of jobs that are submitted with the
> > intent of utilizing our Myrinet cards. While Torque does a great job
> > of detecting nodes that are having Ethernet issues (obviously quite
> > rare), to my knowledge there is no equivalent for Myrinet.
> > At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> > using the Class specific JOBPROLOG to do some preliminary checking
> > on the health of the required nodes. This is working great, but I'm
> > having trouble figuring out how to deal with reallocating the hostlist
> > for the job if a problem is found. I'd like to keep the job at the
> > top of the queue if at all possible, and just mark the problem node
> > offline. Then I'd like to signal Moab to find a replacement for this
> > node.
> > Job preemption and requeuing are not working, because at this point
> > Torque still sees the job as queued and the nodes as free, whereas
> > Moab sees the job as running and the nodes as reserved.
> > Any suggestions?
> > Thanks in advance,
> > Justin Bronder.
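
A rough sketch of the prolog-side check described above, for anyone
searching the archive later. It splits the job's hostlist into healthy
and failed nodes and marks each failed node offline with "pbsnodes -o".
The is_healthy callable is a hypothetical stand-in for the real Myrinet
probe, and the function names are mine. What Moab does with a non-zero
prolog status, and whether it then finds a replacement node, is the
open question in this thread; the sketch does not answer it.

```python
import subprocess


def split_hosts(hosts, is_healthy):
    """Partition hosts into (healthy, failed) using the supplied check."""
    healthy, failed = [], []
    for host in hosts:
        (healthy if is_healthy(host) else failed).append(host)
    return healthy, failed


def offline_command(host):
    """Build the Torque command that marks one node offline."""
    return ["pbsnodes", "-o", host]


def check_and_offline(hosts, is_healthy, run=subprocess.call):
    """Offline every failed host; return 1 if any host failed, else 0."""
    _healthy, failed = split_hosts(hosts, is_healthy)
    for host in failed:
        # 'run' is injectable so the logic can be exercised without Torque.
        run(offline_command(host))
    return 1 if failed else 0
```

Passing a different run callable (for example, one that only logs the
command) makes it possible to dry-run the check on a live cluster.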
> > _______________________________________________
> > moabusers mailing list
> > moabusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/moabusers