[Moabusers] Force recalculation of job hostlist.

Matthew Britt msbritt at umich.edu
Thu Jul 27 12:02:33 MDT 2006


Have you looked at Torque's health check ability?  We're using this
to verify a variety of things on each node.  The node can set itself
offline before a job is ever assigned to it.

http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
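
As a rough sketch of what such a check might look like for Myrinet:
pbs_mom runs the script named by $node_check_script in its mom_priv/config,
and a line on stdout that starts with ERROR is reported as the node's
message.  The device path and the function names below are placeholders
chosen for illustration, and checking that a device node exists is only an
assumed stand-in for a real Myrinet test.

```python
#!/usr/bin/env python3
"""Sketch of a node health check for pbs_mom (the check itself is a placeholder)."""
import os

# Hypothetical device path; substitute whatever indicates a healthy card at your site.
MYRINET_DEVICE = "/dev/gm0"


def check_device(path):
    """Return an error string if the device node is missing, else None."""
    if not os.path.exists(path):
        return "Myrinet device %s not found" % path
    return None


def health_report(problems):
    """Build the stdout line: empty if healthy, 'ERROR <msg>' otherwise."""
    found = [p for p in problems if p]
    if not found:
        return ""
    # The Torque docs describe the message as limited to 256 characters.
    return "ERROR " + "; ".join(found)[:256]


def main():
    report = health_report([check_device(MYRINET_DEVICE)])
    if report:
        print(report)


if __name__ == "__main__":
    main()
```

The script should return quickly, since pbs_mom waits for it to finish.
How often it runs is set with $node_check_interval, and what the scheduler
does with a node that reports ERROR depends on how your site is configured.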

- matt

On Jul 26, 2006, at 3:45 PM, Justin Bronder wrote:

> At our site we have a large number of jobs that are submitted with the
> intent of utilizing our Myrinet cards.  While Torque does a great job
> of detecting nodes that are having Ethernet issues (obviously quite
> rare), to my knowledge there is not an equivalent for Myrinet.
>
> At the suggestion of Chris Vaughan and Douglas Wightman, I've been
> using the Class specific JOBPROLOG to do some preliminary checking
> on the health of the required nodes. This is working great, but I'm
> having trouble figuring out how to deal with reallocating the hostlist
> for the job if a problem is found.  I'd like to keep the job at the
> top of the queue if at all possible, and just mark the problem node
> offline. Then I'd like to signal Moab to find a replacement for this
> node.
>
> Job preemption and requeuing are not working, because at this point
> Torque still sees the job as queued and the nodes as free, whereas Moab
> sees the job as running and the nodes as reserved.
>
> Any suggestions?
>
>
> Thanks in advance,
>
> Justin Bronder.
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers


