[Moabusers] Force recalculation of job hostlist.
Justin Bronder
jsbronder at gmail.com
Wed Jul 26 13:45:36 MDT 2006
At our site we have a large number of jobs that are submitted with the
intent of using our Myrinet cards. While Torque does a great job
of detecting nodes that are having Ethernet issues (which are quite rare),
to my knowledge there is no equivalent check for Myrinet.
At the suggestion of Chris Vaughan and Douglas Wightman, I've been
using the class-specific JOBPROLOG to do some preliminary checking
on the health of the required nodes. This is working great, but I'm
having trouble figuring out how to reallocate the hostlist for the job
if a problem is found. I'd like to keep the job at the top of the queue
if at all possible and just mark the problem node offline.
Then I'd like to signal Moab to find a replacement for this node.
Job preemption and requeuing are not working, because at this point
Torque still sees the job as queued and the nodes as free, whereas Moab
sees the job as running and the nodes as reserved.
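For reference, here is a minimal sketch of the kind of prolog check described above. It is not the actual script. It assumes the job's host names arrive as command-line arguments (how Moab hands them to a JOBPROLOG is not shown here), and /usr/local/sbin/check_myrinet is a hypothetical site-specific test that exits non-zero on a bad node. The only real command used is Torque's "pbsnodes -o", which marks nodes offline. It does not solve the reallocation problem; it only covers the check-and-offline step.

```python
"""Sketch of a prolog-style node health check (illustrative only)."""
import subprocess

# Hypothetical site-specific Myrinet test; replace with a real check.
CHECK_CMD = ["/usr/local/sbin/check_myrinet"]


def node_is_healthy(node, check_cmd, run=subprocess.run):
    """Return True if the check command exits 0 for this node."""
    result = run(check_cmd + [node])
    return result.returncode == 0


def find_bad_nodes(nodes, check_cmd, run=subprocess.run):
    """Return the nodes that fail the health check, in input order."""
    return [n for n in nodes if not node_is_healthy(n, check_cmd, run)]


def mark_offline(nodes, run=subprocess.run):
    """Mark nodes offline in Torque with 'pbsnodes -o'."""
    if nodes:
        run(["pbsnodes", "-o"] + list(nodes))


def main(argv, run=subprocess.run):
    """Check every host in argv[1:]; return 1 if any was marked offline."""
    # Assumption: the host list is passed as plain arguments.
    bad = find_bad_nodes(argv[1:], CHECK_CMD, run)
    mark_offline(bad, run)
    return 1 if bad else 0
```

A wrapper script would end with sys.exit(main(sys.argv)), so that a failed check shows up as a non-zero exit status. The run parameter exists only so the logic can be exercised without a live Torque server.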
Any suggestions?
Thanks in advance,
Justin Bronder.