[torquedev] Should a communication error between pbs_mom's kill a job ?
Joshua Bernstein
jbernstein at penguincomputing.com
Mon May 18 12:57:13 MDT 2009
Glen Beane wrote:
> by the way, I was already working on a job attribute called
> "fault_tolerant" that prevents TORQUE from killing a job if a sister
> node goes down. I've just about wrapped this up. A system admin
> could set the default value of this to true (I was going to make this
> a torque.cfg option)
Interesting. I like that idea. In what cases do you think this is useful?
-Joshua Bernstein
Software Engineer
Penguin Computing