[torquedev] Should a communication error between pbs_mom's kill a job ?
glen.beane at gmail.com
Fri May 22 21:48:17 MDT 2009
I'm pretty close to checking in the changes I mentioned below. I
started with a community patch that added a mom_config option to
control whether pjob->ji_nodekill is set after a POLL request
to a sister fails in mom_comm.c. If it is not set, then
job_over_limit does not kill the job with the "node X requested the
job terminate" error. Someone requested that this be controllable on a
per-job basis, so instead of a mom config file option, it is now
controlled via a job attribute (currently called fault_tolerant). The
particular user wanted this feature so that certain jobs could survive
the complete loss of a sister.
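
To make the control flow above concrete, here is a minimal sketch of the
idea. It is not the actual TORQUE code: the struct, the sentinel value and
every sketch_* name are hypothetical stand-ins, and only ji_nodekill,
fault_tolerant and job_over_limit correspond to names used in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "no node has asked for termination". */
#define SKETCH_NO_NODE (-1)

/* Heavily reduced, hypothetical stand-in for the mom's job structure. */
struct sketch_job {
    bool fault_tolerant; /* per-job attribute; false by default */
    int  ji_nodekill;    /* node that triggered termination, or SKETCH_NO_NODE */
};

/* Assumed hook: called when a POLL request to sister `node` fails. */
void sketch_on_poll_failure(struct sketch_job *pjob, int node)
{
    assert(pjob != NULL);
    /* Only flag the job for termination if it is not fault tolerant. */
    if (!pjob->fault_tolerant)
        pjob->ji_nodekill = node;
}

/* Simplified analogue of the job_over_limit check: a flagged node
 * means the job is to be killed. */
bool sketch_job_over_limit(const struct sketch_job *pjob)
{
    assert(pjob != NULL);
    return pjob->ji_nodekill != SKETCH_NO_NODE;
}
```

With fault_tolerant left false, a failed POLL to node 3 sets ji_nodekill to 3
and the check reports the job as over limit; with fault_tolerant set to true
the same failure leaves ji_nodekill untouched and the job keeps running.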
This attribute defaults to false, so the default behavior is the opposite
of what you all want. I was going to add a torque.cfg option to
specify that the default value should be true instead. (torque.cfg is
only used by qsub: if this option is set in torque.cfg and
fault_tolerant is not specified to qsub, then qsub would set it to true.)
However, based on this conversation, I think the best thing to do
would be to get rid of this new attribute and change the mom code so
that the mother superior never sets pjob->ji_nodekill when it gets an
error from a POLL request...
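
For reference, if the attribute were kept, the qsub-side default described
above could be resolved roughly as follows. This is only a sketch under
assumptions: the tri-state enum and the function name are hypothetical, and
no torque.cfg option name is implied.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state: a setting may be absent, false or true. */
enum sketch_tristate { SKETCH_UNSET, SKETCH_FALSE, SKETCH_TRUE };

/* Precedence: explicit qsub value, then torque.cfg default, then false. */
bool sketch_resolve_fault_tolerant(enum sketch_tristate from_qsub,
                                   enum sketch_tristate from_cfg)
{
    if (from_qsub != SKETCH_UNSET)
        return from_qsub == SKETCH_TRUE;
    if (from_cfg != SKETCH_UNSET)
        return from_cfg == SKETCH_TRUE;
    return false; /* built-in default of the attribute */
}
```

An explicit value given to qsub always wins, so a user could still turn
fault tolerance off for one job on a site whose torque.cfg default is true.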
On Mon, May 18, 2009 at 9:07 AM, Glen Beane <glen.beane at gmail.com> wrote:
> By the way, I was already working on a job attribute called
> "fault_tolerant" that prevents TORQUE from killing a job if a sister
> node goes down. I've just about wrapped this up. A system admin
> could set the default value of this to true (I was going to make this
> a torque.cfg option).
> Of course removing this check might make my work thus far a waste of time.