dbeer at adaptivecomputing.com
Tue Jan 18 10:00:06 MST 2011
As of 2.4.8, kill_delay has been fixed. However, in order to preserve the previously expected behavior, I added the parameter $kill_delay to the mom's config file. It must be set to true for the fixed kill_delay behavior to take effect.
One common problem that people run into when trying to use the kill_delay behavior is that even though the job being run may catch SIGTERM, the shell in which the job is running does not. The job script needs to handle SIGTERM at the shell level as well, or the expected behavior won't be achieved (although, judging by the depth of your email, perhaps you have already handled this part of the problem?).
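To illustrate the point about the enclosing shell, here is a minimal sketch of a wrapper that catches SIGTERM itself and passes it on to the real job, then waits for the job to finish its cleanup. It is written in Python rather than as a shell script, and the function name run_with_term_forwarding is hypothetical, not part of TORQUE. It assumes the job is started as a direct child of the wrapper.

```python
import signal
import subprocess
import sys


def run_with_term_forwarding(cmd):
    """Run cmd as a child process, forwarding SIGTERM to it.

    Hypothetical helper, not a TORQUE API. Returns the child's exit
    status (negative if the child was killed by a signal).
    """
    child = None

    def forward(signum, frame):
        # Instead of dying on SIGTERM, hand the signal to the job so it
        # can run its own termination procedure.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the handler before starting the job, so a SIGTERM that
    # arrives early does not kill the wrapper with the default action.
    previous = signal.signal(signal.SIGTERM, forward)
    try:
        child = subprocess.Popen(cmd)
        # Keep waiting after SIGTERM: the job decides when it exits.
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
```

A job script could then call something like run_with_term_forwarding(["./my_job"]), where ./my_job is a placeholder for the real program. The wrapper only helps for the SIGTERM stage; SIGKILL cannot be caught, so the job still has to finish cleaning up within the kill_delay window.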
----- Original Message -----
> Does kill_delay work in torque now? If so, from which version? As far
> as I know, it did not work in 2.3.x. See
> One cluster that I use has 2.3.7 installed (it's not mine, but as a
> user I would like to have kill_delay working).
> If it still doesn't work, let me explain what happens (at least what
> happened in old pure OpenPBS). From the mail I pointed to above, one
> can conclude that the job receives SIGKILL before the corresponding
> message in the server. This happens because of the following sequence:
> 1) The server sends SIGTERM.
> 2) pbs_mom receives the signal command from the server and sends
> SIGTERM to the job.
> 3) The job is about to be deleted, but BEFORE its termination procedure
> is done, SIGCHLD is sent back to pbs_mom as its parent process.
> 4) pbs_mom receives SIGCHLD and issues its OWN SIGKILL, without the
> server.
> 5) The job is, of course, killed uncleanly.
> 6) After kill_delay, the server finally sends SIGKILL, but there is no
> one left to kill.
> I fixed that on one small cluster with pure OpenPBS, and now everything
> works fine, but I don't know what the situation is in torque.
> Kind regards,
> Sergey Ivanov