I've ended up using a per-job semaphore (a symlink) within the epilogue to determine whether the epilogue script has already run for a particular job. This way I can avoid doing certain things a second time if the epilogue gets called more than once.<br>
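<br>For illustration, here is a minimal sketch of that idea in Python. It is not the actual script: the marker directory <code>LOCK_DIR</code> and the function name <code>first_run_for_job</code> are made up for the example, and it assumes the epilogue can write to a node-local directory. It relies on symlink creation being atomic, so a second attempt for the same job fails instead of silently succeeding.<br>

```python
import os

# Hypothetical location for the per-job markers (an assumption for this
# sketch); any node-local directory the epilogue can write to would do.
LOCK_DIR = "/tmp/epilogue_locks"


def first_run_for_job(job_id, lock_dir=LOCK_DIR):
    """Return True only on the first call made for job_id."""
    os.makedirs(lock_dir, exist_ok=True)
    marker = os.path.join(lock_dir, job_id + ".epilogue")
    try:
        # Creating a symlink is atomic: it fails if the marker already
        # exists, so two epilogue runs for one job cannot both succeed.
        os.symlink(job_id, marker)
    except FileExistsError:
        return False
    return True
```

An epilogue would call <code>first_run_for_job()</code> with the job ID, which Torque passes to the epilogue as its first argument, and do the one-time work (accounting, for example) only when it returns True. Old markers accumulate, so they need to be cleaned up periodically.<br>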
<br>However, it would be a nice feature if torque could tell the epilogue script that it has already run for a particular job, and why it is being run again, i.e. after a sigterm or a sigkill.<br><br>Kevin<br><br><div class="gmail_quote">
On Tue, Dec 22, 2009 at 12:36 PM, Al Taufer <span dir="ltr"><<a href="mailto:ataufer@adaptivecomputing.com">ataufer@adaptivecomputing.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
You might want to try using qmgr to increase the value of the "kill_delay" parameter. Its default value is 2 seconds. kill_delay specifies how long the server waits before sending the sigkill request to the mom. Set it high enough that the jobs being qdel'ed have time to exit. This should eliminate the duplicate epilogue runs unless you encounter a job that does not respond to the sigterm.<br>
<br>
Al Taufer<br>
Adaptive Computing<br>
<div><div></div><div class="h5"><br>
----- "Kevin Van Workum" <<a href="mailto:vanw@sabalcore.com">vanw@sabalcore.com</a>> wrote:<br>
<br>
> On Sun, Dec 20, 2009 at 12:40 PM, Kevin Van Workum <<br>
> <a href="mailto:vanw@sabalcore.com">vanw@sabalcore.com</a> > wrote:<br>
><br>
><br>
><br>
> Sometimes, my epilogue script runs twice. This happens if a user<br>
> qdel's the job, but the job takes a while to exit, so a sigkill is<br>
> sent. The epilogue runs again when the sigkill is sent. However, after<br>
> some testing, this doesn't happen consistently. About 1 in 10 times.<br>
> Is this the expected behavior? How can I force torque to run the<br>
> epilogue script only once? Or maybe I can check from within my<br>
> epilogue that it has already run for this job? This is causing issues<br>
> with our internal accounting system.<br>
><br>
><br>
> It doesn't seem my message got posted, so I'm trying again.<br>
><br>
> -Kevin<br>
><br>
> --<br>
> Kevin Van Workum, PhD<br>
> Sabalcore Computing Inc.<br>
> Run your code on 500 processors.<br>
> Sign up for a free trial account.<br>
> <a href="http://www.sabalcore.com" target="_blank">www.sabalcore.com</a><br>
> 877-492-8027 ext. 11<br>
><br>
</div></div>> _______________________________________________<br>
> torqueusers mailing list<br>
> <a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
> <a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
</blockquote></div><br><br clear="all"><br>