<br><br><div class="gmail_quote">On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <<a href="mailto:coreyf@clemson.edu">coreyf@clemson.edu</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div class="Ih2E3d">On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:<br>
>The parallel programming environments we use (e.g. MPICH) use SSH to<br>
>create processes on the sister nodes. If these jobs fail (are<br>
>deleted, the mother node crashes, etc), the spawned processes remain<br>
>on the sisters and eventually someone has to go and clean them out.<br>
>Is there any way to use epilogue scripts to keep track of these<br>
>processes and make sure they get killed properly if they need to be?<br>
><br>
<br>
</div>Because we do not place execution limits on nodes<br>
(users can have multiple jobs running on the same node<br>
and multiple users can be using the same node),<br>
we use an epilogue script which calls another script<br>
to clean up leftover processes based on the jobid.<br>
<br>
Here is the epilogue script, which runs on the mother superior node<br>
and executes as root.<br>
<br>
#!/bin/bash<br>
JOBID=$1<br>
JOBUSER=$2<br>
<br>
# get nodes involved in this job<br>
nodelist=/var/spool/torque/aux/$JOBID<br>
if [ -r $nodelist ] ; then<br>
nodes=$(sort $nodelist | uniq)<br>
else<br>
nodes=localhost<br>
fi<br>
<br>
# for each node involved in the job<br>
# kill any pids leftover from that job<br>
<br>
for i in $nodes ; do<br>
ssh $i "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"<br>
done<br>
<br>
<br>
Here is the 'cleanup' script:<br>
<br>
#!/bin/bash<br>
<br>
# look in the /proc process structure and<br>
# kill all pids associated with the passed in $JOBID<br>
# this script is run as the user, not root<br>
<br>
TOKILL=$1<br>
[ -z "${TOKILL}" ] && exit 1<br>
ME=`whoami`<br>
cd /<br>
find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |<br>
while read -r x; do<br>
# the process may have exited since find listed it<br>
[ -r "$x" ] || continue<br>
pid=$(basename "$(dirname "$x")")<br>
# environ is NUL-separated; extract PBS_JOBID without eval<br>
PBS_JOBID=$(cat "$x" 2>/dev/null | tr '\0' '\n' | sed -n 's/^PBS_JOBID=//p' | head -n 1)<br>
if [ "${PBS_JOBID}" = "${TOKILL}" ]; then<br>
kill -9 "$pid"<br>
fi<br>
done<br>
<br>
<br>
- Corey</blockquote></div><br><br>Such a script should be unnecessary if you use a TM-based job launcher for whatever flavor of MPI you use (the MOMs then start and track the remote processes themselves), but I guess it doesn't hurt.<br>
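<br>
If you do keep the cleanup script, it may be worth dry-running the matching step before letting it send kill -9. Below is a minimal sketch, assuming Linux /proc and bash. list_job_pids is a hypothetical helper name (not part of TORQUE), and it globs /proc directly instead of calling find. It only prints the PIDs, owned by the calling user, whose environment carries the given PBS_JOBID:<br>

```shell
#!/bin/bash
# Dry-run sketch: list (do not kill) the caller's processes that belong
# to a given job. list_job_pids is a hypothetical name for illustration.
list_job_pids() {
    local jobid=${1:-} envfile pid value
    # refuse to run without a job id, as the quoted cleanup script does
    [ -n "$jobid" ] || return 1
    for envfile in /proc/[0-9]*/environ; do
        # skip other users' processes and ones that have already exited
        [ -O "$envfile" ] && [ -r "$envfile" ] || continue
        pid=${envfile#/proc/}
        pid=${pid%/environ}
        # environ is NUL-separated; pull out PBS_JOBID without eval
        value=$(cat "$envfile" 2>/dev/null | tr '\0' '\n' |
                sed -n 's/^PBS_JOBID=//p' | head -n 1)
        if [ "$value" = "$jobid" ]; then
            echo "$pid"
        fi
    done
}
```

Run as the job's user on a compute node, list_job_pids "$JOBID" should print one PID per line. Once the list looks right, the output could be fed to kill.<br>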