<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.6000.16681" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>We also strongly encourage the use of
mpiexec(1) or switches on the various "mpirun" commands that try
to</FONT></DIV>
<DIV><FONT face=Arial size=2>clean up (like the -mx-kill (or -gm-kill) switch
for Myrinet MPI codes). But we have found it's nice to have a
backup</FONT></DIV>
<DIV><FONT face=Arial size=2>for cleaning out orphaned processes (various
problems cause orphaned processes -- node problems, commercial codes not easily
changed to use mpiexec(1), machines </FONT></DIV>
<DIV><FONT face=Arial size=2>running other schedulers such as LSF without the
HPC options, or running no schedulers at all, ...). Therefore, we have an
in-house script run by</FONT></DIV>
<DIV><FONT face=Arial size=2>cron(1) that is called "shouldnotbehere". In our
case it is easy to know by the host name which scheduler the node is using (if
any). If the scheduler does not respond, the script quits. But if the
scheduler does respond, then a list of users running jobs on the node is
generated by calling the appropriate scheduler's query command. A second list
of non-system users on</FONT></DIV>
<DIV><FONT face=Arial size=2>the node is then generated. If you are on the
second list but not on the first, and the process is more than three
minutes</FONT></DIV>
<DIV><FONT face=Arial size=2>old (to rule out a job that started
after the scheduler was queried), then the processes are killed. All
kill(1)</FONT></DIV>
<DIV><FONT face=Arial size=2>commands are logged along with the ps(1) output
for the killed process, so you can show that the wrong things are not being
killed.</FONT></DIV>
<DIV><FONT face=Arial size=2>We never give regular users UIDs under a certain
value, so it is easy to tell "users" from system IDs. That may not be
the</FONT></DIV>
<DIV><FONT face=Arial size=2>case for you.</FONT></DIV>
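<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>As a rough illustration of the comparison described
above (this is a minimal Python sketch, not the actual "shouldnotbehere" script):
the Proc record, the select_orphans name, and the UID cutoff of 1000 are all
assumptions made for the example. It takes the process list and the set of users
the scheduler reports as already-collected inputs, and only shows the selection
step.</FONT></DIV>
<PRE>
```python
from dataclasses import dataclass

# Assumed values for illustration: the UID cutoff is site-specific,
# and 180 seconds corresponds to the "three minutes" age check.
MIN_USER_UID = 1000
MIN_AGE_SECONDS = 180


@dataclass(frozen=True)
class Proc:
    """One process on the node (hypothetical record for this sketch)."""
    pid: int
    user: str
    uid: int
    age_seconds: int


def select_orphans(procs, scheduled_users,
                   min_uid=MIN_USER_UID, min_age=MIN_AGE_SECONDS):
    """Return processes of regular users that the scheduler does not know about.

    procs           -- iterable of Proc records found on the node
    scheduled_users -- set of user names the scheduler reports for this node
    """
    orphans = []
    for p in procs:
        is_regular_user = p.uid >= min_uid        # skip system IDs
        has_job_here = p.user in scheduled_users  # user is on the first list
        old_enough = p.age_seconds > min_age      # skip jobs that just started
        if is_regular_user and old_enough and not has_job_here:
            orphans.append(p)
    return orphans


if __name__ == "__main__":
    procs = [
        Proc(101, "root", 0, 90000),    # system ID, never selected
        Proc(202, "alice", 1500, 4000), # has a scheduled job, kept
        Proc(303, "bob", 1501, 4000),   # no job, old enough, selected
        Proc(404, "carol", 1502, 60),   # no job yet, but too young to judge
    ]
    for p in select_orphans(procs, scheduled_users={"alice"}):
        print(p.pid, p.user)
```
</PRE>
<DIV><FONT face=Arial size=2>A real script would only reach this step after the
scheduler has answered its query, and would log each kill(1) together with the
ps(1) output for the process.</FONT></DIV>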
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>For interactive users on nodes without a scheduler,
a test is made of whether the user is using any pty or tty device and whether
he</FONT></DIV>
<DIV><FONT face=Arial size=2>has a shell running that matches his login shell.
If not, the processes are killed. This is suitable for us because almost all
work (even interactive) is initiated via a job scheduler and the
interactive nodes are only used</FONT></DIV>
<DIV><FONT face=Arial size=2>for small sessions using single-CPU processes.
</FONT></DIV>
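<DIV><FONT face=Arial size=2>The interactive test could be sketched along these
lines in Python. This is only an illustration under assumptions: the function
name and the (tty, command) input format are made up for the example, and it
reads the description as requiring both a terminal and a running login
shell.</FONT></DIV>
<PRE>
```python
import os


def has_interactive_session(procs, login_shell):
    """Decide whether a user looks like a legitimate interactive user.

    procs       -- list of (tty, command) pairs for one user's processes;
                   tty is None when the process has no controlling terminal
    login_shell -- the user's login shell path, e.g. from the passwd entry
    """
    uses_terminal = any(tty is not None for tty, _ in procs)
    shell_name = os.path.basename(login_shell)
    # ps(1) commonly shows a login shell with a leading dash, e.g. "-bash".
    runs_login_shell = any(
        os.path.basename(cmd.lstrip("-")) == shell_name for _, cmd in procs
    )
    return uses_terminal and runs_login_shell


if __name__ == "__main__":
    session = [("pts/3", "-bash"), ("pts/3", "vi")]
    leftover = [(None, "a.out")]
    print(has_interactive_session(session, "/bin/bash"))
    print(has_interactive_session(leftover, "/bin/bash"))
```
</PRE>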
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>So basically, an attempt is made to see if you
should be on the node or not. If not, your processes are killed. This
</FONT></DIV>
<DIV><FONT face=Arial size=2>assumes the odds of you having another job on the
machine at the same time are reasonably low, which on large </FONT></DIV>
<DIV><FONT face=Arial size=2>SMP nodes is not a reasonable
assumption.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>The details would vary for you, I'm sure. But I
thought an outline of the process might be useful.</FONT></DIV></BODY></HTML>