[From nobody Tue Jan 12 09:49:54 2010
Date: Tue, 12 Jan 2010 09:53:47 -0700 (MST)
From: David Beer <dbeer@adaptivecomputing.com>
Reply-To: David Beer <dbeer@adaptivecomputing.com>
To: Ken Nielson <knielson@adaptivecomputing.com>
Message-ID: <819423164.176073.1263315227628.JavaMail.root@mail.adaptivecomputing.com>
In-Reply-To: <162423198.176053.1263315086405.JavaMail.root@mail.adaptivecomputing.com>
Subject: Fwd: MPI QDel Problem (RT 6690)


----- Forwarded Message -----
From: "David Beer" <dbeer@adaptivecomputing.com>
To: "torqueusers" <torqueusers@supercluster.org>
Sent: Tuesday, January 12, 2010 9:51:26 AM
Subject: MPI QDel Problem (RT 6690)

I'm wondering whether any of you has already run into this problem with MPI jobs.  If so, I would greatly appreciate hearing about it.  I am looking into the kill_delay variable, but I am curious whether anyone has another workaround.

Thanks,

David Beer

----quoted text below----

We experience the following: if a user kills his ParaStation MPI job via
qdel, apparently the following happens:

1. The application gets a SIGTERM.
2. The ParaStation MPI shepherd (psid) starts cleaning up all the
processes started via mpiexec; this might take a minute or two.
3. During this, the MPI shepherd gets a SIGKILL from PBS before the
processes under its control are removed, so it cannot tidy up properly.
4. Orphaned MPI processes are left on the nodes.
5. PBS considers the nodes free again, but ParaStation still
sees the orphaned jobs and says "no good" to the next MPI jobs, which
consequently crash for lack of resources.
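
For illustration, the TERM-then-KILL sequence in steps 1 to 3 can be sketched in a few lines of Python. This is only a minimal sketch of the general pattern, not TORQUE's or ParaStation's actual code; the function name terminate_with_delay and its kill_delay parameter (grace period in seconds) are hypothetical.

```python
import signal
import subprocess


def terminate_with_delay(proc: subprocess.Popen, kill_delay: float) -> int:
    """Send SIGTERM, wait up to kill_delay seconds, then fall back to SIGKILL.

    Returns the number of the signal that ended the process, or 0 if the
    process exited on its own (for example from a SIGTERM handler).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        # Grace period: the process may still be cleaning up its children.
        proc.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        # Grace period is over: SIGKILL cannot be caught, so any cleanup
        # still in progress is cut short at this point.
        proc.kill()
        proc.wait()
    return -proc.returncode if proc.returncode < 0 else 0
```

If the grace period is shorter than the time the shepherd needs, the SIGKILL branch is taken and whatever the shepherd had not yet removed is left behind, which matches steps 3 and 4.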

As a workaround, we've incorporated a check for orphaned processes into
the prologue and epilogue scripts, so we can set the affected nodes
offline to prevent further job crashes.
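
A check of this kind might look roughly like the Python sketch below. The site's actual prologue and epilogue scripts are not shown here, so everything in it is an assumption: the names Proc, list_processes and find_leftovers are hypothetical, the scan assumes a Linux-style /proc, and it assumes the node is used exclusively, so that any process still owned by the job's user after the job ends counts as a leftover.

```python
import os
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    pid: int
    uid: int
    name: str


def list_processes(proc_root: str = "/proc") -> List[Proc]:
    """Enumerate running processes from a Linux-style /proc tree."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                name = fh.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        procs.append(Proc(int(entry), uid, name))
    return procs


def find_leftovers(procs: Iterable[Proc], job_uid: int,
                   ignore_pids: Set[int] = frozenset()) -> List[Proc]:
    """Return processes owned by the job's user, minus explicitly ignored PIDs."""
    return [p for p in procs if p.uid == job_uid and p.pid not in ignore_pids]
```

An epilogue could call find_leftovers(list_processes(), job_uid, {os.getpid()}) and, if the result is non-empty, take the node offline (for example with pbsnodes -o) instead of returning it to the pool.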

We then tried to use the kill_delay variable with a value of 120
seconds to give the MPI shepherd (psid) ample time to do the cleaning
up. This doesn't appear to work, though, as my colleague reports:

>Obviously kill_delay does not work as expected. Again, 28 nodes were
>set offline due to left-over processes in state D disappearing soon
>afterwards.

>Looking into the mother superior's log shows:

>01/05/2010 13:41:44;0008; pbs_mom;Job;113708.jj28b01;Job Modified at request of PBS_Server@jj28b01
>01/05/2010 13:42:23;0001; pbs_mom;Job;TMomFinalizeJob3;job 113708.jj28b01 started, pid = 5580
>01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 5580 task 1 with sig 15
>01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6019 task 1 with sig 15
>01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6084 task 1 with sig 15
>01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 with sig 15
>01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 gracefully with sig 15
>01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 with sig 9
>01/05/2010 14:31:55;0080; pbs_mom;Job;113708.jj28b01;scan_for_terminated: job 113708.jj28b01 task 1 terminated, sid=5580
>01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;job was terminated
>01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
>01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
>01/05/2010 14:32:16;0008; pbs_mom;Job;113708.jj28b01;checking job post-processing routine
>01/05/2010 14:32:16;0080; pbs_mom;Job;113708.jj28b01;obit sent to server
>
>
>I.e., the delay between sending signal 15 and signal 9 to pid 6088 is 5
>seconds, not 240 as expected from Torque's configuration for all the
>queues. Job 113708 was running in queue hpcff, which has
>kill_delay=240, too.
>
>To me it's unclear how terminating a job really works. Which instance
>is responsible for sending the SIGKILL?
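
The 5-second gap can also be read out of the log mechanically. The Python sketch below is only an illustration: the function name term_to_kill_delays is hypothetical, the regular expression is modeled solely on the kill_task entries quoted above, and it assumes each entry sits on a single line, as it would in the pbs_mom log file itself (without the mail quoting and wrapping).

```python
import re
from datetime import datetime
from typing import Dict, Iterable

# Modeled on entries such as:
# 01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 with sig 9
KILL_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});.*"
    r"kill_task: killing pid (\d+) task \d+ (?:gracefully )?with sig (\d+)"
)


def term_to_kill_delays(lines: Iterable[str]) -> Dict[int, float]:
    """Map each pid to the seconds between its first sig 15 and first sig 9."""
    first_term: Dict[int, datetime] = {}
    delays: Dict[int, float] = {}
    for line in lines:
        match = KILL_RE.match(line.strip())
        if not match:
            continue  # not a kill_task entry
        # Assumes month/day/year order; the difference within one day is
        # the same either way.
        when = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        pid, sig = int(match.group(2)), int(match.group(3))
        if sig == 15:
            first_term.setdefault(pid, when)
        elif sig == 9 and pid in first_term and pid not in delays:
            delays[pid] = (when - first_term[pid]).total_seconds()
    return delays
```

Only pids that received both signals appear in the result, so pids that went away after SIGTERM alone are not reported.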

-- 
David Beer | Senior Software Engineer
Adaptive Computing




]