<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 10 (filtered)">
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman";}
a:link, span.MsoHyperlink
        {color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {font-family:Arial;
        color:windowtext;}
@page Section1
        {size:8.5in 11.0in;
        margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
        {page:Section1;}
/* List Definitions */
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Hi!</span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>I saw the following situation on our cluster: one node was
rebooted, and after it came back online it reported that all three jobs, which
were running on it before the reboot, are finished successfully.</span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>So my questions are:</span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
<ol start=1 type=1>
<li class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Suppose a node goes down forever. And suppose that an
application, which submitted it, does ‘qsub –f’ periodically
to find out the status (with keep_completed set). When, if ever, Torque
will report the job as failed?</span></font></li>
<li class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Suppose that node comes back after a while (before
Torque gives up waiting), shouldn’t pbs_mom report all jobs as
failed if they are indeed gone?</span></font></li>
</ol>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Note: we start pbs_mom with –p option, which we
understood as to keep running jobs (if any).</span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Thanks.</span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> </span></font></p>
</div>
</body>
</html>