<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
This thread seems to have died, but my problem has not. <br>
<br>
If I understand the described design purpose correctly (a previous
epilogue attempt fails, so it is tried again), then no two
epilogues for the same job should ever run simultaneously. Yet they
do. So perhaps I'm seeing a different issue from the intentional
retry logic that was described.<br>
<br>
I've also tried, unsuccessfully, to "lock" the first epilogue in place
and abort if that lock is already held. I'm doing this via the
lockfile utility, and for whatever reason it's not effective in
preventing multiple epilogues from launching simultaneously for the same job.<br>
<br>
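For reference, here is a minimal sketch of the kind of per-job lock I mean. It uses flock(2) through Python's fcntl module rather than the lockfile utility I actually tried, and the lock directory, the file naming, and both function names are hypothetical (nothing here comes from Torque itself). It also assumes the lock directory is on a local filesystem.<br>

```python
import fcntl
import os


def try_job_lock(job_id, lock_dir):
    """Try to take an exclusive, non-blocking lock for one job.

    Returns the open file object that holds the lock, or None if
    another epilogue instance already holds it. The file naming
    scheme is an assumption made for this sketch.
    """
    path = os.path.join(lock_dir, "epilogue.%s.lock" % job_id)
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def release_job_lock(handle):
    """Release the lock and close the file that holds it."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

An epilogue built this way would exit early when try_job_lock returns None. The kernel drops the lock when the holding process exits, so a crashed epilogue should not leave a stale lock behind.<br>
<br>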
Let me explain why it's important to me that this doesn't happen: in
the epilogue, I run a health check on a GPU resource, which has a
failure condition if the device is inaccessible. I'm getting loads of
false positive detections simply because the device <i>is</i>
inaccessible while another epilogue is already running a health check.
I can't seem to get effective logic in place to prevent this from
happening (I already check ps info for epilogue processes launched
against the given jobid, and it's only partially effective). My only
option is to disable my health check altogether to prevent the false
positive detection due to conflicting epilogues.<br>
<br>
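One way the health check itself might be serialized, so that a second epilogue waits instead of probing a busy device, is sketched below. This is only an illustration under assumptions: the lock path, the run_serialized name, and the timeout and poll values are hypothetical, and check stands in for whatever callable performs the GPU health check.<br>

```python
import fcntl
import time


def run_serialized(lock_path, check, timeout=60.0, poll=0.5):
    """Run check() while holding an exclusive flock on lock_path.

    Polls for the lock until `timeout` seconds have passed, then
    raises TimeoutError. Returns whatever check() returns.
    """
    deadline = time.monotonic() + timeout
    with open(lock_path, "w") as handle:
        while True:
            try:
                fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                # Someone else holds the lock; give up once the deadline passes.
                if time.monotonic() >= deadline:
                    raise TimeoutError("could not lock %s" % lock_path)
                time.sleep(poll)
        try:
            return check()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

With something like this, a timeout could be reported separately from a real device failure, which would keep the two cases from being confused in the health check result.<br>
<br>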
I want and expect a single epilogue (or epilogue.parallel) instance per
job per node, as the documentation describes. Why is this behavior not
considered a bug?<br>
<br>
Jeremy<br>
<br>
On 2/3/2010 5:49 PM, Jeremy Enos wrote:
<blockquote cite="mid:4B6A0B8E.4070208@ncsa.uiuc.edu" type="cite">
Ok- so there is design behind it. I have two epilogues trampling each
other. What is giving Torque the indication that a job exit failed?
In other words, what constitutes a job exit failure? Perhaps that's
where I should be looking to correct this.<br>
thx-<br>
<br>
Jeremy<br>
<br>
<br>
On 2/3/2010 1:28 PM, Garrick Staples wrote:
<blockquote cite="mid:20100203192814.GN5274@polop.usc.edu" type="cite">
<pre wrap="">On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
</pre>
<blockquote type="cite">
<pre wrap="">that I shouldn't have to. Unless of course this behavior is by design
and not an oversight, and if that's the case- I'd be curious to know why.
</pre>
</blockquote>
<pre wrap="">Because the previous job exit failed and it needs to be done again.
</pre>
<pre wrap=""><fieldset class="mimeAttachmentHeader"></fieldset>
_______________________________________________
torqueusers mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</a>
</pre>
</blockquote>
</blockquote>
</body>
</html>