<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Hello Garrick and others,<br>
<br>
We are running this version of maui and torque:<br>
maui-3.2.6p19<br>
torque-2.1.8<br>
<br>
And we see these 15001 errors all the time. Sometimes the job starts
immediately after the error appears in the pbs_mom log, but at other
times the job never starts and simply fails.<br>
<br>
It definitely smells like a race condition, as you mentioned.<br>
Do you know if the patch you sent one year ago is already included in
some recent maui version?<br>
<br>
thanks a lot,<br>
Gonzalo<br>
<br>
Garrick Staples wrote:
<blockquote cite="mid20061011171937.GB22045@login" type="cite">
<pre wrap="">On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
</pre>
<blockquote type="cite">
<pre wrap="">On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
</pre>
<blockquote type="cite">
<pre wrap="">On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
</pre>
<blockquote type="cite">
<pre wrap="">Hi!
I think this has been addressed before, but I can't find any info.
We are getting loads of
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
<a class="moz-txt-link-abbreviated" href="mailto:PBS_Server@ingrid-i.hpc2n.umu.se">PBS_Server@ingrid-i.hpc2n.umu.se</a>
I think they are related to stage-in/out, but what exactly should we be
looking for?
Torque versions range from 2.0.0p4 to 2.1.2.
</pre>
</blockquote>
<pre wrap="">This happens with every job, right? And you are using maui/moab, right?
If so, that is maui/moab resetting the job's neednodes resource after
starting the job. This is a work-around for a mythical bug in job
starts in OpenPBS that no one has ever been able to demonstrate to me.
</pre>
</blockquote>
<pre wrap="">It doesn't happen on every job, only those that do explicit stagein/out.
The attrlist is "resource" and this is what happens...
And yes this is with maui.
Jobs without the initial CopyFiles request never get any Modify
rejects.
</pre>
</blockquote>
<pre wrap="">IIRC, it is actually a race condition. Stagein and longer prologues
will cause the error message. It is mostly harmless, but there are some
rare bad things. I have a patch for maui if you want (moab has a
tunable, something like NOAUTONEEDNODE).
</pre>
</blockquote>
<pre wrap="">Yes, definitely something I want.
But isn't this something that should really be done in torque?
Shouldn't it get a jobid to the mom before starting stagein?
</pre>
</blockquote>
<pre wrap="">
You'd think so, but no. Stagein happens before the job is moved to the
node. I think the idea is to allow for "pre-stagein".
</pre>
<pre wrap="">
<hr size="4" width="90%">
Index: src/moab/MPBSI.c
===================================================================
RCS file: /usr/local/nfs/src/cvs_repository/maui/src/moab/MPBSI.c,v
retrieving revision 1.14
diff -u -r1.14 MPBSI.c
--- src/moab/MPBSI.c        5 Nov 2005 02:42:08 -0000        1.14
+++ src/moab/MPBSI.c        23 May 2006 01:50:11 -0000
@@ -1792,6 +1792,7 @@
return(FAILURE);
}
+/*
if (MPBSJobModify(
J,
R,
@@ -1826,6 +1827,7 @@
J->Name,
HostList);
}
+*/
}
else
{
@@ -1904,7 +1906,7 @@
MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName);
- rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+ rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
if (rc != 0)
{
@@ -1928,6 +1930,7 @@
JobStartFailed = TRUE;
}
+/*
if (J->NeedNodes != NULL)
{
if (MPBSJobModify(
@@ -1949,6 +1952,7 @@
J->NeedNodes);
}
}
+*/
if (JobStartFailed == TRUE)
{
</pre>
<pre wrap="">
<hr size="4" width="90%">
_______________________________________________
torqueusers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a>
<a class="moz-txt-link-freetext" href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</a>
</pre>
</blockquote>
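<br>
To make the ordering described in the quoted thread concrete, here is a
toy model in C. It is only a sketch, not Torque or Maui code: the names
toy_job_state, toy_mom_modify_job and toy_run_then_modify are
hypothetical, and the three states are a simplification assumed for
illustration. The only value taken from the thread is the 15001 reply
code.<br>
<pre wrap="">
```c
/* Toy model (NOT Torque or Maui source): why a ModifyJob request that
 * reaches pbs_mom before the job itself is rejected with 15001. */

/* Reply code quoted in the pbs_mom log line (Unknown Job Id). */
#define TOY_UNKNOWN_JOB_ID 15001

/* Hypothetical, simplified view of where a job is. */
enum toy_job_state {
    TOY_QUEUED,      /* not yet started by the scheduler */
    TOY_STAGING_IN,  /* run requested, stagein still in progress */
    TOY_ON_MOM       /* job has been moved to the node */
};

/* Returns 0 if the mom would accept a ModifyJob, 15001 otherwise.
 * Assumption: the mom only knows the job once it is on the node. */
int toy_mom_modify_job(enum toy_job_state state)
{
    return (state == TOY_ON_MOM) ? 0 : TOY_UNKNOWN_JOB_ID;
}

/* Models the scheduler sequence "run the job, then modify it at once".
 * A nonzero has_stagein means the job is still staging in when the
 * follow-up modify arrives. */
int toy_run_then_modify(int has_stagein)
{
    enum toy_job_state state = TOY_QUEUED;

    /* Step 1: run request. With stagein the job is not on the node yet. */
    state = has_stagein ? TOY_STAGING_IN : TOY_ON_MOM;

    /* Step 2: immediate follow-up modify of the neednodes resource. */
    return toy_mom_modify_job(state);
}
```
</pre>
Under these assumptions, toy_run_then_modify(1) returns 15001 and
toy_run_then_modify(0) returns 0, which matches the observation that
only jobs with explicit stagein see the reject. The patch quoted above
takes the route of dropping the follow-up modify and passing the host
list directly to pbs_runjob, rather than reordering the two steps.<br>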
</body>
</html>