Our configuration is as follows:<br>
torque version: 2.1.6<br>
Moab server version 5.1.0p4<br>
The problem we are facing is that when a job specifies a stagein
requirement, the PBS_NODEFILE environment variable (the list of allocated nodes) is not
available to the job. Below is the Moab log for the job:<br>
12/06 11:45:51 WARNING: cannot set job '<a href="http://7142.head.compute.in">7142.head.compute.in</a>'
attr 'Resource_List:neednodes' to '' (rc: 15001 'Unknown Job Id')<br>
12/06 11:45:51 INFO: job '7142' successfully started<br>
12/06 11:45:51 INFO: starting job '7142'<br>
12/06 11:45:51 INFO: 1 jobs started on iteration 1<br>
<br>
The corresponding pbs_mom log is:<br>
12/06/2007 11:38:54;0080; pbs_mom;Req;req_reject;Reject
reply code=15001(Unknown Job Id REJHOST=<a href="http://amd16.compute.in">amd16.compute.in</a> MSG=modify job
failed, unknown job <a href="http://7142.amd01.head.compute.in">7142.amd01.head.compute.in</a>), aux=0, type=ModifyJob,
from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a><br>
12/06/2007 11:38:54;0100; pbs_mom;Req;;Type QueueJob request received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>, sock=11<br>
12/06/2007 11:38:54;0100; pbs_mom;Req;;Type JobScript
request received from <a href="mailto:PBS_Server@amd01.npsf.cdac.ernet.in">PBS_Server@amd01.npsf.cdac.ernet.in</a>, sock=11<br>
12/06/2007 11:38:54;0100; pbs_mom;Req;;Type ReadyToCommit request received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>, sock=11<br>
12/06/2007 11:38:54;0100; pbs_mom;Req;;Type Commit request received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>, sock=11<br>
12/06/2007 11:38:54;0001; pbs_mom;Job;TMomFinalizeJob3;job <a href="http://7142.head.compurte.in">7142.head.compurte.in</a> started, pid = 2687<br>
12/06/2007 11:38:54;0100; pbs_mom;Req;;Type StatusJob request received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>, sock=10<br>
12/06/2007 11:38:54;0080;
pbs_mom;Job;7142.head.compute.in;scan_for_terminated: job
<a href="http://7142.head.compute.in">7142.head.compute.in</a> task 1 terminated, sid 2687<br>
12/06/2007 11:38:54;0008; pbs_mom;Job;7142.head.compute.in;job was terminated<br>
<br>
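To see how often the ModifyJob reject occurs and for which job IDs, the pbs_mom log could be scanned for the pattern shown above. This is only a sketch: the function name `modifyjob_rejects` is made up for illustration, and it assumes plain-text log lines in the format quoted above, each on a single line and without the link markup that appears in this post.<br>

```python
import re
from collections import Counter

# Matches the reject line quoted above and captures the job ID that the
# mom reported as unknown. The pattern is derived only from the log excerpt
# in this post; other TORQUE versions may word the message differently.
REJECT_RE = re.compile(
    r"reply code=15001\(.*unknown job ([^)\s]+)\).*type=ModifyJob"
)


def modifyjob_rejects(lines):
    """Count 'Unknown Job Id' rejects of ModifyJob requests per job ID.

    Hypothetical helper for illustration; not part of TORQUE or Moab.
    """
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample line taken from the pbs_mom log above, with the HTML removed.
    sample = [
        "12/06/2007 11:38:54;0080; pbs_mom;Req;req_reject;Reject reply "
        "code=15001(Unknown Job Id REJHOST=amd16.compute.in MSG=modify job "
        "failed, unknown job 7142.amd01.head.compute.in), aux=0, "
        "type=ModifyJob, from PBS_Server@head.compute.in",
    ]
    for job_id, count in modifyjob_rejects(sample).items():
        print(job_id, count)
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.<br>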
I found a reference to this on the TORQUE mailing list. Below is the actual mail content:<br>
---------------------------------------BEGIN MAIL--------------------------------------------------------------------<br>
<b>Garrick Staples</b>
<a href="mailto:torqueusers%40supercluster.org?Subject=%5Btorqueusers%5D%20reply%20code%3D15001...&In-Reply-To=1160587021.6100.9.camel%40skutt.ydc.se" title="[torqueusers] reply code=15001...">garrick at clusterresources.com
</a><br>
<pre>On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:<br>><i> On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:<br></i>><i> > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
<br></i>><i> > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:<br></i>><i> > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:<br></i>><i> > > > > Hi!<br></i>
><i> > > > > <br></i>><i> > > > > I think this have been adressed before but i can't find any info.<br></i>><i> > > > > <br></i>><i> > > > > We are getting loads of
<br></i>><i> > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id<br></i>><i> > > > > REJHOST=<a href="http://i092.hpc2n.umu.se">i092.hpc2n.umu.se</a> MSG=modify job failed, unknown job
<br></i>><i> > > > > <a href="http://392438.ingrid-h.hpc2n.umu.se">392438.ingrid-h.hpc2n.umu.se</a>), aux=0, type=ModifyJob, from<br></i>><i> > > > > <a href="http://www.supercluster.org/mailman/listinfo/torqueusers">
PBS_Server at ingrid-i.hpc2n.umu.se</a><br></i>><i> > > > > <br></i>><i> > > > > I think they are related to stage-in/out but exactly what should we be<br></i>><i> > > > > looking for.
<br></i>><i> > > > > <br></i>><i> > > > > torque version ranging from 2.0.0p4 to 2.1.2.<br></i>><i> > > > <br></i>><i> > > > This happens with every job, right? And you are using maui/moab, right?
<br></i>><i> > > > <br></i>><i> > > > If so, that is maui/moab reseting the job's neednodes resource after<br></i>><i> > > > starting the job. This is a work-around for a mythical bug in job
<br></i>><i> > > > starts in OpenPBS that noone has ever been able to demonstrate to me.<br></i>><i> > > <br></i>><i> > > It doesn't happen on every job, only those that do explicit stagein/out.
<br></i>><i> > > The attrlist is "resource" and this is what happens...<br></i>><i> > > <br></i>><i> > > And yes this is with maui.<br></i>><i> > > Jobs without the initial CopyFiles request never gets any Modify
<br></i>><i> > > rejects.<br></i>><i> > <br></i>><i> > IIRC, it is actually a race condition. stagein and longer prologues<br></i>><i> > will cause the error message. It is mostly harmless, but there are some
<br></i>><i> > rare bad things. I have a patch for maui if you want (moab has<br></i>><i> > tuneable, something like NOAUTONEEDNODE).<br></i>><i> <br></i>><i> Yes definitely something i want.<br></i>>
<i> <br></i>><i> But isn't this something that should really be done in torque?<br></i>><i> Shouldn't it get a jobid to the mom before starting stagein?<br></i><br>You'd think so, but no. stagein happens before the job is moved to the
<br>node. I think the idea is to allow for "pre-stagein".<br>---------------------END MAIL-------------------------------------------------<br><br>I just added 'NOAUTONEEDNODE' to moab.cfg; the job starts, but the errors are the same and the PBS_NODEFILE environment variable is still absent.
<br></pre>
<br>
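To confirm from inside the job whether PBS_NODEFILE is reaching it after a configuration change, a check along these lines could be run at the start of the job script. This is only a sketch: the helper name `allocated_nodes` is made up for illustration, and it relies only on the usual TORQUE behaviour of exporting PBS_NODEFILE as the path to a file with one allocated host per line.<br>

```python
import os


def allocated_nodes(env=None):
    """Return the hosts listed in $PBS_NODEFILE, or None if it is unavailable.

    Hypothetical helper for illustration; not part of TORQUE or Moab.
    """
    env = os.environ if env is None else env
    path = env.get("PBS_NODEFILE")
    if not path or not os.path.isfile(path):
        # Variable missing (the symptom described in this post) or the
        # file it points to does not exist on this node.
        return None
    with open(path) as handle:
        # One host name per line; blank lines are skipped.
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    nodes = allocated_nodes()
    if nodes is None:
        print("PBS_NODEFILE is not set or not readable")
    else:
        print("allocated nodes:", ", ".join(nodes))
```

Running it in a job with and without the stagein requirement would show whether the variable goes missing only in the stagein case.<br>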
It seems like this is a known bug, but I was not able to find much
reference (or a solution) on this. Also, I couldn't find any
reference in the Moab documentation for the 'NOAUTONEEDNODE' parameter
mentioned by Garrick Staples.<br>
<br>
Has this bug been fixed, or is there any workaround for this problem?<br>
<br>-- <br>Regards--<br>Rishi Pathak<br>