Hi Brady,<br>
I tested with TORQUE 2.2.1. The node file still does not get created.<br><br><div><span class="gmail_quote">On 12/6/07, <b class="gmail_sendername">Brady Kimball</b> <<a href="mailto:bkimball@clusterresources.com">bkimball@clusterresources.com
</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Rishi,<br><br>Try using the new configure option (as of TORQUE 2.2.1)<br>
"--enable-force-nodefile". This should remove the check for neednodes<br>when writing the node file. Let me know if this doesn't work.<br><br>rishi pathak wrote:<br>> Our configuration is as follows:<br>
> torque version: 2.1.6<br>> Moab server version 5.1.0p4<br>> The problem we are facing is that when a job specifies a stagein<br>> requirement, PBS_NODEFILE(allocated nodes) environment variable is not<br>> available to the
job. Below is the Moab log for the job:<br>> 12/06 11:45:51 WARNING: cannot set job '<a href="http://7142.head.compute.in">7142.head.compute.in</a><br>> <<a href="http://7142.head.compute.in">http://7142.head.compute.in
</a>>' attr 'Resource_List:neednodes' to ''<br>> (rc: 15001 'Unknown Job Id')<br>> 12/06 11:45:51 INFO: job '7142' successfully started<br>> 12/06 11:45:51 INFO: starting job '7142'
<br>> 12/06 11:45:51 INFO: 1 jobs started on iteration 1<br>><br>> corresponding pbs_mom log is :<br>> 12/06/2007 11:38:54;0080; pbs_mom;Req;req_reject;Reject reply<br>> code=15001(Unknown Job Id REJHOST=
<a href="http://amd16.compute.in">amd16.compute.in</a><br>> <<a href="http://amd16.compute.in">http://amd16.compute.in</a>> MSG=modify job failed, unknown job<br>> <a href="http://7142.amd01.head.compute.in">7142.amd01.head.compute.in
</a> <<a href="http://7142.amd01.head.compute.in">http://7142.amd01.head.compute.in</a>>),<br>> aux=0, type=ModifyJob, from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a><br>> <mailto:
<a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>><br>> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type QueueJob request<br>> received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in
</a><br>> <mailto:<a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>>, sock=11<br>> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type JobScript request<br>> received from <a href="mailto:PBS_Server@amd01.npsf.cdac.ernet.in">
PBS_Server@amd01.npsf.cdac.ernet.in</a><br>> <mailto:<a href="mailto:PBS_Server@amd01.npsf.cdac.ernet.in">PBS_Server@amd01.npsf.cdac.ernet.in</a>>, sock=11<br>> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type ReadyToCommit request
<br>> received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a><br>> <mailto:<a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>>, sock=11<br>> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type Commit request received
<br>> from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a> <mailto:<a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>>,<br>> sock=11<br>> 12/06/2007 11:38:54;0001; pbs_mom;Job;TMomFinalizeJob3;job
<br>> <a href="http://7142.head.compurte.in">7142.head.compurte.in</a> <<a href="http://7142.head.compurte.in">http://7142.head.compurte.in</a>> started, pid = 2687<br>> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type StatusJob request
<br>> received from <a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a><br>> <mailto:<a href="mailto:PBS_Server@head.compute.in">PBS_Server@head.compute.in</a>>, sock=10<br>> 12/06/2007 11:38:54;0080;
<br>> pbs_mom;Job;7142.head.compute.in;scan_for_terminated: job<br>> <a href="http://7142.head.compute.in">7142.head.compute.in</a> <<a href="http://7142.head.compute.in">http://7142.head.compute.in</a>> task 1 terminated,
<br>> sid 2687<br>> 12/06/2007 11:38:54;0008; pbs_mom;Job;7142.head.compute.in;job was<br>> terminated<br>><br>> I found a reference to this on the TORQUE mailing list. Below is the<br>> actual mail content:
<br>> ---------------------------------------BEGIN<br>> MAIL--------------------------------------------------------------------<br>> *Garrick Staples* garrick at <a href="http://clusterresources.com">clusterresources.com
</a><br>> <mailto:<a href="mailto:torqueusers%40supercluster.org?Subject=%5Btorqueusers%5D%20reply%20code%3D15001...&In-Reply-To=1160587021.6100.9.camel%40skutt.ydc.se">torqueusers%40supercluster.org?Subject=%5Btorqueusers%5D%20reply%20code%3D15001...&In-Reply-To=1160587021.6100.9.camel%40skutt.ydc.se
</a>><br>> On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:<br>> >/ On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:<br>> />/ > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
<br>><br>> />/ > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:<br>> />/ > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:<br>> />/ > > > > Hi!
<br>> /<br>> >/ > > > ><br>> />/ > > > > I think this have been adressed before but i can't find any info.<br>> />/ > > > ><br>> />/ > > > > We are getting loads of
<br>><br>> />/ > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id<br>> />/ > > > > REJHOST=<a href="http://i092.hpc2n.umu.se">i092.hpc2n.umu.se</a> <<a href="http://i092.hpc2n.umu.se">
http://i092.hpc2n.umu.se</a>> MSG=modify job failed, unknown job<br>><br>> />/ > > > > <a href="http://392438.ingrid-h.hpc2n.umu.se">392438.ingrid-h.hpc2n.umu.se</a> <<a href="http://392438.ingrid-h.hpc2n.umu.se">
http://392438.ingrid-h.hpc2n.umu.se</a>>), aux=0, type=ModifyJob, from<br>> />/ > > > ><br>> PBS_Server at <a href="http://ingrid-i.hpc2n.umu.se">ingrid-i.hpc2n.umu.se</a> <<a href="http://www.supercluster.org/mailman/listinfo/torqueusers">
http://www.supercluster.org/mailman/listinfo/torqueusers</a>><br>> />/ > > > ><br>> />/ > > > > I think they are related to stage-in/out but exactly what should we be<br>> />/ > > > > looking for.
<br>><br>> />/ > > > ><br>> />/ > > > > torque version ranging from 2.0.0p4 to 2.1.2.<br>> />/ > > ><br>> />/ > > > This happens with every job, right? And you are using maui/moab, right?
<br>><br>> />/ > > ><br>> />/ > > > If so, that is maui/moab reseting the job's neednodes resource after<br>> />/ > > > starting the job. This is a work-around for a mythical bug in job
<br>><br>> />/ > > > starts in OpenPBS that noone has ever been able to demonstrate to me.<br>> />/ > ><br>> />/ > > It doesn't happen on every job, only those that do explicit stagein/out.
<br>><br>> />/ > > The attrlist is "resource" and this is what happens...<br>> />/ > ><br>> />/ > > And yes this is with maui.<br>> />/ > > Jobs without the initial CopyFiles request never gets any Modify
<br>><br>> />/ > > rejects.<br>> />/ ><br>> />/ > IIRC, it is actually a race condition. stagein and longer prologues<br>> />/ > will cause the error message. It is mostly harmless, but there are some
<br>><br>> />/ > rare bad things. I have a patch for maui if you want (moab has<br>> />/ > tuneable, something like NOAUTONEEDNODE).<br>> />/<br>> />/ Yes definitely something i want.<br>
> /><br>> /<br>> />/ But isn't this something that should really be done in torque?<br>> />/ Shouldn't it get a jobid to the mom before starting stagein?<br>> /<br>> You'd think so, but no. stagein happens before the job is moved to the
<br>><br>> node. I think the idea is to allow for "pre-stagein".<br>> ---------------------END MAIL-------------------------------------------------<br>><br>>
I just added 'NOAUTONEEDNODE' to moab.cfg and the job starts, but the
errors are the same and the PBS_NODEFILE environment variable is still absent.<br>><br>><br>><br>> It seems like this is a known bug, but I was not able to find much<br>> reference (or a solution) on this. Also, I couldn't find any
<br>> reference in the Moab documentation for the 'NOAUTONEEDNODE' parameter<br>> mentioned by Garrick Staples.<br>><br>> Is this bug fixed, or is there any workaround for this problem?<br>><br>> --<br>> Regards--
<br>> Rishi Pathak<br>> ------------------------------------------------------------------------<br>><br>> _______________________________________________<br>> moabusers mailing list<br>> <a href="mailto:moabusers@supercluster.org">
moabusers@supercluster.org</a><br>> <a href="http://www.supercluster.org/mailman/listinfo/moabusers">http://www.supercluster.org/mailman/listinfo/moabusers</a><br>><br><br><br></blockquote></div><br><br clear="all">
<br>-- <br>Regards--<br>Rishi Pathak<br>National PARAM Supercomputing Facility<br>Center for Development of Advanced Computing (C-DAC)<br>Pune University Campus, Ganesh Khind Road<br>Pune-Maharashtra