[Moabusers] Resource : neednodes, PBS_NODEFILE vanishes if stagein requirement is specified

Brady Kimball bkimball at clusterresources.com
Thu Dec 6 10:27:23 MST 2007


Rishi,

Try using the new configure option (as of TORQUE 2.2.1) 
"--enable-force-nodefile".  This should remove the check for neednodes 
when writing the node file.  Let me know if this doesn't work.
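
Once pbs_mom has been rebuilt, one way to confirm that the node file is being written is to run a small check from inside a job that requests stagein. The following is only a minimal sketch in Python and is not part of TORQUE or Moab. The helper name read_nodefile is made up for illustration, and it assumes only that PBS_NODEFILE, when set, holds the path to a text file listing one allocated host per line.

```python
import os


def read_nodefile(environ=None):
    """Return the hosts listed in $PBS_NODEFILE, or None if it is unavailable.

    read_nodefile is a hypothetical helper for this check, not a TORQUE API.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("PBS_NODEFILE")
    # The variable may be missing entirely, or may point at a file that
    # was never written; treat both cases as "no node file".
    if not path or not os.path.isfile(path):
        return None
    with open(path) as fh:
        # One host per line; skip blank lines.
        return [line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    hosts = read_nodefile()
    if hosts is None:
        print("PBS_NODEFILE is not set or does not point to a readable file")
    else:
        print("allocated %d slot(s): %s" % (len(hosts), ", ".join(hosts)))
```

Running this as the first step of a stagein job, before and after the rebuild, should show whether the node file is now available to the job.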

rishi pathak wrote:
> Our configuration is as follows:
> torque version: 2.1.6
> Moab server version 5.1.0p4
> The problem we are facing is that when a job specifies a stagein 
> requirement, the PBS_NODEFILE environment variable (the list of 
> allocated nodes) is not available to the job. Below is the moab log 
> for the job:
> 12/06 11:45:51 WARNING:  cannot set job '7142.head.compute.in' attr 
> 'Resource_List:neednodes' to '' (rc: 15001 'Unknown Job Id')
> 12/06 11:45:51 INFO:     job '7142' successfully started
> 12/06 11:45:51 INFO:     starting job '7142'
> 12/06 11:45:51 INFO:     1 jobs started on iteration 1
>
> The corresponding pbs_mom log is:
> 12/06/2007 11:38:54;0080;   pbs_mom;Req;req_reject;Reject reply 
> code=15001(Unknown Job Id REJHOST=amd16.compute.in MSG=modify job 
> failed, unknown job 7142.amd01.head.compute.in), aux=0, 
> type=ModifyJob, from PBS_Server@head.compute.in
> 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type QueueJob request 
> received from PBS_Server@head.compute.in, sock=11
> 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type JobScript request 
> received from PBS_Server@amd01.npsf.cdac.ernet.in, sock=11
> 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type ReadyToCommit request 
> received from PBS_Server@head.compute.in, sock=11
> 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type Commit request received 
> from PBS_Server@head.compute.in, sock=11
> 12/06/2007 11:38:54;0001;   pbs_mom;Job;TMomFinalizeJob3;job 
> 7142.head.compurte.in started, pid = 2687
> 12/06/2007 11:38:54;0100;   pbs_mom;Req;;Type StatusJob request 
> received from PBS_Server@head.compute.in, sock=10
> 12/06/2007 11:38:54;0080;   pbs_mom;Job;7142.head.compute.in;scan_for_terminated: job 
> 7142.head.compute.in task 1 terminated, sid 2687
> 12/06/2007 11:38:54;0008;   pbs_mom;Job;7142.head.compute.in;job was 
> terminated
>
> I found a reference to this on the torque mailing list. Below is the 
> actual mail content:
> ---------------------BEGIN MAIL-----------------------------------------------
> Garrick Staples garrick at clusterresources.com
> On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> > On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > > Hi!
> > > > > >
> > > > > > I think this has been addressed before but I can't find any info.
> > > > > >
> > > > > > We are getting loads of
> > > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > > PBS_Server@ingrid-i.hpc2n.umu.se
> > > > > >
> > > > > > I think they are related to stage-in/out, but exactly what should we be
> > > > > > looking for?
> > > > > >
> > > > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > > > >
> > > > > This happens with every job, right?  And you are using maui/moab, right?
> > > > >
> > > > > If so, that is maui/moab resetting the job's neednodes resource after
> > > > > starting the job.  This is a work-around for a mythical bug in job
> > > > > starts in OpenPBS that no one has ever been able to demonstrate to me.
> > > >
> > > > It doesn't happen on every job, only those that do explicit stagein/out.
> > > > The attrlist is "resource" and this is what happens...
> > > >
> > > > And yes this is with maui.
> > > > Jobs without the initial CopyFiles request never get any Modify
> > > > rejects.
> > >
> > > IIRC, it is actually a race condition.  stagein and longer prologues
> > > will cause the error message.  It is mostly harmless, but there are some
> > > rare bad things.  I have a patch for maui if you want (moab has a
> > > tuneable, something like NOAUTONEEDNODE).
> >
> > Yes, definitely something I want.
> >
> > But isn't this something that should really be done in torque?
> > Shouldn't it get a jobid to the mom before starting stagein?
>
> You'd think so, but no.  stagein happens before the job is moved to the
> node.  I think the idea is to allow for "pre-stagein".
> ---------------------END MAIL-------------------------------------------------
>
> I just added 'NOAUTONEEDNODE' to moab.cfg. The job starts, but the 
> errors are the same and the PBS_NODEFILE environment variable is 
> still absent.
>
> It seems like this is a known bug, but I was not able to find much 
> information about it, or a solution. I also couldn't find any 
> reference in the moab documentation to the 'NOAUTONEEDNODE' parameter 
> mentioned by Garrick Staples.
>
> Has this bug been fixed, or is there a workaround for the problem?
>
> -- 
> Regards--
> Rishi Pathak
