[Moabusers] Resource : neednodes, PBS_NODEFILE vanishes if stagein
requirement is specified
Brady Kimball
bkimball at clusterresources.com
Thu Dec 6 10:27:23 MST 2007
Rishi,
Try using the new configure option (as of TORQUE 2.2.1)
"--enable-force-nodefile". This should remove the check for neednodes
when writing the node file. Let me know if this doesn't work.
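
A minimal sketch of how this could be checked after rebuilding. The rebuild line appears only as a comment; it assumes an unpacked TORQUE 2.2.1 or later source tree and the usual make steps, and is not something confirmed for your site. check_nodefile is a hypothetical helper name, not part of TORQUE. It only reports whether PBS_NODEFILE is set and readable inside a job.

```shell
# Rebuild step (assumption: run from the TORQUE >= 2.2.1 source directory):
#   ./configure --enable-force-nodefile && make && make install

# check_nodefile: hypothetical helper to call at the top of a job script.
# Prints a diagnostic and returns non-zero if PBS_NODEFILE is unset or unreadable.
check_nodefile() {
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set"
        return 1
    fi
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is set but not readable: $PBS_NODEFILE"
        return 1
    fi
    echo "PBS_NODEFILE=$PBS_NODEFILE"
    # One line per allocated host, prefixed with its slot count.
    sort "$PBS_NODEFILE" | uniq -c
    return 0
}
```

In a job that also requests stagein, calling "check_nodefile || exit 1" before the real work should show quickly whether the node file is being written.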
rishi pathak wrote:
> Our configuration is as follows:
> torque version: 2.1.6
> Moab server version 5.1.0p4
> The problem we are facing is that when a job specifies a stagein
> requirement, the PBS_NODEFILE environment variable (the list of allocated
> nodes) is not available to the job. Below is the Moab log for the job:
> 12/06 11:45:51 WARNING: cannot set job '7142.head.compute.in'
> attr 'Resource_List:neednodes' to '' (rc: 15001 'Unknown Job Id')
> 12/06 11:45:51 INFO: job '7142' successfully started
> 12/06 11:45:51 INFO: starting job '7142'
> 12/06 11:45:51 INFO: 1 jobs started on iteration 1
>
> The corresponding pbs_mom log is:
> 12/06/2007 11:38:54;0080; pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=amd16.compute.in
> MSG=modify job failed, unknown job 7142.amd01.head.compute.in),
> aux=0, type=ModifyJob, from PBS_Server at head.compute.in
> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type QueueJob request
> received from PBS_Server at head.compute.in, sock=11
> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type JobScript request
> received from PBS_Server at amd01.npsf.cdac.ernet.in, sock=11
> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type ReadyToCommit request
> received from PBS_Server at head.compute.in, sock=11
> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type Commit request received
> from PBS_Server at head.compute.in, sock=11
> 12/06/2007 11:38:54;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7142.head.compurte.in started, pid = 2687
> 12/06/2007 11:38:54;0100; pbs_mom;Req;;Type StatusJob request
> received from PBS_Server at head.compute.in, sock=10
> 12/06/2007 11:38:54;0080;
> pbs_mom;Job;7142.head.compute.in;scan_for_terminated: job
> 7142.head.compute.in task 1 terminated, sid 2687
> 12/06/2007 11:38:54;0008; pbs_mom;Job;7142.head.compute.in;job was
> terminated
>
> I found a reference to this on the torque mailing list. Below is the
> actual mail content:
> ---------------------BEGIN MAIL-----------------------------------------------
> Garrick Staples garrick at clusterresources.com
> On Wed, Oct 11, 2006 at 07:17:01PM +0200, Åke Sandgren alleged:
> > On Wed, 2006-10-11 at 10:55 -0600, Garrick Staples wrote:
> > > On Wed, Oct 11, 2006 at 08:41:20AM +0200, Åke Sandgren alleged:
> > > > On Tue, 2006-10-10 at 11:58 -0600, Garrick Staples wrote:
> > > > > On Tue, Oct 10, 2006 at 01:33:32PM +0200, Åke Sandgren alleged:
> > > > > > Hi!
> > > > > >
> > > > > > I think this have been adressed before but i can't find any info.
> > > > > >
> > > > > > We are getting loads of
> > > > > > pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> > > > > > REJHOST=i092.hpc2n.umu.se MSG=modify job failed, unknown job
> > > > > > 392438.ingrid-h.hpc2n.umu.se), aux=0, type=ModifyJob, from
> > > > > > PBS_Server at ingrid-i.hpc2n.umu.se
> > > > > >
> > > > > > I think they are related to stage-in/out but exactly what should we be
> > > > > > looking for.
> > > > > >
> > > > > > torque version ranging from 2.0.0p4 to 2.1.2.
> > > > >
> > > > > This happens with every job, right? And you are using maui/moab, right?
> > > > >
> > > > > If so, that is maui/moab reseting the job's neednodes resource after
> > > > > starting the job. This is a work-around for a mythical bug in job
> > > > > starts in OpenPBS that noone has ever been able to demonstrate to me.
> > > >
> > > > It doesn't happen on every job, only those that do explicit stagein/out.
> > > > The attrlist is "resource" and this is what happens...
> > > >
> > > > And yes this is with maui.
> > > > Jobs without the initial CopyFiles request never gets any Modify
> > > > rejects.
> > >
> > > IIRC, it is actually a race condition. stagein and longer prologues
> > > will cause the error message. It is mostly harmless, but there are some
> > > rare bad things. I have a patch for maui if you want (moab has
> > > tuneable, something like NOAUTONEEDNODE).
> >
> > Yes definitely something i want.
> >
> > But isn't this something that should really be done in torque?
> > Shouldn't it get a jobid to the mom before starting stagein?
>
> You'd think so, but no. stagein happens before the job is moved to the
> node. I think the idea is to allow for "pre-stagein".
> ---------------------END MAIL-------------------------------------------------
>
> I just added 'NOAUTONEEDNODE' to moab.cfg. The job starts, but the errors
> are the same and the PBS_NODEFILE environment variable is still absent.
>
> This seems to be a known bug, but I was not able to find much reference
> to it or a solution. I also couldn't find any reference in the Moab
> documentation to the 'NOAUTONEEDNODE' parameter mentioned by Garrick
> Staples.
>
> Is this bug fixed, or is there a workaround for this problem?
>
> --
> Regards--
> Rishi Pathak
> ------------------------------------------------------------------------
>
> _______________________________________________
> moabusers mailing list
> moabusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/moabusers
>