[Moabusers] Moab keeps on trying after pbs_mom rejects.

Justin Bronder jsbronder at gmail.com
Wed Nov 22 12:12:41 MST 2006


Here is the pbs_server.log with the job in question:

11/22/2006 07:58:49;0008;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;Job Queued at request of jbronder at fawlty.acrl.clusters.umaine.edu, owner = jbronder at fawlty.acrl.clusters.umaine.edu, job name = go.darwin, queue = darwin-admin
11/22/2006 07:59:19;0008;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;Job Modified at request of root at echelon.acrl.clusters.umaine.edu
11/22/2006 07:59:19;0008;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;Job Run at request of root at echelon.acrl.clusters.umaine.edu
11/22/2006 07:59:19;0008;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;Job Modified at request of root at echelon.acrl.clusters.umaine.edu
11/22/2006 07:59:23;0009;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;obit received for job 2835.echelon.acrl.clusters.umaine.edu from host node164.acrl.clusters.umaine.edu with bad state (state: QUEUED)

Finally, I checked back in on my job, noticed what was happening, and
deleted it:
11/22/2006 09:34:01;0008;PBS_Server;Job;2835.echelon.acrl.clusters.umaine.edu;Job deleted at request of root at echelon.acrl.clusters.umaine.edu

Up to this point, my prologue confirms that the job was being sent to it
approximately every minute and was passing through successfully.

We can check with the mother superior:
11/22/2006 07:59:19;0008;   pbs_mom;Job;2835.echelon.acrl.clusters.umaine.edu;Job Modified at request of PBS_Server at echelon.acrl.clusters.umaine.edu
11/22/2006 07:59:23;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 2835.echelon.acrl.clusters.umaine.edu, job_start_error from node 10.0.2.33:15003 in job_start_error

And the problem node:
11/22/2006 07:59:23;0008;   pbs_mom;Job;2835.echelon.acrl.clusters.umaine.edu;No Password Entry for User jbronder
11/22/2006 07:59:23;0008;   pbs_mom;Job;2835.echelon.acrl.clusters.umaine.edu;ERROR:    received request 'ABORT_JOB' from 10.0.2.36:1023 for job '2835.echelon.acrl.clusters.umaine.edu' (job does not exist locally)
11/22/2006 07:59:23;0008;   pbs_mom;Job;2835.echelon.acrl.clusters.umaine.edu;ERROR:    received request 'ABORT_JOB' from 10.0.2.36:1023 for job '2835.echelon.acrl.clusters.umaine.edu' (job does not exist locally)
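
For anyone wanting to spot this kind of rejection without reading each
node's log by hand, here is a minimal Python sketch. It assumes each
pbs_mom log entry is a single line with six semicolon-separated fields
(timestamp, event code, daemon, object type, job id, message), as in the
excerpts above. The function name find_uid_rejections is made up for
illustration and is not part of Torque.

```python
import re
from typing import Iterable, List, Tuple

# Message text taken from the problem node's log excerpt above.
NO_PASSWD = re.compile(r"No Password Entry for User\s+(\S+)")


def find_uid_rejections(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, job id, user) for each 'No Password Entry' entry.

    Assumed layout (inferred from the excerpts, not from Torque docs):
        timestamp;event code;daemon;object type;job id;message
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # not a log entry in the assumed layout
        timestamp, _code, daemon, _kind, job_id, message = fields
        if daemon.strip() != "pbs_mom":
            continue
        match = NO_PASSWD.search(message)
        if match:
            hits.append((timestamp.strip(), job_id.strip(), match.group(1)))
    return hits


if __name__ == "__main__":
    sample = [
        "11/22/2006 07:59:23;0008;   pbs_mom;Job;"
        "2835.echelon.acrl.clusters.umaine.edu;"
        "No Password Entry for User jbronder",
    ]
    for timestamp, job_id, user in find_uid_rejections(sample):
        print(timestamp, job_id, user)
```

Run against the mom log on each node in the job's hostlist, this would
point at the node that cannot resolve the user.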


So, if anything, it appears that pbs_server receives the error but ignores
it because it believes the job should still be in the queued state.  Should
I forward this on to the Torque list?

Here's an excerpt from the Moab log, which confirms that it was calling the
prologue repeatedly.  I have included my canceljob request:

11/22 09:30:44 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:31:14 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:31:44 INFO:     trigger 8607 launched job 2835 /usr/local/sbin/nb_ctl -j $JOBID -h $HOSTLIST -i darwin:acrl-gentoo-darwin-v1
11/22 09:31:44 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:32:14 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:32:44 INFO:     trigger 8607 launched job 2835 /usr/local/sbin/nb_ctl -j $JOBID -h $HOSTLIST -i darwin:acrl-gentoo-darwin-v1
11/22 09:32:44 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:33:14 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:33:44 INFO:     trigger 8607 launched job 2835 /usr/local/sbin/nb_ctl -j $JOBID -h $HOSTLIST -i darwin:acrl-gentoo-darwin-v1
11/22 09:33:44 ERROR:    reservation created for reserved job '2835' (existing reservation '2835' deleted)
11/22 09:34:01 MRMJobCancel(2835,,EMsg,SC)
11/22 09:34:01 MPBSJobCancel(2835,base,CMsg,EMsg,)
11/22 09:34:01 INFO:     job '2835' successfully cancelled
11/22 09:34:01 INFO:     active PBS job 2835 has been removed from the queue.  assuming successful completion
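
To put a number on how often the prologue trigger fired, the launch
entries can be timed with a short Python sketch like the one below. It
assumes each Moab log entry is a single line starting with an
"MM/DD HH:MM:SS" timestamp, as in the excerpt above. The log has no
year, so one is supplied as a parameter. The function name
launch_intervals is made up for illustration and is not a Moab tool.

```python
import re
from datetime import datetime

# Matches entries such as:
#   11/22 09:31:44 INFO:     trigger 8607 launched job 2835 ...
LAUNCH = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO:\s+trigger \d+ launched job (\S+)"
)


def launch_intervals(lines, job_id, year=2006):
    """Return the seconds between consecutive trigger launches of job_id."""
    times = []
    for line in lines:
        match = LAUNCH.match(line)
        if match and match.group(2) == job_id:
            # Prepend the assumed year so the timestamp parses unambiguously.
            stamp = f"{year}/{match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    sample = [
        "11/22 09:31:44 INFO:     trigger 8607 launched job 2835",
        "11/22 09:32:44 INFO:     trigger 8607 launched job 2835",
        "11/22 09:33:44 INFO:     trigger 8607 launched job 2835",
    ]
    print(launch_intervals(sample, "2835"))
```

For the three launch entries shown above, the intervals come out to 60
seconds each, which matches the "every minute" behaviour described.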



Thanks,
Justin.


On 11/22/06, wightman <wightman at clusterresources.com> wrote:
>
> Can you tell from the pbs_server logs whether the server knows which node
> is causing the problem?
>
> - Douglas
>
> On Wed, 2006-11-22 at 11:29 -0500, Justin Bronder wrote:
> > According to the documentation and mschedctl -l, we have the default
> > of five minutes already set.  These jobs are being resubmitted to the
> > prologue script every minute or less.  The prologue signals success,
> > but the nodes reject the job.  On the next iteration of Moab, the job
> > is sent to the prologue again, with the same hostlist.
> >
> > So I would assume that Moab either doesn't know the job is getting
> > rejected, which seems strange as the pbs_moms are correctly reporting
> > errors in their logs, or somehow Moab is failing to realize that we
> > have a misbehaving node.
> >
> > -Justin.
> >
> > On 11/22/06, wightman <wightman at clusterresources.com> wrote:
> >         Have a look at:
> >
> >         http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
> >
> >         When Moab knows which node is causing problems this parameter
> >         will tell Moab to put a reservation on the node, thus taking
> >         it out of the pool of feasible nodes.
> >
> >         - Douglas
> >
> >
> >         On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
> >         > We have a Moab Prologue set up on the cluster and have
> >         > alerted our users that when they see the job queued in
> >         > Torque and running in Moab, it means their job is currently
> >         > in the prologue (yes, they could use checkjob -v, but that
> >         > hasn't caught on yet).  Yesterday I was notified that a job
> >         > was continuously bouncing in and out of the prologue.
> >         >
> >         > The problem apparently was that one of the nodes was failing
> >         > LDAP lookups, so pbs_mom was rejecting the job.  This was
> >         > easy enough to track back from the mother superior and then
> >         > to the failing node.  However, Moab continued to schedule to
> >         > that node despite the same failure each time.
> >         >
> >         > Is there any method to get Moab to watch for this sort of
> >         > scenario and re-calculate the hosts it should run the job
> >         > on?  Even marking the misbehaving node offline did not force
> >         > Moab to change the node list; only forcing a recycle from
> >         > the command line got the job running on a new hostlist.
> >         >
> >         > Thanks,
> >         >
> >         > Justin.