[Moabusers] Moab keeps on trying after pbs_mom rejects.

wightman wightman at clusterresources.com
Wed Nov 22 10:32:55 MST 2006


Can you tell from the pbs_server logs whether the server knows which node
is causing the problem?

- Douglas

On Wed, 2006-11-22 at 11:29 -0500, Justin Bronder wrote:
> According to the documentation and mschedctl -l, we have the default of
> five minutes already set.  These jobs are being resubmitted to the
> prologue script every minute or less.  The prologue signals success, but
> the nodes reject the job.  On the next iteration of Moab, the job is sent
> to the prologue again, with the same hostlist.
> 
> So I would assume that Moab either doesn't know the job is getting
> rejected, which seems strange as the pbs_moms are correctly reporting
> errors in their logs, or somehow Moab is failing to realize that we have
> a misbehaving node.
> 
> -Justin.
> 
> On 11/22/06, wightman <wightman at clusterresources.com> wrote:
>         Have a look at:
>         
>         http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#nodefailurereservetime
>         
>         When Moab knows which node is causing problems this parameter
>         will tell Moab to put a reservation on the node, thus taking it
>         out of the pool of feasible nodes.
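>         
>         As a rough illustration only (the ten-minute value is an arbitrary
>         example, and the HH:MM:SS time format is an assumption to check
>         against the page above), a moab.cfg entry that lengthens this
>         reservation might look like:
>         
>             NODEFAILURERESERVETIME  00:10:00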
>         
>         - Douglas
>         
>         
>         On Wed, 2006-11-22 at 09:59 -0500, Justin Bronder wrote:
>         > We have a Moab Prologue set up on the cluster and have alerted
>         > our users that when they see the job queued in Torque and
>         > running in Moab, it means their job is currently in the prologue
>         > (yes, they could use checkjob -v, but that hasn't caught on yet).
>         > Yesterday I was notified that a job was continuously bouncing in
>         > and out of the prologue.
>         >
>         > The problem apparently was that one of the nodes was failing LDAP
>         > lookups, so pbs_mom was rejecting the job.  This was easy enough
>         > to track back from the mother superior and then to the failing
>         > node.  However, Moab continued to schedule to that node despite
>         > the same failure each time.
>         >
>         > Is there any method to get Moab to watch for this sort of
>         > scenario and re-calculate the hosts it should run the job on?
>         > Even marking the misbehaving node offline did not force Moab to
>         > change the node list; only forcing a recycle from the command
>         > line got the job running on a new hostlist.
>         >
>         > Thanks,
>         >
>         > Justin.
>         > _______________________________________________ 
>         > moabusers mailing list
>         > moabusers at supercluster.org
>         > http://www.supercluster.org/mailman/listinfo/moabusers
>         
> 


