[Moabusers] Moab keeps on trying after pbs_mom rejects.
Justin Bronder
jsbronder at gmail.com
Wed Nov 22 07:59:08 MST 2006
We have a Moab prologue set up on the cluster and have told our users
that when they see a job queued in Torque but running in Moab, it means
the job is currently in the prologue (yes, they could use checkjob -v,
but that hasn't caught on yet). Yesterday I was notified that a job was
continuously bouncing in and out of the prologue.

The problem apparently was that one of the nodes was failing LDAP
lookups, so pbs_mom was rejecting the job. This was easy enough to trace
from the mother superior to the failing node. However, Moab continued to
schedule the job to that node despite the same failure each time.

Is there any way to get Moab to watch for this sort of scenario and
recalculate the hosts it should run the job on? Even marking the
misbehaving node offline did not force Moab to change the node list;
only forcing a recycle from the command line got the job running on a
new hostlist.

Thanks,
Justin.