[torquedev] Some jobs not starting with Torque 2.3.1 and Moab
Chris Samuel
csamuel at vpac.org
Sat Jul 5 03:27:54 MDT 2008
Hi there,
I'm not sure if this is a Torque or Moab bug or just the result
of a change in interaction between the two, so I'm reporting this
to both. :-)
Torque 2.3.1 official release.
# moab --version
moab server version 5.2.3 (revision 10590)
We have a number of jobs that are not starting and are ending
up in BatchHold due to repeated failures. They are all logging
similar information:
Message[30] cannot start job on reserved resources - job cannot be started on RM base - cannot set hostlist: cannot set job '472817.tango-m.vpac.org' attr 'Resource_List:neednodes' to 'tango048' - job may have been removed externally (rc: 15001 'Unknown Job Id')
There are no log messages on tango048 relating to this job:
[root at tango048 ~]# grep 472817 /usr/spool/PBS/mom_logs/20080*
[root at tango048 ~]# grep 472817 /var/log/messages
[root at tango048 ~]#
The PBS server log has this message:
07/05/2008 18:08:35;0008;PBS_Server;Job;472817.tango-m.vpac.org;MOM rejected modify request, error: 15001
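For anyone wanting to pull entries like this out of a larger server log, here is a minimal Python sketch. It assumes the semicolon-separated layout seen in the line above (timestamp, event code, daemon, object type, object name, message). The field names and the helper names `parse_server_log_line` and `mom_rejections` are my own labels for illustration, not anything Torque defines.

```python
from typing import Iterable, Iterator, NamedTuple


class ServerLogEntry(NamedTuple):
    # Field names are illustrative labels, not official Torque terminology.
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_server_log_line(line: str) -> ServerLogEntry:
    """Split one semicolon-separated server log line into its six fields."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected log line layout: {line!r}")
    return ServerLogEntry(*parts)


def mom_rejections(lines: Iterable[str], job_id: str) -> Iterator[ServerLogEntry]:
    """Yield entries for job_id whose message mentions a MOM rejection."""
    for line in lines:
        try:
            entry = parse_server_log_line(line)
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if entry.obj_name == job_id and "MOM rejected" in entry.message:
            yield entry
```

Feeding it the open server log file and the full job id (for example `472817.tango-m.vpac.org`) would yield each matching entry in order.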
[root at tango-m ~]# tracejob -n 2 -q 472817
Job: 472817.tango-m.vpac.org
07/04/2008 19:54:44 S enqueuing into run_1_month, state 1 hop 1
07/04/2008 19:54:44 S Job Queued at request of XXXX at tango.vpac.org, owner = XXXX at tango.vpac.org, job name = box_N3000, queue = run_1_month
07/04/2008 19:54:44 A queue=run_1_month
07/04/2008 19:54:50 S Job Run at request of root at tango-m.vpac.org
07/04/2008 19:54:58 S unable to run job, MOM rejected/rc=2
07/05/2008 11:11:14 S Holds uso released at request of root at tango-m.vpac.org
07/05/2008 11:11:19 S Job Modified at request of root at tango-m.vpac.org
07/05/2008 11:11:19 S MOM rejected modify request, error: 15001
07/05/2008 16:26:52 S Holds uso released at request of root at tango-m.vpac.org
07/05/2008 17:30:21 S Holds uso released at request of root at tango-m.vpac.org
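To check whether other held jobs show the same pattern, the tracejob output can be filtered for the MOM rejections. This is only a rough Python sketch: it assumes each event line looks like the ones above (date, time, a one-letter source code, then the message), and `mom_rejected_events` is a hypothetical helper name of mine.

```python
import re
from typing import List, Tuple

# Assumed layout: "MM/DD/YYYY HH:MM:SS <source letter> <message>"
TRACE_LINE = re.compile(r"^(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2})\s+(\S)\s+(.*)$")


def mom_rejected_events(text: str) -> List[Tuple[str, str, str]]:
    """Return (date, time, message) for each line mentioning 'MOM rejected'."""
    events = []
    for line in text.splitlines():
        match = TRACE_LINE.match(line.strip())
        if match and "MOM rejected" in match.group(4):
            events.append((match.group(1), match.group(2), match.group(4)))
    return events
```

Run against the output above, this would pick out the failed run request on 07/04 and the rejected modify request on 07/05.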
I'm wondering if the recent changes to unbreak pbs_mom
reporting non-existent jobs have changed an assumption
that Moab was making?
I'm at something of a loss; any ideas?
cheers!
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency