[torquedev] torque jobs are stuck in server queue
krishna ramachandran
ramach1776 at yahoo.com
Mon Jul 21 16:39:41 MDT 2008
I have two small clusters (TORQUE/Moab) with 2 and 8 nodes respectively; every node is configured with np=8. The two clusters are completely independent.
Once in a while, jobs are not dequeued from the TORQUE server even though they completed successfully and the nodes sent their OBIT.
On the 2-node cluster, when jobs fail to dequeue, I consistently see this message in the server log (and in tracejob output):
07/19/2008 17:39:54 S Reject reply code=15001(Unknown Job Id), aux=0, type=JobObituary, from pbs_mom@ac4-int2sav-004.adx.pool.ac4.yahoo.com
On the 8-node cluster we see this instead:
07/19/2008 17:42:55 S Reject reply code=15052(unknown job id after clean init), aux=0, type=JobObituary, from pbs_mom@ac4-int2ctpmynacluster-012.adx.pool.ac4.yahoo.com
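To see how often each rejection occurs and which MOM it comes from, something like the following Python sketch could tally the messages. It is only a rough illustration: the regular expression is modelled on the two messages quoted above, it assumes each message sits on a single line in the real log with the sender written as pbs_mom@host, and the names tally_rejects and tally_file are made up for this example, not part of TORQUE.

```python
import re
from collections import Counter

# Pattern modelled only on the two "Reject reply" messages quoted in this
# post; other TORQUE versions may word the message differently.
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<text>[^)]*)\),\s*"
    r"aux=(?P<aux>-?\d+),\s*type=(?P<type>\w+),\s*from\s+(?P<sender>\S+)"
)


def tally_rejects(lines, req_type="JobObituary"):
    """Count rejected requests of one type, keyed by (reply code, sender)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match and match.group("type") == req_type:
            key = (int(match.group("code")), match.group("sender"))
            counts[key] += 1
    return counts


def tally_file(path):
    """Hypothetical helper: tally rejected obituaries in one log file."""
    with open(path, errors="replace") as handle:
        return tally_rejects(handle)
```

Calling tally_file on a server log (the path depends on the installation) and printing the result of most_common() would show whether the rejections cluster on particular MOM hosts or reply codes.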
We are running TORQUE version 2.3.0-snap.200805071513 in a virtual environment.
Any suggestions on what may cause this?
Krishna