[torqueusers] pbs_mom communication problem causes job deletion
sjf4 at uw.edu
Wed Mar 28 16:41:25 MDT 2012
I'm running Torque 2.5.10 and Moab 6.0.8. Every few weeks to every few months, I see a period of job deletions that seem to be caused by a communication problem between pbs_mom and pbs_server. A job that has been run and preempted successfully dozens of times over the course of days is preempted once more and starts running again, but this time something goes wrong between pbs_server and pbs_mom. pbs_server deletes the job (immediately after the delete, qstat <jobid> returns unknown job). pbs_mom receives the job deletion, but the job keeps running anyway. The most inconvenient part is that Moab doesn't understand what has happened and waits 15 minutes for the job start to return.
I have some reason to believe this is caused by stdout files larger than 1MB or so, but that's not always the case. In most cases, however, it seems that the transfer of the stdout file from pbs_server to pbs_mom causes some necessary subsequent connection to time out.
You can find level 10 pbs_mom logs and level 7 pbs_server logs at the URL below. Job 527014 has a small stdout file (28KB) and fails in the same way as job 526003, which has a large stdout file (several MB). The mom logs cover the job start. The server logs cover the period from the preemption to the end of the job start.
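For anyone who wants to check the stdout-size theory on their own nodes, here is a rough sketch (not something from my setup) that lists spool files above a size threshold so they can be compared against the jobs that fail. The function name large_spool_files and the 1MB default are just illustrative, and the spool directory has to be passed in because its location depends on how Torque was installed.

```python
import os

# Threshold taken from the "larger than 1MB or so" observation above; adjust freely.
ONE_MB = 1024 * 1024


def large_spool_files(spool_dir, threshold=ONE_MB):
    """Return (filename, size_in_bytes) pairs for regular files in spool_dir
    that are larger than threshold, largest first.

    spool_dir is an assumption: pass whatever directory your pbs_mom uses
    to spool job stdout/stderr.
    """
    results = []
    for entry in os.scandir(spool_dir):
        # Only consider regular files; skip directories and symlinks.
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size > threshold:
                results.append((entry.name, size))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Running this on a compute node just before a preempted job restarts would show whether the affected jobs line up with large spool files. Given that job 527014 fails with only 28KB of output, a match would be a hint rather than proof.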
I'd greatly appreciate anyone with any information speaking up. Thanks,