<html><body><div style="color:#000; background-color:#fff; font-family:times new roman, new york, times, serif;font-size:12pt"><div>Hello All,</div><div> </div><div>I am still puzzled by a job that does not start when its scheduled time arrives. It only affects a repeating job on one queue, which re-qsubs itself at the end of each run at 10- or 30-minute intervals. A couple of times a week, the job gets stuck in state Q. It always happens during work hours, mostly before 3pm, and often around the supposedly slow lunch hour. In the server_logs, there is an odd entry a minute or two before the scheduled start:</div><div> </div><div>07/09/2012 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of <a href="mailto:rpt_prod@naboo.linnbenton.edu">rpt_prod@naboo.linnbenton.edu</a></div><div> </div><div>qstat shows Hold_Types changing from n to o. When this happens, we simply issue
qrun on the stuck job. We average about 1000 qsubs per day, mostly using two queues (most are small jobs, 1 minute or less). Restarting TORQUE weekly did not help. We have a busy but very simple TORQUE 2.5.6 environment (no external nodes/users, all local in a VM host under Oracle VM 2.2.2):</div><div> </div><div># uname -a<br>Linux naboo.linnbenton.edu 2.6.18-274.7.1.0.1.el5 #1 SMP Thu Oct 20 22:20:30 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux<br></div><div># qstat -q</div><div>server: naboo.linnbenton.edu</div><div>Queue Memory CPU Time Walltime Node Run Que Lm State<br>---------------- ------ -------- -------- ---- --- --- -- -----<br>sys_ban -- --
-- -- 1 17 1 E R<br>sys_srv -- -- -- -- 8 8 10 E R<br>sys_tst -- -- -- -- 0 4 1 E R<br>sys_ban_quick -- -- -- -- 0 0 1 E
R<br> ----- -----<br> 9 29<br># qmgr -c "list que sys_ban"<br>Queue sys_ban<br> queue_type = Execution<br> max_queuable = 300<br> total_jobs = 19<br> state_count = Transit:0 Queued:0 Held:0 Waiting:18
Running:0 Exiting:0<br> max_running = 1<br> resources_default.nodes = 1<br> resources_default.walltime = 168:00:00<br> mtime = Sat Jul 28 01:36:45 2012<br> resources_assigned.nodect = 0<br> enabled = True<br> started = True</div><div> </div><div># ps -ef|grep pbs<br>root 8860 1 0 Jul27 ? 00:03:32 /usr/local/sbin/pbs_mom<br>root 8865 1 0 Jul27 ? 00:00:44 /usr/local/sbin/pbs_server<br>root 8867 1 0
Jul27 ? 00:00:15 /usr/local/sbin/pbs_sched<br></div><div>During installs, I run:</div><div>./configure --enable-docs --disable-dependency-tracking --disable-libtool-lock --with-scp # USED SINCE 2.4.5<br><br>We've upgraded several times and I am running out of ideas, so if you have a similar environment that works, I would love to see your settings. For example, what options did you pass to 'configure'?</div><div> </div><div>It was <span id="misspell-33"><span>suggested</span></span> that I use gdb on MOM, but I have not installed gdb yet.<br></div><div>Thank you, Sam.<var id="yui-ie-cursor"></var></div></div></body></html>