<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/3.3.2">
</HEAD>
<BODY>
<FONT SIZE="2">Hello All</FONT><BR>
<BR>
<FONT SIZE="2">On a fairly regular basis, my 60-compute-node Linux cluster has a job or two that stays queued indefinitely.</FONT><BR>
<FONT SIZE="2">I generally notice this when I submit 30-40 jobs at once, with each job requesting 1 node and 1 CPU.</FONT><BR>
<FONT SIZE="2">Most of the jobs run just fine within a few minutes, but one or two always "hang" and stay queued until I qrun them by hand.</FONT><BR>
<FONT SIZE="2">Can anyone help me figure out why this keeps happening?</FONT><BR>
Every time this happens, I see the following entry in the Torque log right after my troublesome job:<BR>
<BR>
<FONT SIZE="2">08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004</FONT><BR>
<BR>
Does the above mean my PBS_SERVER crashed for an instant?<BR>
Why, when the system comes back online (if in fact there is a hiccup), doesn't my job just get run?<BR>
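<BR>
<FONT SIZE="2">Error 111 is ECONNREFUSED on Linux, so one check I can run the next time this happens is whether anything is accepting connections on port 15004. Below is a minimal Python sketch of that probe. The function name and the default host are my own assumptions (it assumes the scheduler runs on the same machine as the server); only the port number comes from the log message.</FONT><BR>
<PRE>
```python
import errno
import socket


def scheduler_port_status(host="localhost", port=15004, timeout=2.0):
    """Return "open", "refused", or "error: ..." for one TCP connect attempt."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success, otherwise an errno value
        code = sock.connect_ex((host, port))
    except OSError as exc:
        # e.g. the host name could not be resolved
        return "error: %s" % exc
    finally:
        sock.close()
    if code == 0:
        return "open"
    if code == errno.ECONNREFUSED:
        return "refused"
    return "error: %s" % errno.errorcode.get(code, str(code))


if __name__ == "__main__":
    # Hypothetical usage: run on the server host while jobs are being submitted
    print(scheduler_port_status())
```
</PRE>
<FONT SIZE="2">A result of "refused" would only say that nothing was listening on that port at that moment; it would not by itself explain why the job stays deferred afterwards.</FONT><BR>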
<BR>
<BR>
<FONT SIZE="2">An example case is shown below.</FONT><BR>
<BR>
<FONT SIZE="2">Thanks for any help!</FONT><BR>
<FONT SIZE="2">Dan</FONT><BR>
<BR>
<BR>
<FONT SIZE="2">Job id Name User Time Use S Queue</FONT><BR>
<FONT SIZE="2">---------------- ---------------- ---------------- -------- - -----</FONT><BR>
<FONT SIZE="2">40003.tucslnxc1- TL059 nm31306 0 Q ghts </FONT><BR>
<FONT SIZE="2">[root@tucslnxc1-b log]# </FONT><BR>
<BR>
<BR>
<BR>
<FONT SIZE="2">[root@tucslnxc1-b log]# checkjob -v 40003</FONT><BR>
<BR>
<BR>
<FONT SIZE="2">checking job 40003 (RM job '40003.tu¨Ô')</FONT><BR>
<BR>
<FONT SIZE="2">State: Idle EState: Deferred</FONT><BR>
<FONT SIZE="2">Creds: user:nm31306 group:compchem class:ghts qos:medium</FONT><BR>
<FONT SIZE="2">WallTime: 00:00:00 of 99:23:59:59</FONT><BR>
<FONT SIZE="2">SubmitTime: Wed Aug 23 12:31:25</FONT><BR>
<FONT SIZE="2"> (Time Queued Total: 17:04:10 Eligible: 00:00:00)</FONT><BR>
<BR>
<FONT SIZE="2">StartDate: -00:00:54 Thu Aug 24 05:34:41</FONT><BR>
<FONT SIZE="2">Total Tasks: 1</FONT><BR>
<BR>
<FONT SIZE="2">Req[0] TaskCount: 1 Partition: ALL</FONT><BR>
<FONT SIZE="2">Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0</FONT><BR>
<FONT SIZE="2">Opsys: [NONE] Arch: [NONE] Features: [NONE]</FONT><BR>
<FONT SIZE="2">Exec: '' ExecSize: 0 ImageSize: 0</FONT><BR>
<FONT SIZE="2">Dedicated Resources Per Task: PROCS: 1</FONT><BR>
<FONT SIZE="2">NodeAccess: SHARED</FONT><BR>
<FONT SIZE="2">NodeCount: 0</FONT><BR>
<BR>
<BR>
<FONT SIZE="2">IWD: [NONE] Executable: [NONE]</FONT><BR>
<FONT SIZE="2">Bypass: 0 StartCount: 18</FONT><BR>
<FONT SIZE="2">PartitionMask: [ALL]</FONT><BR>
<FONT SIZE="2">SystemQueueTime: Thu Aug 24 05:34:40</FONT><BR>
<BR>
<FONT SIZE="2">Flags: RESTARTABLE</FONT><BR>
<BR>
<FONT SIZE="2">job is deferred. Reason: RMFailure (job cannot be started - cannot set hostlist)</FONT><BR>
<FONT SIZE="2">Holds: Defer (hold reason: RMFailure)</FONT><BR>
<FONT SIZE="2">PE: 1.00 StartPriority: 1010</FONT><BR>
<FONT SIZE="2">cannot select job 40003 for partition DEFAULT (job hold active)</FONT><BR>
<BR>
<FONT SIZE="2">I see this as well:</FONT><BR>
<BR>
<FONT SIZE="2">[root@tucslnxc1-b log]# ls -l /var/spool/torque/server_priv/jobs/</FONT><BR>
<FONT SIZE="2">total 8</FONT><BR>
<FONT SIZE="2">-rw------- 1 root root 2659 Aug 23 12:31 40003.tucsl.JB</FONT><BR>
<FONT SIZE="2">-rw------- 1 root root 196 Aug 23 12:31 40003.tucsl.SC</FONT><BR>
<BR>
<BR>
<BR>
<FONT SIZE="2">The Torque log shows nothing regarding this job beyond the following:</FONT><BR>
<BR>
<FONT SIZE="2">08/23/2006 12:31:25;0100;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:25;0008;PBS_Server;Job;40003.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306@tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306@tucslnxc1-b.tuc.pharma.aventis.com, job name = TL059, queue = ghts</FONT><BR>
<FONT SIZE="2"><B>08/23/2006 12:31:25;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004</B></FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:26;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 192.168.1.5:1023 (address not trusted)</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=10</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Req;;Type QueueJob request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Req;;Type JobScript request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Req;;Type ReadyToCommit request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Req;;Type Commit request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=9</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0100;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;enqueuing into ghts, state 1 hop 1</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0008;PBS_Server;Job;40004.tucslnxc1-b.tuc.pharma.aventis.com;Job Queued at request of nm31306@tucslnxc1-b.tuc.pharma.aventis.com, owner = nm31306@tucslnxc1-b.tuc.pharma.aventis.com, job name = TL026, queue = ghts</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:35;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004</FONT><BR>
<FONT SIZE="2">08/23/2006 12:31:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from nm31306@tucslnxc1-b.tuc.pharma.aventis.com, sock=10</FONT><BR>
<BR>
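<FONT SIZE="2">To see how often this message directly follows a newly queued job, a rough Python sketch like the one below could scan the server log. It assumes the log file is read unwrapped, one entry per line, with the semicolon-separated fields seen in the excerpt above. The function name is only illustrative, and the sketch simply pairs each failure message with the most recently queued job.</FONT><BR>
<PRE>
```python
import sys


def jobs_followed_by_sched_failure(lines):
    """Return ids of jobs that were the most recently queued job when a
    scheduler-contact failure was logged."""
    flagged = []
    last_queued = None
    for line in lines:
        # Entry layout: timestamp;code;daemon;class;object;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # skip anything that is not a complete entry
        kind, name, message = fields[3], fields[4], fields[5]
        if kind == "Job" and message.startswith("Job Queued"):
            last_queued = name
        elif "Could not contact Scheduler" in message and last_queued:
            flagged.append(last_queued)
            last_queued = None
    return flagged


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python this_script.py LOGFILE
    with open(sys.argv[1]) as handle:
        for job in jobs_followed_by_sched_failure(handle):
            print(job)
```
</PRE>
<FONT SIZE="2">Comparing that list against the jobs that actually stayed queued would show whether the message appears only for the stuck jobs or for every submission.</FONT><BR>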
<BR>
<BR>
<FONT SIZE="2">In my maui.log file, I see that the start priority of the job keeps increasing. For example:</FONT><BR>
<FONT SIZE="2">List,0)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: 1 PBS jobs detected on RM tucslnxc1-b.tuc.pharma.aventis.com</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: jobs detected: 1</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MStatClearUsage(node,Active)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MClusterUpdateNodeState()</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: job '40003' Priority: 1521</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 521(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MStatClearUsage([NONE],Active)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MResDestroy(NULL)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: total jobs selected (ALL): 0/1 [EState: 1]</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: job '40003' Priority: 1521</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 521(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MStatClearUsage([NONE],Idle)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 MResDestroy(NULL)</FONT><BR>
<FONT SIZE="2">08/24 05:26:41 INFO: total jobs selected (ALL): 0/1 [EState: 1]</FONT><BR>
<BR>
<FONT SIZE="2">&lt;SNIP&gt;</FONT><BR>
<BR>
<BR>
<FONT SIZE="2">08/24 05:27:31 INFO: job '40003' Priority: 1530</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 530(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MStatClearUsage([NONE],Active)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MResDestroy(NULL)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 INFO: total jobs selected (ALL): 0/1 [EState: 1]</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 INFO: job '40003' Priority: 1530</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 INFO: Cred: 1000(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 530(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MStatClearUsage([NONE],Idle)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MResDestroy(NULL)</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 INFO: total jobs selected (ALL): 0/1 [EState: 1]</FONT><BR>
<FONT SIZE="2">08/24 05:27:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,</FONT><BR>
<BR>
<FONT SIZE="2">and so on.</FONT><BR>
<BR>
<BR>
<BR>
</BODY>
</HTML>