[Moabusers] job disappeared from qstat and showq, still running on node
Pim Schravendijk
schraven at csi.tu-darmstadt.de
Mon Mar 21 12:41:51 MDT 2011
Yes! Most of the jobs are preemptable, and the scheduling of all jobs
is mostly preemption-based. As far as I know there is no epilogue
script at all, so I can't imagine anything going wrong there. Maybe I
can set a longer timeout for PBS in its communication with Moab?

Our current Torque is 2.5.2 and the latest is 2.5.4, so I will consider
upgrading if that helps.

checknode doesn't know about job 11591 either. There is no log output
about it either; the last log output is from last month. The relevant
part of the checknode output:

Total Time: 147days Up: 135days (92.17%) Active: 61:05:39:15 (41.59%)
Reservations:
11829x1 Job:Running -3:29:13 -> 2:20:30:47 (3:00:00:00)
12013x12 Job:Running -5:49:56 -> 13:18:10:04 (14:00:00:00)
12220x1 Job:Running -5:47:43 -> 2:18:12:17 (3:00:00:00)
12259x6 Job:Running -2:54:36 -> 3:05:24 (6:00:00)
12263x4 Job:Running -1:00:19 -> 4:59:41 (6:00:00)
Jobs: 11829,12013,12220,12259,12263
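
To make the mismatch easier to see, here is a minimal sketch that compares the job list the pbs_mom reports with the job list Moab reports for the same node. The two input strings are copied from the momctl and checknode output in this thread; the helper names (mom_job_ids, checknode_job_ids) are made up for illustration and are not part of Torque or Moab. Note that, going by the message timestamps, the two snapshots were taken about two hours apart, so not every difference is necessarily an orphaned job.

```python
import re

# Job lines copied from the "momctl -d 0" output quoted further down in this thread.
MOMCTL_OUTPUT = """\
job[11591.enzo.cluster] state=RUNNING sidlist=21581
job[12013.enzo.cluster] state=RUNNING sidlist=23461
job[12220.enzo.cluster] state=RUNNING sidlist=
job[12136.enzo.cluster] state=RUNNING sidlist=
job[11829.enzo.cluster] state=RUNNING sidlist=
job[12259.enzo.cluster] state=RUNNING sidlist=
"""

# "Jobs:" line copied from the checknode output above.
CHECKNODE_JOBS_LINE = "Jobs: 11829,12013,12220,12259,12263"


def mom_job_ids(text):
    """Return the numeric job IDs found in 'job[ID.server] state=...' lines."""
    pattern = r"job\[(\d+)\.[^\]]*\]\s+state="
    return {int(m.group(1)) for m in re.finditer(pattern, text)}


def checknode_job_ids(line):
    """Return the numeric job IDs from a 'Jobs: a,b,c' line."""
    _, _, ids = line.partition(":")
    return {int(tok) for tok in ids.split(",") if tok.strip()}


mom = mom_job_ids(MOMCTL_OUTPUT)
moab = checknode_job_ids(CHECKNODE_JOBS_LINE)

# Jobs the mom still holds that the scheduler no longer lists, and vice versa.
only_on_mom = sorted(mom - moab)
only_in_moab = sorted(moab - mom)

print("known to pbs_mom only:", only_on_mom)
print("known to Moab only:   ", only_in_moab)
```

Job 11591 ends up in the "pbs_mom only" list, which matches what I see by hand. The other differences may simply be jobs that started or finished between the two snapshots.
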
Thanks everybody for the quick replies!
--
Dr. Pim Schravendijk
Technische Universität Darmstadt
Center of Smart Interfaces
Computational Methods
Petersenstraße 32
64287 Darmstadt
schraven at csi.tu-darmstadt.de
tel - 06151 16 4478
On Mon, Mar 21, 2011 at 5:58 PM, Lloyd Brown <lloyd_brown at byu.edu> wrote:
> First off, sorry for not including the list on my last reply. I'll be
> more careful.
>
> This appears to be a preemptable/restartable job. It seems to have run
> on nodes 8, 7, 5, and 3, but now at least includes node 4. Plus, it has
> a re-run log entry.
>
> The reason I ask: I've encountered a few situations in which preemption
> can cause some confusion between the pbs_mom and the pbs_server, as to
> whether the job existed on the node or not. In my case, they were
> getting stuck in the "obit" state, not exiting or running, so I'm not
> sure it's the same thing.
>
> Also, I think the problem was exacerbated by having a long-running
> epilogue script. Are you using epilogue, and if so, how long does it
> take to run?
>
> You may also want to upgrade to the latest Torque; I don't know for
> certain if they were integrated or not, but there certainly have been a
> few communication patches flying around that may help, if that is indeed
> your problem.
>
> Just for my own sanity, what's the output of "checknode -v node4"?
>
> Lloyd
>
>
>
> On 3/21/11 10:40 AM, Pim Schravendijk wrote:
>> momctl still seems to know that job 11591 is there:
>>
>> ######################
>>
>> Host: node04/node04.cluster Version: 2.5.2 PID: 3045
>> Server[0]: enzo.cluster (192.168.0.254:1023)
>> Last Msg From Server: 5 seconds (StatusJob)
>> Last Msg To Server: 22 seconds
>> HomeDirectory: /var/spool/torque/mom_priv
>> MOM active: 1218913 seconds
>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
>> job[11591.enzo.cluster] state=RUNNING sidlist=21581
>> job[12013.enzo.cluster] state=RUNNING sidlist=23461
>> job[12220.enzo.cluster] state=RUNNING sidlist=
>> job[12136.enzo.cluster] state=RUNNING sidlist=
>> job[11829.enzo.cluster] state=RUNNING sidlist=
>> job[12259.enzo.cluster] state=RUNNING sidlist=
>> Assigned CPU Count: 140
>>
>> diagnostics complete
>>
>> ##############
>>
>> tracejob (see below) gives a line with no message at all, not even a
>> line break, followed by "dequeuing from batch":
>>
>> 03/21/2011 00:21:19 A 03/21/2011 00:21:19 S dequeuing from
>> batch, state EXITING
>>
>> That is not a very informative error, but at least it's clear something
>> went wrong there!
>>
>>
>> #############
>>
>> /var/spool/torque/mom_logs/20110321: No such file or directory
>> /var/spool/torque/sched_logs/20110321: No such file or directory
>> /var/spool/torque/mom_logs/20110320: No such file or directory
>> /var/spool/torque/sched_logs/20110320: No such file or directory
>>
>> Job: 11591.enzo.cluster
>>
>> 03/20/2011 11:47:52 A 03/20/2011 11:47:52 S Job Rerun at
>> request of root at enzo.cluster
>> 03/20/2011 11:49:54 S Job Run at request of root at enzo.cluster
>> 03/20/2011 11:50:03 S Not sending email: User does not want mail
>> of this type.
>> 03/20/2011 11:50:03 A user=ealgaer group=cpc jobname=ben_290
>> queue=batch ctime=1300183115 qtime=1300183115 etime=1300183115
>> start=1300618203 owner=ealgaer at enzo.cluster
>>
>> exec_host=node08/7+node07/5+node05/22+node03/6
>> Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
>> Resource_List.procs=4 Resource_List.qos=angie
>> Resource_List.walltime=336:00:00
>> 03/21/2011 00:20:49 A 03/21/2011 00:20:49 S Job Rerun at
>> request of root at enzo.cluster
>> 03/21/2011 00:21:08 S Job Run at request of root at enzo.cluster
>> 03/21/2011 00:21:19 A 03/21/2011 00:21:19 S dequeuing from
>> batch, state EXITING
>> 03/21/2011 00:21:19 S Email 'a' to ealgaer at enzo.cluster failed:
>> Child process 'sendmail -f root ealgaer at enzo.cluster' returned 127
>> (errno 10:No child processes)
>>
>> ##########
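
The accounting ('A') record in the tracejob output above stores ctime, qtime, etime and start as Unix epoch seconds, which are hard to compare with the log timestamps by eye. The following is a small sketch that converts them; the values are copied from the record for job 11591, and the conversion is done in UTC because the server's timezone is not stated in the output.

```python
from datetime import datetime, timezone

# Values copied from the accounting ('A') record of job 11591 above.
record = {
    "ctime": 1300183115,  # job creation
    "qtime": 1300183115,  # job queued
    "etime": 1300183115,  # job became eligible to run
    "start": 1300618203,  # job start (first run in the trace)
}


def to_utc(ts):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


for key, ts in record.items():
    print(f"{key:>5} = {to_utc(ts):%Y-%m-%d %H:%M:%S} UTC")

# Time between creation and the first start.
wait = to_utc(record["start"]) - to_utc(record["ctime"])
print("waited before first start:", wait)
```

The start value comes out as 10:50:03 UTC on 20 March 2011, one hour behind the 11:50:03 shown on the log line, which would be consistent with the server logging in local time at UTC+1 (an assumption, not something the output confirms).
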
>>
>>
>> On Mon, Mar 21, 2011 at 5:26 PM, Lloyd Brown <lloyd_brown at byu.edu> wrote:
>>> On 3/21/11 10:22 AM, Pim Schravendijk wrote:
>>>> Dear all, in our configuration of Moab with PBS, jobs sometimes disappear.
>>>> In the current case, job 11591 doesn't show up in either qstat or
>>>> showq, but is still running on the node itself.
>>>
>>>
>>> What does the pbs_mom think? By that, I mean, what is the output of
>>> either "momctl -d 0" run on the same host, or "momctl -d 0 -h
>>> nodehostname" run on another host (where "nodehostname" is the name of
>>> the host).
>>>
>>> Similarly, what do you get when you run something like "tracejob
>>> jobnumber" on the pbs_server host? Basically, what does the pbs_server
>>> think happened to the job? Depending on how long ago it happened, you
>>> may have to add "-n somenumber" to search the logs going back
>>> "somenumber" days.
>>>
>>>
>>>
>>> --
>>>
>>>
>>> Lloyd Brown
>>> Systems Administrator
>>> Fulton Supercomputing Lab
>>> Brigham Young University
>>> http://marylou.byu.edu
>>>
>>>
>>>
>
>
> --
>
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
>
>