[Moabusers] job disappeared from qstat and showq, still running on node

Pim Schravendijk schraven at csi.tu-darmstadt.de
Mon Mar 21 10:40:27 MDT 2011


momctl still seems to know that job 11591 is there:

######################

Host: node04/node04.cluster   Version: 2.5.2   PID: 3045
Server[0]: enzo.cluster (192.168.0.254:1023)
  Last Msg From Server:   5 seconds (StatusJob)
  Last Msg To Server:     22 seconds
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             1218913 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
job[11591.enzo.cluster]  state=RUNNING  sidlist=21581
job[12013.enzo.cluster]  state=RUNNING  sidlist=23461
job[12220.enzo.cluster]  state=RUNNING  sidlist=
job[12136.enzo.cluster]  state=RUNNING  sidlist=
job[11829.enzo.cluster]  state=RUNNING  sidlist=
job[12259.enzo.cluster]  state=RUNNING  sidlist=
Assigned CPU Count:     140

diagnostics complete

##############
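
For comparing what the MOM reports against what the server knows, a minimal Python sketch along these lines may help. It only parses text in the format pasted above; the function names, the comma-separated reading of sidlist, and the set of job IDs attributed to qstat are assumptions made for illustration, not output from this cluster.

```python
import re

# Matches lines such as: job[11591.enzo.cluster]  state=RUNNING  sidlist=21581
JOB_LINE = re.compile(
    r"^job\[(?P<id>[^\]]+)\]\s+state=(?P<state>\S+)\s+sidlist=(?P<sids>\S*)\s*$"
)

# Job lines copied from the momctl output above.
MOMCTL_TEXT = """\
job[11591.enzo.cluster]  state=RUNNING  sidlist=21581
job[12013.enzo.cluster]  state=RUNNING  sidlist=23461
job[12220.enzo.cluster]  state=RUNNING  sidlist=
job[12136.enzo.cluster]  state=RUNNING  sidlist=
job[11829.enzo.cluster]  state=RUNNING  sidlist=
job[12259.enzo.cluster]  state=RUNNING  sidlist=
"""


def parse_mom_jobs(momctl_text):
    """Return {job_id: {"state": ..., "sids": [...]}} for every job[...] line."""
    jobs = {}
    for line in momctl_text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            # Assumption: multiple session IDs would be comma-separated.
            sids = [s for s in match.group("sids").split(",") if s]
            jobs[match.group("id")] = {"state": match.group("state"), "sids": sids}
    return jobs


def orphaned_jobs(mom_jobs, server_job_ids):
    """Jobs the MOM still tracks that the server no longer lists."""
    return sorted(job_id for job_id in mom_jobs if job_id not in server_job_ids)


mom_jobs = parse_mom_jobs(MOMCTL_TEXT)

# Hypothetical: the job IDs qstat still shows (everything except 11591).
server_job_ids = {
    "12013.enzo.cluster",
    "12220.enzo.cluster",
    "12136.enzo.cluster",
    "11829.enzo.cluster",
    "12259.enzo.cluster",
}

print(orphaned_jobs(mom_jobs, server_job_ids))
```

With the hypothetical qstat set above, the only job reported is 11591.enzo.cluster, which is the one missing from qstat and showq.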

tracejob (full output below) prints a line with no content at all, not
even a line break, followed by a "dequeuing from batch" message:

03/21/2011 00:21:19  A    03/21/2011 00:21:19  S    dequeuing from batch, state EXITING

That is not a very informative message, but at least it is clear that
something went wrong there.


#############

/var/spool/torque/mom_logs/20110321: No such file or directory
/var/spool/torque/sched_logs/20110321: No such file or directory
/var/spool/torque/mom_logs/20110320: No such file or directory
/var/spool/torque/sched_logs/20110320: No such file or directory

Job: 11591.enzo.cluster

03/20/2011 11:47:52  A    03/20/2011 11:47:52  S    Job Rerun at request of root at enzo.cluster
03/20/2011 11:49:54  S    Job Run at request of root at enzo.cluster
03/20/2011 11:50:03  S    Not sending email: User does not want mail of this type.
03/20/2011 11:50:03  A    user=ealgaer group=cpc jobname=ben_290 queue=batch
                          ctime=1300183115 qtime=1300183115 etime=1300183115
                          start=1300618203 owner=ealgaer at enzo.cluster
                          exec_host=node08/7+node07/5+node05/22+node03/6
                          Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.procs=4
                          Resource_List.qos=angie Resource_List.walltime=336:00:00
03/21/2011 00:20:49  A    03/21/2011 00:20:49  S    Job Rerun at request of root at enzo.cluster
03/21/2011 00:21:08  S    Job Run at request of root at enzo.cluster
03/21/2011 00:21:19  A    03/21/2011 00:21:19  S    dequeuing from batch, state EXITING
03/21/2011 00:21:19  S    Email 'a' to ealgaer at enzo.cluster failed: Child process
                          'sendmail -f root ealgaer at enzo.cluster' returned 127
                          (errno 10:No child processes)

##########
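
The accounting record in the tracejob output is easier to read once the key=value pairs are split and the epoch timestamps converted. The following is a small Python sketch of that, using an excerpt of the record above; the helper names are made up for illustration, and the times are shown in UTC because the server's time zone is not stated in the log.

```python
from datetime import datetime, timezone

# Excerpt of the 03/20/2011 11:50:03 accounting record from the tracejob output.
RECORD = (
    "user=ealgaer group=cpc jobname=ben_290 queue=batch "
    "ctime=1300183115 qtime=1300183115 etime=1300183115 start=1300618203 "
    "exec_host=node08/7+node07/5+node05/22+node03/6 "
    "Resource_List.procs=4 Resource_List.walltime=336:00:00"
)


def parse_record(text):
    """Split a whitespace-separated key=value record into a dict."""
    return dict(token.split("=", 1) for token in text.split() if "=" in token)


def to_utc(epoch):
    """Convert an epoch-seconds string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc)


fields = parse_record(RECORD)
start = to_utc(fields["start"])
queued_for = start - to_utc(fields["qtime"])

# exec_host entries look like "node08/7"; keep only the host part.
exec_nodes = [slot.split("/")[0] for slot in fields["exec_host"].split("+")]

print(start.isoformat())   # start time in UTC
print(queued_for)          # time between qtime and start
print(exec_nodes)          # hosts the job was assigned to
```

The resulting host list can then be compared with the node on which momctl was run, and the start time with the timestamps that tracejob prints in local time.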


On Mon, Mar 21, 2011 at 5:26 PM, Lloyd Brown <lloyd_brown at byu.edu> wrote:
> On 3/21/11 10:22 AM, Pim Schravendijk wrote:
>> Dear all, in our configuration of Moab with PBS, jobs sometimes disappear.
>> In the current case, job 11591 doesn't show up in either qstat or
>> showq, but is still running on the node itself
>
>
> What does the pbs_mom think?   By that, I mean, what is the output of
> either "momctl -d 0" run on the same host, or "momctl -d 0 -h
> nodehostname" run on another host (where "nodehostname" is the name of
> the host).
>
> Similarly, what do you get when you run something like "tracejob
> jobnumber" on the pbs_server host?  Basically, what does the pbs_server
> think happened to the job.  Depending on how long ago it happened, you
> may have to add the "-n somenumber" to search the logs for "somenumber"
> days ago.
>
>
>
> --
>
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
>
>

