[Moabusers] job disappeared from qstat and showq, still running on node
Pim Schravendijk
schraven at csi.tu-darmstadt.de
Mon Mar 21 10:40:27 MDT 2011
momctl still seems to know that job 11591 is there:
######################
Host: node04/node04.cluster Version: 2.5.2 PID: 3045
Server[0]: enzo.cluster (192.168.0.254:1023)
Last Msg From Server: 5 seconds (StatusJob)
Last Msg To Server: 22 seconds
HomeDirectory: /var/spool/torque/mom_priv
MOM active: 1218913 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
job[11591.enzo.cluster] state=RUNNING sidlist=21581
job[12013.enzo.cluster] state=RUNNING sidlist=23461
job[12220.enzo.cluster] state=RUNNING sidlist=
job[12136.enzo.cluster] state=RUNNING sidlist=
job[11829.enzo.cluster] state=RUNNING sidlist=
job[12259.enzo.cluster] state=RUNNING sidlist=
Assigned CPU Count: 140
diagnostics complete
##############
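In case it helps anyone comparing this against the server side, here is a minimal Python sketch that pulls the job lines out of momctl diagnostic text like the output above and reports jobs the MOM knows about but the server listing does not. The function names (mom_jobs, orphaned) and the known_to_server list are hypothetical; in practice the list would have to be filled from qstat by hand or by a separate script. The sidlist value is kept as a raw string, since the separator for multiple session ids is not shown here.

```python
import re

# Matches lines such as: job[11591.enzo.cluster] state=RUNNING sidlist=21581
JOB_LINE = re.compile(
    r"^job\[(?P<id>[^\]]+)\]\s+state=(?P<state>\S+)\s+sidlist=(?P<sids>\S*)"
)


def mom_jobs(momctl_output):
    """Return {job_id: (state, raw_sidlist)} for each job line in the text."""
    jobs = {}
    for line in momctl_output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            jobs[match.group("id")] = (match.group("state"), match.group("sids"))
    return jobs


def orphaned(momctl_output, server_job_ids):
    """Job ids reported by the MOM that are missing from the server listing."""
    return sorted(set(mom_jobs(momctl_output)) - set(server_job_ids))


# Three of the job lines from the momctl output above.
sample = """\
job[11591.enzo.cluster] state=RUNNING sidlist=21581
job[12013.enzo.cluster] state=RUNNING sidlist=23461
job[12220.enzo.cluster] state=RUNNING sidlist=
"""

# Hypothetical: the job ids that qstat still shows on the server.
known_to_server = ["12013.enzo.cluster", "12220.enzo.cluster"]

print(orphaned(sample, known_to_server))  # ['11591.enzo.cluster']
```

With the sample above, only 11591.enzo.cluster is reported, which is the job that is missing from qstat and showq.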
tracejob (see below) prints an "A" entry with no content at all, not
even a line break, followed by a "dequeuing from batch" message:
03/21/2011 00:21:19 A 03/21/2011 00:21:19 S dequeuing from
batch, state EXITING
That is not a very informative message, but at least it is clear that
something went wrong there.
#############
/var/spool/torque/mom_logs/20110321: No such file or directory
/var/spool/torque/sched_logs/20110321: No such file or directory
/var/spool/torque/mom_logs/20110320: No such file or directory
/var/spool/torque/sched_logs/20110320: No such file or directory
Job: 11591.enzo.cluster
03/20/2011 11:47:52 A 03/20/2011 11:47:52 S Job Rerun at
request of root at enzo.cluster
03/20/2011 11:49:54 S Job Run at request of root at enzo.cluster
03/20/2011 11:50:03 S Not sending email: User does not want mail
of this type.
03/20/2011 11:50:03 A user=ealgaer group=cpc jobname=ben_290
queue=batch ctime=1300183115 qtime=1300183115 etime=1300183115
start=1300618203 owner=ealgaer at enzo.cluster
exec_host=node08/7+node07/5+node05/22+node03/6
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.procs=4 Resource_List.qos=angie
Resource_List.walltime=336:00:00
03/21/2011 00:20:49 A 03/21/2011 00:20:49 S Job Rerun at
request of root at enzo.cluster
03/21/2011 00:21:08 S Job Run at request of root at enzo.cluster
03/21/2011 00:21:19 A 03/21/2011 00:21:19 S dequeuing from
batch, state EXITING
03/21/2011 00:21:19 S Email 'a' to ealgaer at enzo.cluster failed:
Child process 'sendmail -f root ealgaer at enzo.cluster' returned 127
(errno 10:No child processes)
##########
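To make the accounting line in the tracejob output easier to read, here is a small Python sketch that splits the key=value fields and converts the epoch timestamps to UTC. The helper names (parse_accounting, epoch_to_utc) are hypothetical, and the record string is copied from the 03/20 accounting entry above with the line wrapping and the Resource_List fields left out.

```python
from datetime import datetime, timezone


def parse_accounting(record):
    """Split a whitespace-separated key=value record into a dict.

    Tokens without '=' (for example the 'at' of an obfuscated address)
    are skipped.
    """
    fields = {}
    for token in record.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def epoch_to_utc(seconds):
    """Format an epoch timestamp (string or int) as a UTC date and time."""
    moment = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


record = (
    "user=ealgaer group=cpc jobname=ben_290 queue=batch "
    "ctime=1300183115 qtime=1300183115 etime=1300183115 "
    "start=1300618203 exec_host=node08/7+node07/5+node05/22+node03/6"
)

fields = parse_accounting(record)
print(epoch_to_utc(fields["ctime"]))  # 2011-03-15 09:58:35
print(epoch_to_utc(fields["start"]))  # 2011-03-20 10:50:03

# Host names from exec_host, with the per-node slot numbers dropped.
hosts = sorted({slot.split("/")[0] for slot in fields["exec_host"].split("+")})
print(hosts)  # ['node03', 'node05', 'node07', 'node08']
```

The start value is consistent with the 11:50:03 log entry if the server clock is one hour ahead of UTC, which is an assumption on my part. The host list can then be compared with the node on which momctl was run.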
On Mon, Mar 21, 2011 at 5:26 PM, Lloyd Brown <lloyd_brown at byu.edu> wrote:
> On 3/21/11 10:22 AM, Pim Schravendijk wrote:
>> Dear all, in our configuration of Moab with PBS, jobs sometimes disappear.
>> In the current case, job 11591 doesn't show up in either qstat or
>> showq, but is still running on the node itself
>
>
> What does the pbs_mom think? By that, I mean, what is the output of
> either "momctl -d 0" run on the same host, or "momctl -d 0 -h
> nodehostname" run on another host (where "nodehostname" is the name of
> the host).
>
> Similarly, what do you get when you run something like "tracejob
> jobnumber" on the pbs_server host? Basically, what does the pbs_server
> think happened to the job? Depending on how long ago it happened, you
> may have to add "-n somenumber" to search the logs going back
> "somenumber" days.
>
>
>
> --
>
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
>
>