[Mauiusers] running jobs & restarting maui
Thomas Dargel
td at chemie.hu-berlin.de
Tue Nov 8 02:20:16 MST 2005
Hi Chris,
thanks for answering my mail,
On Tue, Nov 08, 2005 at 09:59:50AM +1100, Chris Samuel wrote:
> On Tue, 8 Nov 2005 01:40 am, Thomas Dargel wrote:
>
> > Sorry if I missed something in the docs, but is it normal that a
> > restart of maui kills all running jobs?
>
> No!
That's good.
>
> > How can I keep the jobs running in spite of restarting maui?
>
> What does Maui log when it kills them ?
>
After a closer look at the log, I found this:
11/08 09:20:41 ALERT: job '561' in state 'Running' has exceeded its wallclock limit (0+S:0) by 16:43:00 (job will be cancelled)
11/08 09:20:41 MSysRegEvent(JOBWCVIOLATION: job '561' in state 'Running' has exceeded its wallclock limit (0) by 16:43:00 (job will be cancelled) job start time: Mon Nov 7 16:37:41 ,0,0,1)
11/08 09:20:41 MSysLaunchAction(ASList,1)
11/08 09:20:41 MRMJobCancel(561,MOAB_INFO: job exceeded wallclock limit ,SC)
11/08 09:20:41 MPBSJobCancel(561,node01,CMsg,Msg,MOAB_INFO: job exceeded wallclock limit)
11/08 09:20:41 INFO: job '561' successfully cancelled
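The ALERT line above carries the job id, the job state, the limit Maui believes applies, and the overrun. As a rough sketch for pulling those fields out of maui.log, assuming Python is at hand (the regex and the name find_wallclock_violations are illustrative only and not part of Maui):

```python
import re

# Pattern modelled on the ALERT line quoted above; other Maui versions
# may word the message differently (assumption).
ALERT_RE = re.compile(
    r"ALERT:\s+job '(?P<job>[^']+)' in state '(?P<state>[^']+)' "
    r"has exceeded its wallclock limit \((?P<limit>[^)]*)\) "
    r"by (?P<over>[\d:]+)"
)


def find_wallclock_violations(lines):
    """Return one dict per ALERT line reporting a wallclock violation."""
    hits = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


sample_log = [
    "11/08 09:20:41 ALERT: job '561' in state 'Running' has exceeded its "
    "wallclock limit (0+S:0) by 16:43:00 (job will be cancelled)",
    "11/08 09:20:41 INFO: job '561' successfully cancelled",
]

violations = find_wallclock_violations(sample_log)
for v in violations:
    print(v["job"], v["state"], v["limit"], v["over"])
```

For the sample line the limit field comes back as 0+S:0, which agrees with the (0) shown in the MSysRegEvent entry.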
Do I have to set a 'wallclock limit' in maui.cfg or when the job is submitted?
> Thinking back - are your users setting walltimes for their jobs ?
> If not - what is the default walltime you are assigning ?
>
No, resources_default.walltime is not set on the server. With the
torque scheduler, leaving it unset gives jobs an infinite walltime,
and that is what I need with the maui scheduler as well.
> What do the output of checkjob and qstat -f look like for a sample job on your
> system ?
>
qstat -f 561
Job Id: 561.cnode01.mauicluster
Job_Name = job.dual
Job_Owner = td@cnode01.mauicluster
resources_used.cput = 16:39:49
resources_used.mem = 517212kb
resources_used.vmem = 597276kb
resources_used.walltime = 16:40:02
job_state = R
queue = cpu-2
server = cnode01.mauicluster
Checkpoint = u
ctime = Mon Nov 7 16:37:39 2005
Error_Path = cnode01.mauicluster:/huge/td/cpmd/job.dual.e561
exec_host = cnode01/1+cnode01/0
Hold_Types = n
Join_Path = eo
Keep_Files = n
Mail_Points = a
mtime = Mon Nov 7 16:37:41 2005
Output_Path = cnode01.mauicluster:/huge/td/cpmd/job.dual.o561
Priority = 0
qtime = Mon Nov 7 16:37:39 2005
Rerunable = False
Resource_List.mem = 8191mb
Resource_List.neednodes = 1:ppn=2
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
session_id = 705
substate = 42
Variable_List = PBS_O_HOME=/users/td,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=td,
PBS_O_PATH=/sysinst/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/u
sr/games:/opt/gnome/bin:/opt/kde3/bin:/apps/maui/bin:/apps/torque/bin,
PBS_O_MAIL=/var/mail/td,PBS_O_SHELL=/bin/ksh,
PBS_O_HOST=cnode01.mauicluster,PBS_O_WORKDIR=/huge/td/cpmd,
PBS_O_QUEUE=mixpipe
euser = td
egroup = qc
hashname = 561.cnode01
queue_rank = 611
queue_type = E
etime = Mon Nov 7 16:37:39 2005
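Note that the Resource_List entries in this output cover mem, neednodes, nodect and nodes, with no walltime entry. A small sketch, assuming Python and the "name = value" layout shown above, for listing what a job actually requested (requested_resources is a hypothetical helper, not a Torque tool):

```python
def requested_resources(qstat_text):
    """Collect the Resource_List.* attributes from `qstat -f` output."""
    resources = {}
    for line in qstat_text.splitlines():
        # Attribute lines look like "name = value"; wrapped continuation
        # lines (e.g. inside Variable_List) have no " = " and are skipped.
        key, sep, value = line.strip().partition(" = ")
        if sep and key.startswith("Resource_List."):
            resources[key[len("Resource_List."):]] = value.strip()
    return resources


# Excerpt copied from the qstat -f output for job 561.
sample_qstat = """\
job_state = R
queue = cpu-2
Resource_List.mem = 8191mb
Resource_List.neednodes = 1:ppn=2
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
session_id = 705
"""

resources = requested_resources(sample_qstat)
print(sorted(resources))
print("walltime requested:", "walltime" in resources)
```

Run against the excerpt, the last line reports that no walltime was requested for the job.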
checkjob -v 561
checking job 561 (RM job '561.cnode01.mauicluster')
State: Running
Creds: user:td group:qc class:cpu-2 qos:DEFAULT
WallTime: 16:41:29 of 99:23:59:59
SubmitTime: Mon Nov 7 16:37:39
(Time Queued Total: 00:00:02 Eligible: 00:00:02)
StartTime: Mon Nov 7 16:37:41
Total Tasks: 2
Req[0] TaskCount: 2 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [dual]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 4095M
Utilized Resources Per Task: PROCS: 0.49 MEM: 2.52 SWAP: 5.83
Avg Util Resources Per Task: PROCS: 0.49
Max Util Resources Per Task: PROCS: 0.49 MEM: 2.52 SWAP: 5.83
Average Utilized Memory: 255.79 MB
Average Utilized Procs: 0.98
NodeAccess: SHARED
TasksPerNode: 2 NodeCount: 1
Allocated Nodes:
[cnode01:2]
Task Distribution: cnode01,cnode01
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Reservation '561' (-16:41:39 -> 99:07:18:20 Duration: 99:23:59:59)
PE: 2.00 StartPriority: 1
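checkjob reports the limit as 99:23:59:59 while the log reports a limit of 0. To compare such values it helps to convert them to seconds. A minimal sketch, assuming the usual [[DD:]HH:]MM:SS reading of Maui time strings (maui_time_to_seconds is a name made up for this example):

```python
def maui_time_to_seconds(text):
    """Convert a [[DD:]HH:]MM:SS style string to seconds.

    Assumes fields are read right to left as seconds, minutes, hours, days.
    """
    parts = [int(p) for p in text.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected 1 to 4 colon-separated fields: %r" % text)
    # Left-pad with zeros so the list is always [days, hours, minutes, seconds].
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


used = maui_time_to_seconds("16:41:29")       # WallTime used, from checkjob
limit = maui_time_to_seconds("99:23:59:59")   # limit shown by checkjob
print(used, limit, used < limit)
```

With the values from checkjob this gives 60089 seconds used against a limit of 8639999 seconds, so by checkjob's own numbers the job is far below its limit.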
When I searched maui.log, I also found the following error message:
ERROR: job '571' has NULL WCLimit field
Changing XFMINWCLIMIT from "00:02:00" to "-1" made no difference.
Any hints on what I need to do?
Thank you in advance,
Thomas.
> Chris
> --
> Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
> Victorian Partnership for Advanced Computing http://www.vpac.org/
> Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
>