[torquedev] Qsub wrapper to resubmit jobs exceeding resources
requested - Class project idea
KS
gridsngators at gmail.com
Mon Mar 31 10:11:12 MDT 2008
Hello All,
I am a grad student trying to find a useful class project (not semester-long,
just two weeks long) to work on for an Autonomic Computing class I
am taking.
This is a simple idea I have, and I am curious to see what Torque
devs/users think of it. I use a cluster that is accessed through
Torque/Maui, and this is one issue that has always troubled me, so I
am thinking of a way to fix it.
I would like to build a wrapper to the qsub command that will:
1. Parse the submission script to retrieve the values used for the PBS
directives for requested CPU, memory, walltime/cputime and the stderr file
(say PBS.err, which is where the PBS error messages get directed).
2. Submit the job using qsub.
Next I would like to do either one of the following steps, or both:
3.(A) Wait for job to finish. Parse the PBS.err file to see if I can
find a known error message such as “=>> PBS: job killed: walltime 11
exceeded limit 1”
If this is the case, I will increase the request for the specific
resource and resubmit the job. The user can be made aware of this
through a log message, but basically this saves the time of having to
manually monitor each job and resubmit it whenever necessary.
3.(B) Monitor the status of the job on the nodes that it gets assigned
to using 'top'. If I see the CPU util or Mem utilization getting really
high, I'd like to use qalter to increase the specific resource
requirement. I have not used qalter before so I need to do some tests to
see if this is really possible. From the Torque documentation I see that
changing the resource requirements of a running job is
'implementation-dependent'. If this is not possible I'll just stick to 3(A).
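Steps 1 and 3(A) could be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions: resources are requested on "#PBS -l" lines, the kill notice has the form quoted above, walltime is written as HH:MM:SS, and a "-l" option on the qsub command line takes precedence over the directive in the script. The function names and the doubling factor are hypothetical, and only the walltime case is handled.

```python
import re

# Matches lines such as "#PBS -l walltime=01:00:00,mem=2gb" in a submission script.
DIRECTIVE_RE = re.compile(r"^#PBS\s+-l\s+(\S+)")
# Matches a kill notice of the form quoted in step 3(A).
KILLED_RE = re.compile(r"=>> PBS: job killed: (\w+) (\S+) exceeded limit (\S+)")


def parse_resources(script_text):
    """Collect name=value pairs from all '#PBS -l' lines (hypothetical helper)."""
    resources = {}
    for line in script_text.splitlines():
        match = DIRECTIVE_RE.match(line.strip())
        if not match:
            continue
        for item in match.group(1).split(","):
            name, sep, value = item.partition("=")
            if sep:
                resources[name] = value
    return resources


def find_exceeded(err_text):
    """Return the name of the resource that was exceeded, or None."""
    match = KILLED_RE.search(err_text)
    return match.group(1) if match else None


def scale_walltime(value, factor=2):
    """Multiply an HH:MM:SS walltime by factor (assumed policy: double it)."""
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    seconds = int(seconds * factor)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_resubmit_command(script_path, resources, exceeded):
    """Build a qsub command with a larger walltime, or None if not handled."""
    if exceeded != "walltime" or "walltime" not in resources:
        return None
    new_limit = scale_walltime(resources["walltime"])
    return ["qsub", "-l", f"walltime={new_limit}", script_path]
```

The returned list could be handed to subprocess.run() to resubmit the job. Memory and CPU limits would need their own unit handling (kb, mb, gb), which is left out here.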
So my questions are:
1. This seems so simple to me that it is perhaps already possible to do
this. Is it?
2. Would this be a useful piece of code that others might like to use?
3. What other kinds of errors (besides exceeding CPU, memory or time resource
requests) have you come across that required you to change a PBS
parameter (or make some other straightforward fix that can be automated in a
script) and resubmit the job?
I haven't yet delved into the implementation details, since I
would like to make sure it's a good idea before I go ahead.
It would be great if you could share with me any comments you might have.
Thanks in advance.
-KS
P.S. I am sending this to both the torqueusers and torquedev mailing
lists so that I can get both perspectives on this. I hope this is not
considered spamming.