[torquedev] Qsub wrapper to resubmit jobs exceeding resources requested - Class project idea

KS gridsngators at gmail.com
Mon Mar 31 10:11:12 MDT 2008


Hello All,

I am a grad student trying to find a useful class project (not semester 
long, just two weeks long) to work on for an Autonomic Computing class I 
am taking.

This is a simple idea I have, and I am curious to see what Torque 
devs/users think of it. As a user of a cluster that is accessed through 
Torque/Maui, this is one issue that has always troubled me, and I am 
thinking of a way to fix it.

I would like to build a wrapper to the qsub command that will:

1. Parse the submission script to retrieve the values used for the PBS 
directives for requested CPU, memory, walltime/cputime and the stderr file 
(say PBS.err, which is where the PBS error messages get directed).
2. Submit the job using qsub.
Next I would like to do either one or both of the following steps:
3.(A) Wait for the job to finish. Parse the PBS.err file to see if I can 
find a known error message such as “=>> PBS: job killed: walltime 11 
exceeded limit 1”.
If this is the case, I will increase the request for the specific 
resource and resubmit the job. The user can be made aware of this 
through a log message, but basically this saves the time of having to 
manually monitor each job and resubmit it whenever necessary.
3.(B) Monitor the status of the job on the nodes that it gets assigned 
to using 'top'. If I see the CPU or memory utilization getting really 
high, I'd like to use qalter to increase the specific resource 
requirement. I have not used qalter before, so I need to do some tests to 
see if this is really possible. From the Torque documentation I see that 
changing the resource requirements of a running job is 
'implementation-dependent'. If this is not possible I'll just stick to 3(A).
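
A rough sketch of steps 1 and 3(A) in Python might look like the 
following. It is only an illustration under stated assumptions: the 
function names are made up for this example, the regular expression is 
modelled on the single error message quoted above and has not been 
checked against other PBS messages, only walltime is handled, and it 
assumes that a "-l" option given on the qsub command line takes 
precedence over the directive in the script, which should be verified on 
the target installation.

```python
import re

# Matches directive lines such as "#PBS -l nodes=1:ppn=2,walltime=00:10:00".
DIRECTIVE_RE = re.compile(r"^#PBS\s+-l\s+(\S+)")
# Modelled on the one message quoted in the post; other formats are untested.
KILLED_RE = re.compile(r"=>> PBS: job killed: (\w+) (\d+) exceeded limit (\d+)")


def parse_resources(script_text):
    """Collect name=value pairs from '#PBS -l' lines of a submission script."""
    resources = {}
    for line in script_text.splitlines():
        match = DIRECTIVE_RE.match(line.strip())
        if not match:
            continue
        for item in match.group(1).split(","):
            name, sep, value = item.partition("=")
            if sep:
                resources[name] = value
    return resources


def find_exceeded(err_text):
    """Return (resource, used, limit) if a kill message is found, else None."""
    match = KILLED_RE.search(err_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def to_seconds(walltime):
    """Convert a [[HH:]MM:]SS string to seconds."""
    seconds = 0
    for part in walltime.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_hms(seconds):
    """Format seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


def resubmit_command(script_path, script_text, err_text, factor=2):
    """Build a qsub argument list with a larger walltime, or return None."""
    exceeded = find_exceeded(err_text)
    if exceeded is None or exceeded[0] != "walltime":
        return None  # nothing this sketch knows how to fix
    requested = parse_resources(script_text).get("walltime")
    if requested is None:
        return None
    bumped = to_hms(to_seconds(requested) * factor)
    # Assumption: the command-line -l overrides the #PBS directive.
    return ["qsub", "-l", "walltime=" + bumped, script_path]
```

The returned list could be handed to something like subprocess.run() 
once the job has finished, ideally with a cap on the number of retries 
so that a runaway job does not keep doubling its request.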

So my questions are:
1. This seems so simple to me that it is perhaps already possible to do 
this. Is it?
2. Would this be a useful piece of code that others might like to use?
3. What other kinds of errors (besides exceeding CPU, memory, or time 
resource requests) have you come across that required you to change a PBS 
parameter (or make some other straightforward fix that can be automated 
in a script) and resubmit the job?

I haven't delved into the implementation details as yet, since I 
would like to make sure it's a good idea before I go ahead.

It would be great if you could share with me any comments you might have.

Thanks in advance.

-KS

P.S. I am sending this to both the torqueusers and torquedev mailing 
lists so that I can get both perspectives on this. I hope this is not 
considered spamming.



