[Mauiusers] Possible Maui bug preempting non-rerunnable jobs
Kevin Hildebrand
kevin at umd.edu
Tue Apr 22 07:33:55 MDT 2008
Hello, I've discovered what appears to be a bug when Maui tries to preempt
a job that is marked as preemptible but that the user has declared
non-rerunnable via qsub.
I'm not sure what SHOULD happen in this case, but what IS happening is
definitely undesirable.
Currently, Maui is attempting to tell Torque to rerun the job, and Torque
is refusing, because the job is marked non-rerunnable. Maui is seeing
this as a resource manager failure, and is bumping the RM FailCount. With
a bunch of active non-rerunnable jobs this pushes the FailCount above
MAX_RMFAILCOUNT, and this stops Maui from processing the rest of the jobs
in the queue. The end result is that jobs back up in the queue, all of
them showing "job can run in partition DEFAULT", but unable to run.
I've temporarily worked around the problem by commenting out the line that
increments R->FailCount in MPBSJobRequeue (MPBSI.c) but that's probably
not the best solution.
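
A less blunt alternative to removing the increment might be to count only
failures that say something about the resource manager's health. The sketch
below is illustrative only: the types, enum values, and function names are
hypothetical stand-ins, not Maui's actual code, and the comparison against
the limit is assumed from the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT Maui's real types or fields. */
enum requeue_result {
    REQUEUE_OK,            /* RM accepted the rerun request */
    REQUEUE_JOB_REJECTED,  /* RM is healthy but refused this particular job */
    REQUEUE_RM_ERROR       /* RM unreachable or returned a generic failure */
};

struct rm_state {
    int fail_count;        /* plays the role of R->FailCount */
    int max_fail_count;    /* plays the role of MAX_RMFAILCOUNT */
};

/* Record the outcome of one requeue attempt. A per-job refusal (for
 * example, a non-rerunnable job) leaves the counter untouched. */
static void record_requeue_result(struct rm_state *rm, enum requeue_result res)
{
    assert(rm != NULL);
    if (res == REQUEUE_RM_ERROR)
        rm->fail_count++;
}

/* Assumption: the RM is treated as unusable once the counter rises
 * above the limit. */
static bool rm_is_usable(const struct rm_state *rm)
{
    assert(rm != NULL);
    return rm->fail_count <= rm->max_fail_count;
}
```

With this split, any number of refused non-rerunnable jobs would leave the
counter at zero, while genuine communication failures would still trip the
limit.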
Some thoughts:
1) Why is Maui trying to tell Torque to rerun a job it should already know
is non-rerunnable?
2) What is the general feeling for how non-rerunnable jobs should be
handled in a preemptible queue? Personally, I'd think either they
shouldn't be allowed in the queue, or they should be killed if they need
to be preempted.
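
To make the second question concrete, here is a rough sketch of how a
scheduler could pick a preemption action from the job's rerunnable flag
before contacting the resource manager. Everything here is hypothetical
(the struct, the enum, and the cancel_non_rerunnable switch are invented
for illustration and do not correspond to existing Maui parameters). The
"skip" branch is an extra option beyond the two suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not Maui's actual data structures. */
enum preempt_action {
    PREEMPT_REQUEUE,  /* ask the RM to rerun the job */
    PREEMPT_CANCEL,   /* kill the job outright */
    PREEMPT_SKIP      /* leave the job alone and look for another victim */
};

struct job_info {
    bool preemptible;
    bool rerunnable;  /* false when the job was submitted with qsub -r n */
};

/* Decide what to do with a preemption candidate. A requeue is requested
 * only for jobs the RM would be willing to rerun. */
static enum preempt_action choose_preempt_action(const struct job_info *job,
                                                 bool cancel_non_rerunnable)
{
    assert(job != NULL);
    if (!job->preemptible)
        return PREEMPT_SKIP;
    if (job->rerunnable)
        return PREEMPT_REQUEUE;
    /* Non-rerunnable: either kill it or leave it running, by site policy. */
    return cancel_non_rerunnable ? PREEMPT_CANCEL : PREEMPT_SKIP;
}
```

Either branch for non-rerunnable jobs avoids sending a rerun request that
Torque is going to refuse.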
Thanks,
Kevin Hildebrand
University of Maryland, College Park