[Mauiusers] RESERVATIONDEPTH not working as I expected
Jim Lawson
jtl+supercluster at uvm.edu
Mon Apr 14 13:15:30 MDT 2008
Hello mauiusers,
At UVM's VACC we are running maui 3.2.6p19, torque 2.1.8. Things had
been working OK until I started trying to tweak how the scheduler
works... :-)
The problem I am trying to solve is backfill starvation. The
highest-priority job, typically a long-running job needing lots of
processors, does OK, but the other big jobs have to wait a long
time, often days, to reach that first position where they get a
reservation. Meanwhile the cluster is busy with lots of tiny jobs.
So, to get more of the larger jobs running sooner, I set
RESERVATIONDEPTH to 2, thinking that it would then make reservations for
the 2 highest priority jobs.
However, more than 2 reservations for jobs in I state are typically
created. The top 2 jobs get a reservation, plus 2 (or more!) for some
of the other, lower-priority jobs. All the jobs have the same QOS
(DEFAULT), so I don't see how RESERVATIONQOSLIST would apply.
What's worse, something seems to be wrong with the reservations made...
Often, a lower-priority job's reservation comes due, but maui doesn't
start the job. Then the running jobs start to drain, because maui isn't
starting any jobs at all. It seems to just get "stuck".
I can get past the problem by running "runjob" to kick the job into
Running state, but it is typically only a few hours before it jams up
again.
I am also noticing ALERTs showing up in my logs that may (?) be related
to this:
> 04/14 14:32:38 ALERT: node 'node028.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node029.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node030.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node031.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node032.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node042.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node111.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
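For anyone who wants to tally these ALERTs across a full log, here is a minimal Python sketch. It assumes the log entries look like the excerpt above (the regular expression is inferred from those seven lines only, not from any Maui documentation), and the function name count_sync_alerts is just an illustrative choice. It strips mail-quote markers and re-joins wrapped lines before matching.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above; the real log layout is an assumption.
ALERT_RE = re.compile(
    r"ALERT:\s+node '(?P<node>[^']+)' sync from expected\s+"
    r"state '(?P<expected>\w+)' to state '(?P<actual>\w+)'"
)


def count_sync_alerts(text):
    """Return a Counter keyed by (node, expected_state, actual_state)."""
    # Drop leading "> " quote markers, then collapse all whitespace so that
    # entries wrapped across two lines become a single matchable string.
    flat = re.sub(r"^\s*>\s?", "", text, flags=re.MULTILINE)
    flat = " ".join(flat.split())
    return Counter(
        (m["node"], m["expected"], m["actual"]) for m in ALERT_RE.finditer(flat)
    )


# Two entries copied from the excerpt above, used as sample input.
sample = """\
> 04/14 14:32:38 ALERT: node 'node028.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT: node 'node029.cluster' sync from expected
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
"""

if __name__ == "__main__":
    for (node, expected, actual), n in sorted(count_sync_alerts(sample).items()):
        print(f"{node}: expected {expected}, found {actual} ({n}x)")
```

Feeding it the whole log file instead of the sample would show whether the same nodes keep going out of sync around the times the scheduler gets stuck.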
For those willing to take a look, config and log file dumps are available at
http://www.uvm.edu/~jtl/mauiprob/
Thanks for any assistance that can be provided.
--
Jim Lawson
Systems Architecture & Administration
Enterprise Technology Services
University of Vermont
Burlington, VT USA