[Mauiusers] Problem with Torque/Maui Interaction

Wickliffe, Blake W blake.wickliffe at aramco.com
Wed Mar 19 06:34:09 MDT 2008


Howdy,

I'm having a bit of a problem with the way Torque interacts with Maui.  I've done a lot of searching on the web and the mail archive, but I can't seem to find anyone who has had the same problem.

We have a cluster of heterogeneous nodes.  Most of them are compute nodes, but some are "master" nodes with very high I/O capacity.  Whenever we submit a job to the cluster, we assign one I/O (master) node and some number of CPU (compute) nodes.  A job submission looks something like:

echo "job.sh" | qsub -l nodes=1:master:ppn=2+128:compute:ppn=2

So far, so good.  This works as expected with Torque and the pbs_sched scheduler or Torque and Maui.

But, we'd like to make it easier for the users.  We define, in qmgr, a default queue "parallel" which has, among other things:

create queue parallel
set queue parallel queue_type = Execution
set queue parallel resources_default.neednodes = 1:master:ppn=2+128:compute:ppn=2
set queue parallel resources_default.nodect = 129
set queue parallel resources_default.nodes = 1:master:ppn=2+128:compute:ppn=2
set queue parallel enabled = True
set queue parallel started = True

This way, the job submission above becomes:

echo "job.sh" | qsub

Still, so far, so good... with pbs_sched.

Then, we replace pbs_sched with Maui and everything breaks.  If you do a checkjob on a job submitted into a Torque/Maui environment, you get:

Req[0]  TaskCount: 2  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [1][master][ppn=2+128][compute][ppn=2]

Req[1]  TaskCount: 10  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [compute]


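It looks as if Maui is splitting the queue's resources_default.nodes line on colons only, ignoring the "+" that separates the two node groups, and treating every resulting token (including "1" and "ppn=2+128") as a node feature.  Since no node has those features, no job ever runs.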
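
The pattern in the Features line can be reproduced with a small bash sketch.  This is only an illustration of the tokenization, not Maui's actual parsing code; the function names colon_only and by_group are made up for the example.

```shell
#!/usr/bin/env bash
# Illustration only: reproduces the token pattern seen in the checkjob output.
# colon_only and by_group are hypothetical helper names, not Torque/Maui tools.

spec='1:master:ppn=2+128:compute:ppn=2'

# Split on ":" alone. The "+" is never treated as a separator, so
# "ppn=2+128" survives as a single token.
colon_only() {
  local -a parts
  IFS=':' read -r -a parts <<< "$1"
  printf '[%s]' "${parts[@]}"
  printf '\n'
}

# Intended reading: split on "+" into node groups first,
# then on ":" within each group.
by_group() {
  local -a groups fields
  local g
  IFS='+' read -r -a groups <<< "$1"
  for g in "${groups[@]}"; do
    IFS=':' read -r -a fields <<< "$g"
    printf '[%s]' "${fields[@]}"
    printf '\n'
  done
}

colon_only "$spec"
by_group "$spec"
```

The first call prints the same five tokens that checkjob lists as Features for Req[0]; the second prints one line per node group, which is how the same string behaves when given on the qsub command line.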

I am at my wit's end here.  Has anyone seen this before?  Better still, has anyone seen it and solved it?

Thanks in advance,

Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG HPC Team
(873-4417)
