[Mauiusers] Re: [torqueusers] One queue (fast) works - one queue doesn't (medium)
Thomas H Dr Pierce
TPierce at rohmhaas.com
Thu May 1 08:12:13 MDT 2008
Hi Steve,
Thanks for thinking about this.
But there are other nodes in the medium queue that do not have a load.
I think the problem is maui-3.2.6p20, with something wrong in the multiple-queue
configuration somewhere. I am considering installing Maui p18, which used to work with multiple queues.
===========================================================
[ ~]$ qstat
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
392.ralphie ba 6xwwww 0 Q medium
395.ralphie Ba xxxxxx 22:22:03 R fast
396.ralphie Ba eeeeee 22:21:18 R fast
397.ralphie Ba ffffff 20:39:29 R fast
[ ~]$ checknode node41
checking node node41
State: Idle (in current state for 18:06:02)
Configured Resources: PROCS: 2 MEM: 7977M SWAP: 7977M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 2.010
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 1:19:41:35 Up: 1:15:39:08 (90.75%) Active: 00:02:12 (0.08%)
Reservations:
NOTE: no reservations on node
ALERT: node is in state Idle but load is high (2.010)
[ ~]$ checknode node42
checking node node42
State: Idle (in current state for 18:06:13)
Configured Resources: PROCS: 2 MEM: 7977M SWAP: 7977M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.830
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 1:19:41:46 Up: 1:19:40:54 (99.97%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
[ ~]$ checknode node43
checking node node43
State: Idle (in current state for 18:06:13)
Configured Resources: PROCS: 2 MEM: 7977M SWAP: 7977M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 1:19:41:46 Up: 1:19:41:05 (99.97%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
[ ~]$ checknode node46
checking node node46
State: Idle (in current state for 18:06:13)
Configured Resources: PROCS: 2 MEM: 7977M SWAP: 7977M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 1.090
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 1:19:41:46 Up: 18:42:50 (42.83%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
ALERT: node is in state Idle but load is high (1.090)
[ ~]$ checknode node47
checking node node47
State: Idle (in current state for 18:06:35)
Configured Resources: PROCS: 2 MEM: 3042M SWAP: 4881M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.060
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 82:05:22:27 Up: 72:17:20:32 (88.44%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
------
Sincerely,
Tom Pierce
Steve Young <slyoung at hamilton.edu>
04/30/2008 07:03 PM
To
Thomas H Dr Pierce <TPierce at rohmhaas.com>
cc
Subject
Re: [torqueusers] One queue (fast) works - one queue doesn't (medium)
Hi,
If you look at your checknode output, you'll notice an alert: the machine
is idle (no jobs scheduled on it), but its load is already at the maximum of
2.0. Because of this, node46 won't be allocated until the load goes
back down.
-Steve
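Steve's explanation can be written down as a simple eligibility rule. The sketch below is a hypothetical model, assuming a node counts as full once its load average reaches its configured PROCS (the "max of 2.0" above); it is not Maui's code, and Maui's real load handling depends on its configuration. Applied to the later checknode readings in Tom's reply, only node41 is excluded, which matches Tom's objection that other medium nodes are unloaded. Note that checknode raised its ALERT for node46 at a load of only 1.090, so Maui's actual threshold may well be lower than this sketch assumes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    procs: int    # "Configured Resources: PROCS" from checknode
    load: float   # "Load" from checknode


def is_load_saturated(node: Node) -> bool:
    # Hypothetical rule: the node counts as full once its load average
    # reaches its configured processor count.
    return node.load >= node.procs


def allocatable(nodes: list[Node]) -> list[str]:
    # Names of nodes the simplified load rule would still hand out.
    return [n.name for n in nodes if not is_load_saturated(n)]


# Load readings copied from the later checknode output in Tom's reply.
medium_nodes = [
    Node("node41", 2, 2.010),
    Node("node42", 2, 0.830),
    Node("node43", 2, 0.000),
    Node("node46", 2, 1.090),
    Node("node47", 2, 0.060),
]

print(allocatable(medium_nodes))  # ['node42', 'node43', 'node46', 'node47']
```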
On Apr 30, 2008, at 4:41 PM, Thomas H Dr Pierce wrote:
Dear Schedulers,
I cannot tell if this is a Torque or a Maui issue.
Basically, one queue runs jobs (the fast queue) and the other does not
(the medium queue). It seems to want 16 GB of swap, a setting I
cannot override.
Why would one queue work and the other not?
Thanks for your help.
Qstat:
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
392.ralphie base_83 xxxxxx 0 Q medium
395.ralphie Ba yyyyyy 05:18:33 R fast
396.ralphie Ba yyyyyy 05:18:16 R fast
397.ralphie Ba yyyyyy 03:27:32 R fast
402.ralphie cfd zzzzzz 00:25:23 R fast
404.ralphie test_8300 wwwwww 0 Q medium
checkjob 404
checking job 404
State: Idle EState: Deferred
Creds: user:rrrrrr group:users class:medium qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Wed Apr 30 16:20:46
(Time Queued Total: 00:04:03 Eligible: 00:00:01)
Total Tasks: 2
Req[0] TaskCount: 2 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [d1850]
Dedicated Resources Per Task: PROCS: 1 MEM: 250M SWAP: 16G
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for job
'404' (intital reservation attempt)
)
Holds: Defer (hold reason: NoResources)
PE: 15.80 StartPriority: 1
cannot select job 404 for partition DEFAULT (job hold active)
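Tom's suspicion about the 16 GB swap request can be checked with plain arithmetic. The sketch below is a simplified per-task fit check, not Maui's matching logic; the unit handling (only M and G, with 1G = 1024M) is an assumption. It compares job 404's "Dedicated Resources Per Task" against the configured resources of two medium nodes. Since 16G is 16384M and every medium node shown in this thread reports SWAP of 7977M or less, no single node could hold even one task under this model.

```python
# Size units as they appear in checkjob/checknode output (assumed: 1G = 1024M).
UNITS_MB = {"M": 1, "G": 1024}


def to_mb(size: str) -> int:
    """Convert a size such as '250M' or '16G' to megabytes."""
    number, unit = size[:-1], size[-1].upper()
    return int(number) * UNITS_MB[unit]


def task_fits(node: dict[str, str], task: dict[str, str]) -> dict[str, bool]:
    """For each per-task request, report whether the node's configured amount covers it."""
    return {res: to_mb(amount) <= to_mb(node[res]) for res, amount in task.items()}


# "Dedicated Resources Per Task" for job 404 (from checkjob 404).
job_404_task = {"MEM": "250M", "SWAP": "16G"}

# "Configured Resources" for two medium-queue nodes (from checknode).
nodes = {
    "node46": {"MEM": "7977M", "SWAP": "7977M"},
    "node47": {"MEM": "3042M", "SWAP": "4881M"},
}

for name, resources in nodes.items():
    print(name, task_fits(resources, job_404_task))
# node46 {'MEM': True, 'SWAP': False}
# node47 {'MEM': True, 'SWAP': False}
```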
===========================================================================================
Output of qmgr -c "p s":
# Create and define queue fast
#
create queue fast
set queue fast queue_type = Execution
set queue fast Priority = 40
set queue fast max_running = 64
set queue fast acl_host_enable = False
set queue fast acl_hosts = node19
set queue fast acl_hosts += node09
set queue fast acl_hosts += node18
set queue fast acl_hosts += node08
set queue fast resources_default.neednodes = d1950
set queue fast resources_default.nodes = 1
set queue fast resources_available.nodect = 64
set queue fast enabled = True
set queue fast started = True
#
# Create and define queue medium
#
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 40
set queue medium max_running = 10
set queue medium acl_host_enable = False
set queue medium acl_hosts = node49
set queue medium acl_hosts += node48
set queue medium acl_hosts += node41
set queue medium acl_hosts += node46
set queue medium acl_hosts += node43
set queue medium acl_hosts += node42
set queue medium resources_max.mem = 32gb
set queue medium resources_max.vmem = 32gb
set queue medium resources_default.neednodes = d1850
set queue medium resources_default.nodes = 1
set queue medium resources_available.nodect = 40
set queue medium enabled = True
set queue medium started = True
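Tom's question, why one queue works and the other does not, invites a side-by-side comparison of the two queue definitions. The hypothetical helper below parses `set queue` lines (an excerpt of the output above, with the acl_hosts lines left out for brevity) and prints only the attributes that differ. Besides the node property, node count and max_running, the medium queue alone sets resources_max.mem and resources_max.vmem.

```python
from collections import defaultdict

# Excerpt of the `qmgr -c "p s"` output above (acl_hosts lines omitted).
QMGR_OUTPUT = """\
set queue fast Priority = 40
set queue fast max_running = 64
set queue fast resources_default.neednodes = d1950
set queue fast resources_default.nodes = 1
set queue fast resources_available.nodect = 64
set queue medium Priority = 40
set queue medium max_running = 10
set queue medium resources_max.mem = 32gb
set queue medium resources_max.vmem = 32gb
set queue medium resources_default.neednodes = d1850
set queue medium resources_default.nodes = 1
set queue medium resources_available.nodect = 40
"""


def parse_queues(text: str) -> dict[str, dict[str, list[str]]]:
    """Map queue -> attribute -> list of values from 'set queue' lines."""
    queues = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: set queue <queue> <attribute> <op> <value...>
        if len(parts) < 6 or parts[:2] != ["set", "queue"]:
            continue
        queue, attr, op = parts[2], parts[3], parts[4]
        value = " ".join(parts[5:])
        if op == "+=":
            # Appending form, as used for acl_hosts.
            queues[queue].setdefault(attr, []).append(value)
        else:
            queues[queue][attr] = [value]
    return queues


def diff_queues(queues, a: str, b: str) -> dict[str, tuple]:
    """Attributes whose values differ between queues a and b (None = unset)."""
    attrs = sorted(set(queues[a]) | set(queues[b]))
    return {
        attr: (queues[a].get(attr), queues[b].get(attr))
        for attr in attrs
        if queues[a].get(attr) != queues[b].get(attr)
    }


for attr, (fast, medium) in diff_queues(parse_queues(QMGR_OUTPUT), "fast", "medium").items():
    print(f"{attr}: fast={fast} medium={medium}")
```

One arithmetic coincidence may be worth checking: 32gb of vmem split across job 404's two tasks is exactly the 16G per-task SWAP that checkjob reports. Nothing in this thread confirms that Maui derived the swap request from this limit, so treat it as a lead rather than a diagnosis.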
==================================================================================================
pbsnodes -a
node41
state = free
np = 2
properties = d1850
ntype = cluster
status = opsys=linux,uname=Linux node41 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4329 4379,nsessions=2,nusers=2,idletime=33784,totmem=5678720kb,availmem=2859524kb,physmem=8169096kb,ncpus=4,loadave=1.98,netload=4294967294,state=free,jobs=? 0,rectime=1209587507
node42
state = free
np = 2
properties = d1850
ntype = cluster
status = opsys=linux,uname=Linux node42 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4321 6593,nsessions=2,nusers=2,idletime=818682,totmem=4826752kb,availmem=1952512kb,physmem=8169096kb,ncpus=4,loadave=0.87,netload=4294967294,state=free,jobs=? 0,rectime=1209587506
node43
state = free
np = 2
properties = d1850
ntype = cluster
status = opsys=linux,uname=Linux node43 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64,sessions=4319,nsessions=1,nusers=1,idletime=818333,totmem=6006400kb,availmem=5874292kb,physmem=8169096kb,ncpus=4,loadave=0.00,netload=38605147,state=free,jobs=? 0,rectime=1209587506
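The `status` attribute printed by pbsnodes is a comma-separated list of key=value pairs reported by the node's pbs_mom. A small parser makes it easier to line these figures up against checknode: for node41, loadave=1.98 is the load Maui alerts on, and physmem=8169096kb works out to 7977 MB, the MEM figure checknode reports. This is a sketch; it assumes no value contains a comma (true for the lines above) and that the archive wrapped the original line at spaces.

```python
# Status line for node41 from `pbsnodes -a` above, rejoined onto one line.
NODE41_STATUS = (
    "opsys=linux,uname=Linux node41 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 "
    "x86_64,sessions=4329 4379,nsessions=2,nusers=2,idletime=33784,"
    "totmem=5678720kb,availmem=2859524kb,physmem=8169096kb,ncpus=4,"
    "loadave=1.98,netload=4294967294,state=free,jobs=? 0,rectime=1209587507"
)


def parse_status(status: str) -> dict[str, str]:
    """Split a pbs_mom status string into a key -> value mapping."""
    fields = {}
    for item in status.split(","):
        # Split on the first '=' only; values like uname contain spaces.
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def kb_to_mb(size: str) -> int:
    """Convert a value like '8169096kb' to whole megabytes."""
    return int(size.removesuffix("kb")) // 1024


fields = parse_status(NODE41_STATUS)
print("loadave:", fields["loadave"])               # 1.98
print("physmem MB:", kb_to_mb(fields["physmem"]))  # 7977
print("state:", fields["state"])                   # free
```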
=====================================================================================================
checknode node46
checking node node46
State: Idle (in current state for 1:03:36)
Configured Resources: PROCS: 2 MEM: 7977M SWAP: 7977M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 2.000
Network: [DEFAULT]
Features: [d1850]
Attributes: [Batch]
Classes: [medium 2:2][batch 2:2]
Total Time: 1:02:46:37 Up: 1:47:41 (6.70%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
ALERT: node is in state Idle but load is high (2.000)