[torqueusers] Memory resource limits and rlimits on Linux
chindw at wfu.edu
Wed Oct 13 11:55:06 MDT 2010
I've been looking closely at the Torque memory limits (mem, pmem, vmem,
pvmem) and the Linux rlimits (aka ulimits in bash). We've been having
jobs which exceed their requested memory amounts and crash nodes.
I've been looking at this old thread as reference:
From doing tests, and browsing source for torque-2.5.2, these are the
things I've discovered:
1. mom_over_limit() in src/resmom/linux/mom_mach.c does NOT check
"mem", only vmem and pvmem. The patch that Anton Starikov attached to
the old thread did not make it into the source tree.
2. When setting mem, pmem, vmem, pvmem in the Torque script, only
"pmem" actually gets translated into an rlimit ("data"). The other
three resources (mem, vmem, and pvmem) are ignored. If I understand
correctly, that's correct behavior for mem and vmem, which are summed
limits over all processes in the job. But I would have thought setting
pvmem would have set the address space (aka virtual memory) limit.
3. While torque does cancel a job if it runs over its walltime
request, torque does nothing about jobs which run over their mem
request. It leaves that to the scheduler to cancel.
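For anyone who wants to repeat the test in point 2, here is a minimal sketch for printing the rlimits a job actually receives. It assumes Python is available on the compute node and is run from inside the job script; the helper names (LIMITS, fmt, read_limits) are made up for illustration and are not part of Torque. If the behavior is as described above, only the "data" line should change when pmem is requested.

```python
import resource

# Labels follow what bash's `ulimit` reports; the constants come from
# Python's standard `resource` module on Linux.
LIMITS = {
    "data (ulimit -d)": resource.RLIMIT_DATA,
    "address space (ulimit -v)": resource.RLIMIT_AS,
    "resident set (ulimit -m)": resource.RLIMIT_RSS,
}


def fmt(value):
    """Render an rlimit value, which getrlimit() reports in bytes."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def read_limits():
    """Return {label: (soft, hard)} for the limits of this process."""
    return {name: resource.getrlimit(const) for name, const in LIMITS.items()}


if __name__ == "__main__":
    for name, (soft, hard) in read_limits().items():
        print(f"{name}: soft={fmt(soft)}, hard={fmt(hard)}")
```

Running it once per resource request (mem, pmem, vmem, pvmem) and comparing the output against an interactive shell on the same node should show which requests end up as rlimits.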
Is this how things are supposed to be? It seems to me that points 1
and 2 indicate bugs. (If it matters, I use maui-3.3 compiled against
David Chin, Ph.D.
chindw at wfu.edu High Performance Computing Systems Analyst
Office: 336-758-2964 Wake Forest University
Mobile: 336-608-0793 Winston-Salem, NC
Email-to-txt: 3366080793 at mms.att.net
Google Talk: chindw at wfu.edu