[torqueusers] job exceeds memory limit without being killed
ant.starikov at gmail.com
Thu Mar 18 07:26:05 MDT 2010
The job is scheduled on a node with 64 GB of RAM, and the memory limit for the job is 60 GB. At some point the job exceeded its memory limit and crashed the node. That would be understandable if it happened between two checks by PBS_MOM, but after the crash I checked what the server knows about the job and saw:
Resource_List.mem = 60gb
resources_used.mem = 65557856kb
This means that PBS_MOM had already registered memory usage above the limit and even reported it to the server, but did not react and kill the job.
What could be wrong? Am I missing something in the config?
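To make the comparison of the two values explicit, here is a small Python sketch that converts both to kilobytes. It assumes the size suffixes are binary multiples (1 kb = 1024 bytes, 1 gb = 1024^3 bytes); the helper name to_kb is made up for this illustration and is not part of TORQUE.

```python
# Assumption: size suffixes are binary multiples (1 kb = 1024 bytes).
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def to_kb(size):
    """Convert a size string such as '60gb' to kilobytes (hypothetical helper)."""
    text = size.strip().lower()
    # Try the longer suffixes first so 'kb' is not mistaken for 'b'.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * UNITS[suffix] // 1024
    return int(text) // 1024  # a bare number is taken as bytes

limit_kb = to_kb("60gb")        # Resource_List.mem
used_kb = to_kb("65557856kb")   # resources_used.mem
excess_kb = used_kb - limit_kb
print(limit_kb, used_kb, excess_kb)
```

Under that assumption the limit is 62914560 kb, so the reported usage is 2643296 kb (roughly 2.5 GB) over it.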