<div>Here is the diagnose -j output for the two jobs that are running on node28:</div>
<div>/==============================/</div>
<div>diagnose -j 228620<br>Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features</div>
<p>228620 Running DEF 1 low 10:00:00:00 1 1 ad_user pu_group - 2:49:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 [heavy:1] [NONE]<br>WARNING: job '228620' utilizes more memory than dedicated (3432 > 512)</p>
<p>diagnose -j 228621<br>Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features</p>
<p>228621 Running DEF 1 low 10:00:00:00 1 1 ad_user pu_group - 2:49:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 [heavy:1] [NONE]<br>WARNING: job '228621' utilizes more memory than dedicated (3595 > 512)</p>
<div>/==============================/</div>
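<div>Since almost all of our jobs show this warning, a small script could tally how far each job exceeds its dedicated memory. The following is only a minimal sketch: the function name memory_overruns and the sample text are made up for illustration, and reading both numbers as MB is an assumption based on the "MEM: 512M" value that checkjob reports.</div>

```python
import re

# Matches the warning line exactly as it appears in the diagnose -j output.
WARNING_RE = re.compile(
    r"WARNING:\s+job '(?P<job>[^']+)' utilizes more memory than dedicated "
    r"\((?P<used>\d+) > (?P<dedicated>\d+)\)"
)


def memory_overruns(text):
    """Return (job_id, used, dedicated) tuples for every warning found in text.

    The unit of the two numbers is assumed to be MB (hypothetical helper).
    """
    return [
        (m.group("job"), int(m.group("used")), int(m.group("dedicated")))
        for m in WARNING_RE.finditer(text)
    ]


# Sample input copied from the two warnings quoted in this message.
DIAGNOSE_SAMPLE = """\
WARNING: job '228620' utilizes more memory than dedicated (3432 > 512)
WARNING: job '228621' utilizes more memory than dedicated (3595 > 512)
"""

if __name__ == "__main__":
    for job, used, dedicated in memory_overruns(DIAGNOSE_SAMPLE):
        print(f"{job}: {used} used vs {dedicated} dedicated "
              f"({used / dedicated:.1f}x)")
```

<div>In practice the text would come from saved diagnose -j output rather than the inline sample.</div>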
<div> </div>
<div>And here is the checkjob -v output for the same two jobs:</div>
<div> </div>
<div>/==============================/</div>
<div> </div>
<div>checking job 228620 (RM job '228620.cluster')</div>
<div>State: Running<br>Creds: user:ad_user group:pu_group class:heavy qos:low<br>WallTime: 6:31:31 of 10:00:00:00<br>SubmitTime: Tue Jan 29 16:14:14<br> (Time Queued Total: 00:00:01 Eligible: 00:00:01)</div>
<div>StartTime: Tue Jan 29 16:14:15<br>Total Tasks: 1</div>
<div>Req[0] TaskCount: 1 Partition: DEFAULT<br>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0<br>Opsys: [NONE] Arch: [NONE] Features: [NONE]<br>Exec: '' ExecSize: 0 ImageSize: 0<br>Dedicated Resources Per Task: PROCS: 1 MEM: 512M<br>
Utilized Resources Per Task: PROCS: 0.13 MEM: 34.32 SWAP: 35.44<br>Avg Util Resources Per Task: PROCS: 0.10<br>Max Util Resources Per Task: PROCS: 0.13 MEM: 34.32 SWAP: 35.44<br>Average Utilized Memory: 3408.54 MB<br>
Average Utilized Procs: 0.61<br>NodeAccess: SHARED<br>NodeCount: 1<br>Allocated Nodes:<br>[node28:1]<br>Task Distribution: node28</div>
<div><br>IWD: [NONE] Executable: [NONE]<br>Bypass: 0 StartCount: 1<br>PartitionMask: [ALL]<br>SystemQueueTime: Tue Jan 29 19:53:18</div>
<div>Flags: RESTARTABLE</div>
<div>Reservation '228620' (-6:31:19 -> 9:17:28:41 Duration: 10:00:00:00)<br>PE: 1.00 StartPriority: 200<br></div>
<div> </div>
<p>checking job 228621 (RM job '228621.cluster')</p>
<p>State: Running<br>Creds: user:ad_user group:pu_group class:heavy qos:low<br>WallTime: 6:24:00 of 10:00:00:00<br>SubmitTime: Tue Jan 29 16:22:46<br> (Time Queued Total: 00:00:01 Eligible: 00:00:01)</p>
<p>StartTime: Tue Jan 29 16:22:47<br>Total Tasks: 1</p>
<div>Req[0] TaskCount: 1 Partition: DEFAULT<br>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0<br>Opsys: [NONE] Arch: [NONE] Features: [NONE]<br>Exec: '' ExecSize: 0 ImageSize: 0<br>Dedicated Resources Per Task: PROCS: 1 MEM: 512M<br>
Utilized Resources Per Task: PROCS: 0.10 MEM: 35.95 SWAP: 39.56<br>Avg Util Resources Per Task: PROCS: 0.08<br>Max Util Resources Per Task: PROCS: 0.10 MEM: 35.95 SWAP: 39.56<br>Average Utilized Memory: 3561.67 MB<br>
Average Utilized Procs: 0.58<br>NodeAccess: SHARED<br>NodeCount: 1<br>Allocated Nodes:<br>[node28:1]<br>Task Distribution: node28</div>
<div><br>IWD: [NONE] Executable: [NONE]<br>Bypass: 0 StartCount: 1<br>PartitionMask: [ALL]<br>SystemQueueTime: Tue Jan 29 19:53:18</div>
<p>Flags: RESTARTABLE</p>
<p>Reservation '228621' (-6:23:49 -> 9:17:36:11 Duration: 10:00:00:00)<br>PE: 1.00 StartPriority: 200</p>
<p> </p>
<div>/==============================/</div>
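<div>To compare the dedicated and the actually used memory per job, the two relevant lines of the checkjob -v output can be extracted with a short script. This is a rough sketch, not a tested tool: checkjob_memory is a hypothetical helper name, it expects the plain-text command output (one field per line, not the HTML above), and it assumes the field labels look exactly like the ones shown here.</div>

```python
import re


def checkjob_memory(text):
    """Return (dedicated_mb, avg_used_mb) parsed from checkjob -v text.

    Returns None if either field is missing. Hypothetical helper that
    assumes the labels match the output quoted in this message.
    """
    dedicated = re.search(
        r"Dedicated Resources Per Task:.*?MEM:\s*(\d+)M", text
    )
    average = re.search(r"Average Utilized Memory:\s*([\d.]+)\s*MB", text)
    if dedicated is None or average is None:
        return None
    return int(dedicated.group(1)), float(average.group(1))


# Lines copied from the checkjob -v output of job 228620.
CHECKJOB_SAMPLE = """\
Dedicated Resources Per Task: PROCS: 1 MEM: 512M
Utilized Resources Per Task: PROCS: 0.13 MEM: 34.32 SWAP: 35.44
Average Utilized Memory: 3408.54 MB
"""

if __name__ == "__main__":
    result = checkjob_memory(CHECKJOB_SAMPLE)
    if result is not None:
        dedicated_mb, avg_used_mb = result
        print(f"dedicated {dedicated_mb} MB, average used {avg_used_mb} MB")
```

<div>For job 228620 this yields 512 MB dedicated against an average of 3408.54 MB used, the same gap that the diagnose -j warning reports.</div>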
<div><br>What does the 0:4 mean?</div>
<div>Could this be related to the way in which the user is running the job itself (the one that qsub runs)?</div>
<div>Or should I check something on the nodes, such as the load average, or something else?<br>BTW, almost all of our jobs show the warning 'WARNING: job '{job_id}' utilizes more memory than dedicated (xxxx > 512)'. Should I change the default memory assigned to jobs? Currently the default is 512MB.<br>
<br></div>
<div class="gmail_quote">On Jan 29, 2008 10:36 PM, Jan Ploski &lt;<a href="mailto:Jan.Ploski@offis.de">Jan.Ploski@offis.de</a>&gt; wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div class="Ih2E3d"><br><br> </div>Can you also report the output of checkjob and diagnose -j on these 2<br>jobs? Do they also have the MEM requirement?<br>
<div class="Ih2E3d"><br>> About the MEM requirement: do you mean to unset it too? Other than that<br>> we don't use any MEM requirement in our qsub script.<br><br></div>Well, it must be coming from somewhere, quite possibly from a default in<br>
the queue or server configuration. So I'd try unsetting it there.<br>However, looking at the diagnose -n output above makes me think it is<br>processor related - judging from the 0:4, for some unknown reason your<br>jobs consume 2 processors each rather than 1.<br>
<br>Regards,<br><font color="#888888">Jan Ploski<br></font></blockquote></div><br>