<div>Well, you definitely came up with something interesting. The NODEAVAILABILITYPOLICY looks as if it should help me resolve this issue (though so far it hasn't).</div>
<div> </div>
<div>I ran the following tests to figure out what is going on behind the scenes of the cluster:</div>
<div> </div>
<div>1. I listed all the nodes for which diagnose -n reports "has more processors utilized than dedicated".</div>
<div>2. Then I submitted several very short jobs (2 minutes), targeting one job at each of the nodes listed above. I used -l host={nodename} -l walltime=00:00:02. The purpose of the walltime is to make sure MAUI does not apply any reservation policies to the jobs (in fact the cluster had many free CPUs at the time of the test, so no reservations were expected). I expected the jobs *not* to go into R state, because every job was targeted at a node that "has more processors utilized than dedicated".</div>
<div>3. Indeed, that's what happened! None of the jobs went from Q state to R state. They waited in the queue for a very long time (hours).</div>
<div>4. I then checked the load average on each of the nodes listed above, and found that it is indeed higher than the number of processors dedicated to the jobs running there. For example, the 'nodes' file says 'node22 np=4'. I checked its load average while it was reported as "has more processors utilized than dedicated", and found that although this node was running only 2 jobs at the time, the load average was higher than that (about 2.70). I expect this node to run 4 jobs at the same time.</div>
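<div>The comparison in step 4 can be sketched roughly as follows. This is only an illustration of the arithmetic, not how MAUI decides anything internally: the helper names parse_nodes_line and load_report are made up for this example, and treating a node without an explicit np= as having 1 processor is an assumption.</div>

```python
def parse_nodes_line(line):
    """Parse a 'nodes' file line such as 'node22 np=4' into (name, np)."""
    fields = line.split()
    name = fields[0]
    np = 1  # assumption: no explicit np= means a single processor
    for field in fields[1:]:
        if field.startswith("np="):
            np = int(field[3:])
    return name, np


def load_report(np, running_jobs, load_avg):
    """Compare a load average with the running job count and the configured np."""
    return {
        # load beyond one unit per running job
        "excess_load": round(load_avg - running_jobs, 2),
        # more load than the processors dedicated to running jobs
        "over_dedicated": load_avg > running_jobs,
        # more load than the node is configured for
        "over_configured": load_avg > np,
    }


if __name__ == "__main__":
    name, np = parse_nodes_line("node22 np=4")
    # Values taken from the node22 example: 2 running jobs, load about 2.70.
    print(name, load_report(np, running_jobs=2, load_avg=2.70))
```

<div>For node22 this flags the load as above the 2 dedicated processors but still below the configured np=4. On the node itself, Python's os.getloadavg() could supply the live load average instead of the hard-coded 2.70.</div>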
<div> </div>
<div>> Are these 2 jobs multithreaded? Is the load ~4 while it should be ~2?</div>
<div>I'm not sure whether they are multithreaded (this needs further checking with the developers) - but you're right. The load should be no more than 2 for 2 jobs, but in fact it's >2. The jobs are C++ programs compiled with g++. Maybe a compilation switch would help reduce the load average to 1 per job?</div>
<div> </div>
<div>I then moved to the next step, and set the NODEAVAILABILITYPOLICY to UTILIZED. The showconfig command now says:</div>
<div>NODEAVAILABILITYPOLICY[0] UTILIZED:[DEFAULT]<br></div>
<div>As this didn't make the jobs run, perhaps it's a matter of another tweak to the NODEAVAILABILITYPOLICY setting?</div>
<div> </div>
<div>One more thing about the diagnose -j output: I'm not sure whether, or how, I should treat the 'WARNING: job '{job_id}' utilizes more memory than dedicated (xxxx > 512) ' message. A vmstat test shows that jobs are indeed swapping heavily on the node.</div>
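<div>To see at a glance which jobs exceed their dedicated memory and by how much, the warnings could be pulled out of saved diagnose -j output with something like the sketch below. The regular expression assumes the message looks exactly like the one quoted above; the job id and the 2048 figure in the sample are made up, and memory_overruns is a hypothetical helper name.</div>

```python
import re

# Assumed message format, based on the warning quoted in this post.
WARNING_RE = re.compile(
    r"WARNING:\s+job '([^']+)' utilizes more memory than dedicated "
    r"\((\d+) > (\d+)\)"
)


def memory_overruns(diagnose_output):
    """Return (job_id, used, dedicated, ratio) for every matching warning."""
    results = []
    for match in WARNING_RE.finditer(diagnose_output):
        job_id = match.group(1)
        used = int(match.group(2))
        dedicated = int(match.group(3))
        results.append((job_id, used, dedicated, used / dedicated))
    return results


if __name__ == "__main__":
    # Hypothetical sample line; the real output may be spaced differently.
    sample = "WARNING: job '1234' utilizes more memory than dedicated (2048 > 512)"
    for job_id, used, dedicated, ratio in memory_overruns(sample):
        print(f"job {job_id}: {used} used vs {dedicated} dedicated ({ratio:.1f}x)")
```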
<div> </div>
<div>Thanks,</div>
<div>Itay.</div>
<div><br> </div>
<div class="gmail_quote">On Jan 30, 2008 12:26 AM, Jan Ploski <<a href="mailto:Jan.Ploski@offis.de">Jan.Ploski@offis.de</a>> wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid"><br><font color="#888888">Jan Ploski<br></font></blockquote></div><br>