<br><br>
<div class="gmail_quote">On Mon, Jun 23, 2008 at 2:57 PM, Kamil Kisiel &lt;<a href="mailto:kamil@zymeworks.com">kamil@zymeworks.com</a>&gt; wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div style="WORD-WRAP: break-word">
<div class="Ih2E3d">
<div>
<div>On 9-Jun-08, at 14:02 , Kamil Kisiel wrote:</div><br>
<blockquote type="cite">
<div>Occasionally some of our cluster nodes send out a syslog message such as:<br><br><a href="http://node071.cluster.zymeworks.com/" target="_blank">node071.cluster.zymeworks.com</a> pbs_mom: No such process (3) in resi_sum, 797: get_proc_stat<br>
<br>The number after "resi_sum" is different in each message, presumably it's the PID of some process.<br><br>What does this mean, and what could be causing it?<br></div></blockquote></div><br></div>
<div>So far I haven't had any reply to this. Nobody has any clue?</div></div></blockquote>
<div> </div>
<div>How often do you see this? I haven't had a chance to look at this in detail, but one possibility is that the process with that PID is dying and resi_sum is being called before pbs_mom notices that the process has exited. If it happens often, please send me as much information as you can, especially the TORQUE version.</div>
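<div> </div>
<div>To illustrate the kind of race I mean, here is a rough Python sketch. It is not the pbs_mom code, just an assumption about the general pattern: a PID is collected first and its /proc entry is read later, and the process can exit in between. The names read_proc_stat, sample_pids and the proc_root parameter are made up for this example. On Linux the "(3)" in your message matches errno ESRCH, "No such process".</div>

```python
import errno
import os


def read_proc_stat(pid, proc_root="/proc"):
    """Return the raw stat line for pid, or None if the process has vanished."""
    path = os.path.join(proc_root, str(pid), "stat")
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # ENOENT: the /proc entry was already gone when we opened it.
        # ESRCH: the process exited between open() and read().
        if e.errno in (errno.ENOENT, errno.ESRCH):
            return None
        raise


def sample_pids(pids, proc_root="/proc"):
    """Read stat for each PID; report the ones that exited before we got to them."""
    stats, vanished = {}, []
    for pid in pids:
        line = read_proc_stat(pid, proc_root)
        if line is None:
            vanished.append(pid)
        else:
            stats[pid] = line
    return stats, vanished
```

<div>If that is what is happening, a PID that ends up in the vanished list is harmless: the process simply finished between two polls.</div>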
<div> </div>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div style="WORD-WRAP: break-word">
<div><br><br>I also noticed that jobs run through MPI are under-reporting the cputime used in qstat output. Is that related, or a separate issue?</div></div></blockquote>
<div> </div>
<div>Which MPI implementation and which job launcher (mpiexec/mpirun) do you use? If the launcher does not use TM (the task manager API provided by TORQUE, OpenPBS, and PBS Pro) to spawn all of the remote processes, then the CPU time will be under-reported, because those processes are outside the control of TORQUE. If you let us know which MPI and launcher you use, we can tell for sure whether this is what is going on. In addition to under-reporting CPU time, a non-TM launcher can also leave behind processes that are not cleaned up when a job crashes or is killed prematurely.</div>
<div> </div></div>