<div dir="ltr">It is the only check_ps we're using, but after your explanation, I'm going to stick more in :)<div><br></div><div style>Thanks again,</div><div style> - Matt</div><div style><br></div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">On Thu, Jan 24, 2013 at 3:51 PM, Michael Jennings <span dir="ltr"><<a href="mailto:mej@lbl.gov" target="_blank">mej@lbl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Thursday, 24 January 2013, at 15:26:14 (-0500),<br>
<div class="im">Matt Britt wrote:<br>
<br>
> Thanks Michael - that got me pointed in the right direction. We're<br>
> just using /etc/passwd, and it should be up to date. The function<br>
> using the time was 'check_ps_daemon sshd root':<br>
><br>
> [root@nyx5506 msbritt]# time nhc (with check_ps_daemon)<br>
><br>
> real 0m5.785s<br>
> user 0m5.565s<br>
> sys 0m0.101s<br>
> [root@nyx5506 msbritt]# !vim<br>
> vim /etc/nhc/nhc.conf<br>
> [root@nyx5506 msbritt]# time nhc (without check_ps_daemon)<br>
><br>
> real 0m0.185s<br>
> user 0m0.109s<br>
> sys 0m0.055s<br>
<br>
</div>Wow, that's quite a difference. :-)<br>
<br>
Is that the only check_ps_* check in your configuration? I'm guessing<br>
it is, based on the time delay.<br>
<br>
What happens is this: the first time you use one of the process-based<br>
checks, NHC will run the "ps" command to gather information on all<br>
your system processes. This can, as you're seeing, take quite a bit<br>
of time on a heavily-loaded compute node. However, it only needs to<br>
do this once; if you use one ps-based check, you can use as many as<br>
you want because you've already "taken the hit" of the subprocess<br>
overhead. Subsequent checks will use the cached data instead of<br>
launching "ps" again.<br>
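<br>
The run-once-then-reuse pattern described above can be sketched roughly as<br>
follows. This is a minimal illustration in Python, not NHC's actual<br>
implementation: the names run_ps, ProcessTable, fetch_count, and<br>
daemon_running are all hypothetical, and the "ps" options shown are just one<br>
POSIX-style way to list each process's owner and command name.<br>
<pre>
```python
import subprocess


def run_ps():
    """Launch ps once and return (user, command) pairs for every process."""
    out = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            rows.append((parts[0], parts[1].strip()))
    return rows


class ProcessTable:
    """Hypothetical cache: fetches the process list on first use only."""

    def __init__(self, fetch=run_ps):
        self._fetch = fetch
        self._rows = None
        self.fetch_count = 0  # how many times the expensive fetch ran

    def rows(self):
        # Only the first caller pays for the subprocess; later callers
        # get the stored result.
        if self._rows is None:
            self._rows = self._fetch()
            self.fetch_count += 1
        return self._rows

    def daemon_running(self, command, user):
        """True if a process named `command` is owned by `user`."""
        return (user, command) in self.rows()
```
</pre>
With one ProcessTable per run, a first call such as<br>
daemon_running("sshd", "root") triggers the single "ps" launch, and any<br>
further checks against the same table reuse the stored rows.<br>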
<br>
Glad you found the culprit! NHC tries to be as efficient as possible<br>
in everything it does, but it's up to each site to determine how they<br>
want to balance the tradeoffs between longer/shorter execution time<br>
for NHC and more/less comprehensive assessments of node health. I<br>
tried to make it as easy as possible to measure and evaluate those<br>
tradeoffs; hopefully I succeeded. :-)<br>
<div class="HOEnZb"><div class="h5"><br>
Michael<br>
<br>
--<br>
Michael Jennings <<a href="mailto:mej@lbl.gov">mej@lbl.gov</a>><br>
Senior HPC Systems Engineer<br>
High-Performance Computing Services<br>
Lawrence Berkeley National Laboratory<br>
Bldg 50B-3209E W: <a href="tel:510-495-2687" value="+15104952687">510-495-2687</a><br>
MS 050B-3209 F: <a href="tel:510-486-8615" value="+15104868615">510-486-8615</a><br>
</div></div></blockquote></div><br></div>