<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3132" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=683532215-02102007><FONT face="Arial Narrow">Hello
All</FONT></SPAN></DIV>
<DIV><SPAN class=683532215-02102007><FONT face="Arial Narrow">We are running the
following versions of torque/maui:</FONT></SPAN></DIV>
<DIV><FONT face="Arial Narrow">Maui version
maui-3.2.6p16-snap.1157560841</FONT></DIV>
<DIV><FONT face="Arial Narrow"> /opt/sched/commands/sbin/pbs_server
--version<BR>version: 2.1.6<BR></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">My question
is this: our cluster is quite busy and has about 100
nodes.</FONT></SPAN></FONT></DIV>
<DIV><FONT face="Arial Narrow"><SPAN
class=683532215-02102007></SPAN></FONT> </DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">Every once
in a great while the system goes haywire in the following
way:</FONT></SPAN></FONT></DIV>
<DIV><FONT face="Arial Narrow"><SPAN
class=683532215-02102007></SPAN></FONT> </DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">We might
have hundreds of jobs running without any problems, and then at some point
10 or so nodes become available.</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">Let's say it
is nodes 1-10 that could be used.</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">What happens
next is this: the queued jobs are all scheduled against node1 and fly
through the system, without any jobs ever being scheduled against the remaining
nodes 2-10.</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">When the
user calls to say that all of his jobs have failed and we need to figure out what
happened, we realize at this point that:</FONT></SPAN></FONT></DIV>
<DIV><FONT face="Arial Narrow"><SPAN
class=683532215-02102007></SPAN></FONT> </DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">node1
responds to ping, but we can't rsh into it to see what is going
on. When we go to the console of node1, we see that it has perhaps suffered a disk
crash and is in a weird state, still somewhat limping
along.</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">BUT from the
scheduler's point of view, pbsnodes -a reports node1 as free and it is
pingable from the headnode. Because pbsnodes reports that status, all
the queued jobs get delivered to node1 and
disappear.</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">How do we
get around this problem? Has this particular issue been addressed in newer
versions of maui/torque?</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT face="Arial Narrow">Thanks for
any advice</FONT></SPAN></FONT></DIV>
<DIV><FONT><SPAN class=683532215-02102007><FONT
face="Arial Narrow">Dan</FONT></SPAN></FONT></DIV></BODY></HTML>