Torque was really easy to install, but it seems like my /etc/hosts file must be screwed up, as I can't get the cluster nodes to respond. Specifically, I have a cluster of 3 machines, each with an /etc/hosts file of:
<br><br> 127.0.0.1 localhost.localdomain localhost<br> 199.17.152.17 runner<br> 199.17.152.135 muscovey
<br> 199.17.152.13 pekin<br> (( other workstations follow ))<br><br>Now, when I have pbs_server running on runner and the pbs_mom daemons running on muscovey, pekin, and runner, I get the following status message:
<br><br> [root@runner torque-2.1.6]# pbsnodes -a<br> pekin<br> state = down<br> np = 1<br> ntype = cluster<br><br> muscovey<br> state = down<br> np = 1<br> ntype = cluster
<br><br> runner<br> state = down <br> np = 1<br> ntype = cluster<br><br>I realize this is a pretty low-level question, but what the heck is wrong with my /etc/hosts file?<br><br>regards,<br><br>
NT<br><br><br>P.S. The troubleshooting message given by Torque is:<br><br> [root@runner torque-2.1.6]# momctl -d 3<br><br> Host: runner/runner Version: 2.1.6<br> WARNING: server not specified (set $pbsserver)
<br> PID: 30531<br> HomeDirectory: /var/spool/torque/mom_priv<br> MOM active: 2518 seconds<br> Server Update Interval: 45 seconds<br> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
<br> Communication Model: RPP<br> TCP Timeout: 20 seconds<br> NOTE: no prolog configured<br> Alarm Time: 0 of 10 seconds<br> Trusted Client List: 199.17.152.17,127.0.0.1
<br> Configured to use /usr/bin/scp -rpB<br> NOTE: no local jobs detected<br><br> diagnostics complete<br><br clear="all"><br>- - - - - - - - - - - - - - - - - - - - -
<br>Nathan Moore<br>Assistant Professor, Physics<br>Winona State University<br>AIM: nmoorewsu <br>- - - - - - - - - - - - - - - - - - - - -