Hi Gus<div><br></div><div>Thanks for the info, but this doesn't seem to be related to why $PBS_NODEFILE only ever contains the entries for one node. I can ssh as myself and root passwordless between the headnode and compute nodes, using short hostnames, so I don't think there is a problem there.</div>
<div><br></div><div>Kind regards</div><div>Gordon<br clear="all"><br>-- max(∫(εὐδαιμονία)dt)<br><br>Dr Gordon Wells<br>Bioinformatics and Computational Biology Unit<br>Department of Biochemistry<br>University of Pretoria<br>
<br><br><div class="gmail_quote">On 11 October 2010 19:10, Gus Correa <span dir="ltr"><<a href="mailto:gus@ldeo.columbia.edu">gus@ldeo.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">Gordon Wells wrote:<br>
> Hi<br>
><br>
> The various /etc/hosts, nodes, server_name and config files seem to<br>
> be consistent. The nodes are indeed connected to the Internet; could<br>
> that be problematic?<br>
<br>
</div>Hi Gordon<br>
<br>
Yes, it could be, if the nodes are behind firewalls or have some iptables<br>
rule restricting the connections.<br>
A firewall may prevent Torque and MPI from working.<br>
Moreover, if you use the Internet addresses,<br>
the network traffic may hurt performance (MPI, I/O, etc).<br>
<br>
Here I (and most people) use a private subnet for this, say 192.168.1.0<br>
or 10.1.1.0, either one with netmask 255.255.255.0.<br>
Sometimes people use two private subnets, one for cluster control and I/O,<br>
another for MPI.<br>
Typical server motherboards come with two onboard Ethernet ports,<br>
but you can also plug Gigabit Ethernet NICs into available motherboard<br>
slots.<br>
You could buy cat5e cables and a new switch for this, or if your switch<br>
has VLAN capability and enough idle ports,<br>
you can create a virtual subnet on it.<br>
<br>
On each node you have to configure these new interfaces properly,<br>
either through DHCP or statically (quite easy: put the IP<br>
address and the netmask in<br>
/etc/sysconfig/network-scripts/ifcfg-eth1, assuming eth1<br>
is the private subnet interface ... oh well, this is for<br>
RHEL/CentOS/Fedora, it may be somewhat<br>
different in Debian/Ubuntu or SLES).<br>
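<br>
A rough sketch of the static case, in case it helps. The device name, address<br>
and netmask are only example values, and render_ifcfg is a made-up helper<br>
(not part of any tool), so please adapt it to your distribution:<br>
<pre>
```shell
#!/bin/sh
# Sketch only: print the contents of a static ifcfg file for the private
# cluster interface. render_ifcfg is a hypothetical helper; the device,
# address and netmask passed to it below are example values.
render_ifcfg() {
    # $1 = device name, $2 = IP address, $3 = netmask
    printf 'DEVICE=%s\n' "$1"
    printf 'BOOTPROTO=none\n'      # no DHCP: the address is set statically
    printf 'IPADDR=%s\n' "$2"
    printf 'NETMASK=%s\n' "$3"
    printf 'ONBOOT=yes\n'          # bring the interface up at boot
}

# Print to the terminal for review first.
render_ifcfg eth1 192.168.1.1 255.255.255.0
```
</pre>
It only prints the file contents. Once they look right for the node,<br>
they would go into /etc/sysconfig/network-scripts/ifcfg-eth1 on that node,<br>
with a different IPADDR on each node.<br>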
<br>
Then insert names for these interfaces and the associated IPs in the<br>
/etc/hosts files (same on all nodes).<br>
For instance:<br>
<br>
192.168.1.1 node01<br>
...<br>
<br>
The same names should also be used in the ${TORQUE}/server_priv/nodes file.<br>
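<br>
To keep the two files consistent, one option is to generate both from the<br>
same loop. This is only a sketch: the node count (4), the 192.168.1.x<br>
addresses, the nodeNN naming and np=2 (taken to match the ppn=2 used earlier<br>
in this thread) are assumptions, and gen_hosts / gen_nodes are made-up helper names:<br>
<pre>
```shell
#!/bin/sh
# Sketch only: emit matching /etc/hosts lines and Torque nodes-file lines.
# gen_hosts and gen_nodes are hypothetical helpers; the subnet, the node
# naming scheme and the counts are example values.

gen_hosts() {
    # $1 = number of nodes
    i=1
    while [ "$i" -le "$1" ]; do
        printf '192.168.1.%d node%02d\n' "$i" "$i"
        i=$((i + 1))
    done
}

gen_nodes() {
    # $1 = number of nodes, $2 = processors per node (np)
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'node%02d np=%d\n' "$i" "$2"
        i=$((i + 1))
    done
}

echo "# lines for /etc/hosts (same on all nodes)"
gen_hosts 4
echo "# lines for server_priv/nodes on the head node"
gen_nodes 4 2
```
</pre>
Both outputs use exactly the same host names, which is the point:<br>
the names in the nodes file must resolve through /etc/hosts on every node.<br>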
<br>
In any case, either using the Internet or a private subnet,<br>
you need to make sure the users can<br>
ssh passwordless across all pairs of nodes.<br>
Can you do this on all node pairs on your cluster?<br>
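<br>
A quick way to test every pair from the head node might look like the sketch<br>
below. The node names are placeholders and check_pairs is a made-up helper;<br>
BatchMode and ConnectTimeout are standard OpenSSH client options:<br>
<pre>
```shell
#!/bin/sh
# Sketch only: test passwordless ssh between all pairs of nodes.
# NODES holds placeholder names; check_pairs is a hypothetical helper.
NODES="node01 node02"
SSH=${SSH:-ssh}

check_pairs() {
    failed=0
    for src in $NODES; do
        for dst in $NODES; do
            # BatchMode=yes makes ssh fail instead of prompting for a
            # password, so a prompt shows up as a failure, not a hang.
            if ! $SSH -o BatchMode=yes -o ConnectTimeout=5 "$src" \
                 ssh -o BatchMode=yes -o ConnectTimeout=5 "$dst" true; then
                echo "FAILED: $src -> $dst"
                failed=1
            fi
        done
    done
    return $failed
}

# Usage (run as the user who submits the jobs, after editing NODES):
#   check_pairs && echo "all pairs OK"
```
</pre>
Run it as the regular user, not only as root, since Torque starts the job<br>
as the submitting user.<br>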
<br>
This can be done, for instance, by creating an ssh-rsa key pair,<br>
and putting a bunch of copies of the public key in<br>
/etc/ssh/ssh_known_hosts2 on all nodes,<br>
something like this:<br>
<br>
192.168.1.1,node01 ssh-rsa [the same ssh-rsa public key copy goes here]<br>
192.168.1.2,node02 ssh-rsa [the same ssh-rsa public key copy goes here]<br>
...<br>
<br>
However, you *don't want to do this with public IP addresses*,<br>
only with private ones.<br>
(Yet another issue with using the Internet for Torque and MPI.)<br>
<div class="im"><br>
I hope this helps,<br>
Gus Correa<br>
<br>
<br>
<br>
><br>
</div><div class="im">> As for 5), won't that require $PBS_NODEFILE to be correctly generated?<br>
><br>
> Regards<br>
> Gordon<br>
><br>
><br>
> On 8 October 2010 01:09, Gus Correa <<a href="mailto:gus@ldeo.columbia.edu">gus@ldeo.columbia.edu</a>> wrote:<br>
</div><div><div></div><div class="h5">
><br>
> Hi Gordon<br>
><br>
> Some guesses:<br>
><br>
> 1) Do you have mom daemons running on the nodes?<br>
> I.e. on the nodes, what is the output of "service pbs status" or<br>
> "service pbs_mom status"?<br>
><br>
> 2) Do your mom daemons on the nodes point to the server?<br>
> I.e. what is the content of $TORQUE/mom_priv/config?<br>
> Is it consistent with the server name in $TORQUE/server_name ?<br>
><br>
> 3) What is the content of your /etc/hosts file on the head node<br>
> and on each node?<br>
> Are they the same?<br>
> Are they consistent with your nodes file,<br>
> i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names<br>
> that have IP addresses listed in /etc/hosts)?<br>
><br>
> 4) Are you really using the Internet to connect the nodes,<br>
> as the fqdn names on your nodes file (sent in an old email) suggest?<br>
> (I can't find it, maybe you can post it again.)<br>
> Or are you using a private subnet?<br>
><br>
> 5) Did you try to run hostname via mpirun on all nodes?<br>
> I.e., something like this:<br>
><br>
> ...<br>
> #PBS -l nodes=8:ppn=2<br>
> ...<br>
> mpirun -np 16 hostname<br>
><br>
><br>
> I hope this helps,<br>
> Gus Correa<br>
><br>
> Gordon Wells wrote:<br>
> > I've tried that, unfortunately I never get a $PBS_NODEFILE that spans<br>
> > more than one node.<br>
> ><br>
> ><br>
> > On 7 October 2010 10:02, Vaibhav Pol <<a href="mailto:vaibhavp@cdac.in">vaibhavp@cdac.in</a>> wrote:<br>
</div></div><div><div></div><div class="h5">
> ><br>
> > Hi,<br>
> > you must set the server attribute as well as the queue attribute.<br>
> ><br>
> > set server resources_available.nodect = (number of nodes * cpus per node)<br>
> > set queue <queue name> resources_available.nodect = (number of nodes * cpus per node)<br>
> ><br>
> ><br>
> > Thanks and regards,<br>
> > Vaibhav Pol<br>
> > National PARAM Supercomputing Facility<br>
> > Centre for Development of Advanced Computing<br>
> > Ganeshkhind Road<br>
> > Pune University Campus<br>
> > PUNE-Maharastra<br>
> > Phone +91-20-25704176 ext: 176<br>
> > Cell Phone : +919850466409<br>
> ><br>
> ><br>
> ><br>
> > On Thu, 7 Oct 2010, Gordon Wells wrote:<br>
> ><br>
> > Hi<br>
> ><br>
> > I've now tried torque 2.5.2 as well, same problems.<br>
> > Setting resources_available.nodect has no effect except allowing me to use<br>
> > "-l nodes=x" with x > 14<br>
> ><br>
> > regards<br>
> ><br>
> ><br>
> > On 6 October 2010 20:04, Glen Beane <<a href="mailto:glen.beane@gmail.com">glen.beane@gmail.com</a>> wrote:<br>
> ><br>
> > On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells <<a href="mailto:gordon.wells@gmail.com">gordon.wells@gmail.com</a>> wrote:<br>
</div></div><div class="im">
> ><br>
> > Can I confirm that this will definitely fix the problem?<br>
> > Unfortunately this cluster also needs to be glite compatible; 2.3.6 seems<br>
> > to be the latest that will work<br>
> ><br>
> ><br>
> ><br>
> > i'm not certain... do you happen to have server<br>
> > resources_available.nodect set? I have seen bugs with PBS_NODEFILE<br>
> > contents when this server attribute is set. This may be a<br>
> > manifestation of this bug, and I'm not sure if it has been<br>
> > corrected.<br>
> ><br>
> > try unsetting this and submitting a job with -l nodes=X:ppn=Y<br>
> > _______________________________________________<br>
> > torqueusers mailing list<br>
> > <a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
</div><div class="im">> > <a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
> ><br>
> ><br>
> > --<br>
> > This message has been scanned for viruses and<br>
> > dangerous content by MailScanner, and is<br>
> > believed to be clean.<br>
> ><br>
> ><br>
</div><div><div></div><div class="h5">
<br>
_______________________________________________<br>
torqueusers mailing list<br>
<a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
</div></div></blockquote></div><br></div>