You guys rule!! Thank you so much!! I reinstalled everything and configured it according to your instructions, and it all went smoothly.<br><br>For the record, the Globus Toolkit tutorial is outdated, and it is better to follow the official TORQUE instructions: <a href="http://www.clusterresources.com/torquedocs21/index.shtml">http://www.clusterresources.com/torquedocs21/index.shtml</a><br>
<br>To compile under YellowDog 6.1 on a PlayStation 3, one must execute this:<br><br>./configure --disable-gcc-warnings CC="gcc -m64"<br><br>Apparently it suffers from the same problem as on Mac OS X, in that the compiler does not support -pedantic -Werror (whatever that is). This is just my guess, since the first time I compiled it I got a bunch of recursive errors regarding -pedantic -Werror.<br>
<br>I still have some questions, but those are out of pure curiosity.<br><br>What is -pedantic -Werror?<br><br>Does it make a big difference not having support for it?<br><br>When I executed "make packages" I got some shell scripts to install the packages. I used only "mom" and "clients", but I also got "devel", whose purpose is clear, and "server". Even without explicitly executing the server script, I got pbs_server installed on my node. So is torque-package-server-linux-powerpc64.sh a different kind of server? If so, what is its purpose?<br>
<br><div class="gmail_quote">2009/5/21 Samir Gartner <span dir="ltr"><<a href="mailto:jigzat@gmail.com">jigzat@gmail.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
PS: I got this warning after executing make packages, once for each of the packages:<br><br>libtool: install: warning: remember to run `libtool --finish /usr/local/lib'<br><br>Should I execute it?<div><div></div><div class="h5">
<br><br><div class="gmail_quote">
2009/5/21 Samir Gartner <span dir="ltr"><<a href="mailto:jigzat@gmail.com" target="_blank">jigzat@gmail.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Ok, I decided to reinstall everything and configure the system according to everyone's instructions and suggestions. But I have a question. The Globus Toolkit tutorial says to execute only the mom and clients shell scripts, but not the server one. <br>
<br>tar -zxf torque-2.0.0p7.tar.gz<br>
cd torque-2.0.0p7<br>
./configure --prefix=/opt/pbs<br>
make<br>
make install<br>
make packages<br>
./torque-package-clients-linux-i686.sh --install --destdir /opt/pbs<br>
./torque-package-mom-linux-i686.sh --install --destdir /opt/pbs<br><br>As rufian node is the only node with torque, shouldn't I also execute the server script?<div><div></div><div><br>
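A note on the --destdir /opt/pbs argument in the commands above, as a minimal sketch only. It assumes the package scripts treat --destdir as a staging root that is prepended to the already configured prefix, which is what the tpackages/server/opt/pbs/sbin/pbs_sched path quoted further down this thread suggests. The helper name staged_path is hypothetical and is not part of TORQUE.<br>

```python
import posixpath


def staged_path(destdir, installed_path):
    """Join a staging root and an absolute install path, DESTDIR-style.

    Assumption: the staging root is simply prepended to the full
    install path, as with the usual DESTDIR convention.
    """
    return posixpath.join(destdir, installed_path.lstrip("/"))


# Prefix /opt/pbs staged under /opt/pbs nests the tree inside itself.
print(staged_path("/opt/pbs", "/opt/pbs/sbin/pbs_sched"))
# Staging under / leaves the configured prefix untouched.
print(staged_path("/", "/opt/pbs/sbin/pbs_sched"))
```

If that assumption holds, it would account for a second /opt/pbs tree appearing under /opt/pbs.<br>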
<br><div class="gmail_quote">2009/5/21 Gus Correa <span dir="ltr"><<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi Samir<br>
<br>
As Jerry said, 127.0.0.1 is the IP address of the "loopback interface"<br>
(not a physical Ethernet port) on each computer.<br>
This is not to be confused with the IP address associated to the actual<br>
Ethernet port on the network you want to use for MPI communication.<br>
<br>
<br>
1) Looking at your hosts file and your question ("Is this wrong?"),<br>
I would suggest:<br>
<br>
A) Uncomment this line:<br>
<div><br>
#127.0.0.1 localhost.localdomain localhost<br>
<br>
</div>i.e, it should be:<br>
<div><br>
127.0.0.1 localhost.localdomain localhost<br>
<br>
</div>You need a loopback, but pointing to localhost.<br>
<br>
B) Change this line:<br>
<div><br>
127.0.0.1 rufian.perrera.local rufian<br>
<br>
</div>to something like this:<br>
<br>
192.168.2.1 rufian.perrera.local rufian<br>
<br>
assuming the IP address 192.168.2.1 is not in use on your (private) net<br>
192.168.2.0 (otherwise use another IP on the same net).<br>
<br>
C) Make sure this is consistent with whatever you have in<br>
/etc/sysconfig/network.<br>
(It seems to be OK, you only have the hostname, not the IP there.)<br>
<br>
D) Restart the network, or much easier, just reboot the computer.<br>
<br>
E) Make sure your other computers (auyin, pelusa, lamparita)<br>
have correct hosts files too, which should<br>
list all the computers on your 192.168.2.0 net consistently, and<br>
include the loopback interface as explained above.<br>
<br>
F) Each computer loops back to itself with the same special address<br>
127.0.0.1, as Jerry explained.<br>
The IP 127.0.0.1 cannot be used as a regular IP.<br>
If you read carefully the top lines on the hosts file you will see<br>
the message: "Do not remove the following line".<br>
Well, commenting it out has the same effect, replacing it by another<br>
hostname is even worse, as you may have noticed.<br>
<div><br>
>> # Do not remove the following line, or various programs<br>
>> # that require network functionality will fail.<br>
>> #127.0.0.1 localhost.localdomain localhost<br>
<br>
<br>
</div>***<br>
<br>
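As a quick way to check points A and B, here is a minimal sketch that scans the text of a hosts file and reports whether a given host name is mapped to a loopback address. The function name loopback_hostnames and the two sample strings are only illustrations, with the samples cut down from the hosts file quoted in this thread. It is not a Torque tool.<br>

```python
import ipaddress


def loopback_hostnames(hosts_text, hostname):
    """Return the loopback addresses that hosts_text maps hostname to."""
    hits = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loop = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not an IP address, skip the line
        if is_loop and hostname in names:
            hits.append(address)
    return hits


# Shortened copy of the hosts file as posted (hypothetical sample text).
BROKEN = """\
#127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
127.0.0.1 rufian.perrera.local rufian
192.168.2.6 auyin.perrera.local auyin
"""

# The same file after the changes suggested in A and B.
FIXED = """\
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
192.168.2.1 rufian.perrera.local rufian
192.168.2.6 auyin.perrera.local auyin
"""

print(loopback_hostnames(BROKEN, "rufian.perrera.local"))
print(loopback_hostnames(FIXED, "rufian.perrera.local"))
```

A non-empty result for the server's own name means it still points at the loopback interface. To check a real machine, pass in the contents of /etc/hosts instead of the sample strings.<br>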
2) As Jerry and I told you, copy the pbs_sched script from the contrib<br>
directory over to /etc/init.d, use chkconfig --add pbs_sched<br>
to make it come up when the machine boots, and start it manually this first<br>
time (/etc/init.d/pbs_sched start), or just reboot again.<br>
<br>
<br>
3) Don't worry about not having /etc/sysconfig/pbs_server or<br>
/etc/sysconfig/pbs_sched. They seem to be a legacy way to set up<br>
Torque/PBS. I don't have them either, and it works.<br>
<div><br>
I hope this helps,<br>
Gus Correa<br>
---------------------------------------------------------------------<br>
Gustavo Correa<br>
Lamont-Doherty Earth Observatory - Columbia University<br>
Palisades, NY, 10964-8000 - USA<br>
---------------------------------------------------------------------<br>
<br>
<br>
<br>
</div><div><div></div><div>Jerry Smith wrote:<br>
> 127.0.0.1 is a special address that references localhost.<br>
> <a href="http://en.wikipedia.org/wiki/Localhost" target="_blank">http://en.wikipedia.org/wiki/Localhost</a><br>
><br>
><br>
><br>
> 127.0.0.1 is not what you want for your hostname ( pbs_moms trying to<br>
> connect to 127.0.0.1 will try to talk to themselves)<br>
><br>
> You will want to setup an IP address on your pbs_server/scheduler node<br>
> that corresponds to the network that your pbs_moms are on.<br>
> And then make sure that the hostname you give it matches that of the<br>
> file in $PBS_HOME/server<br>
><br>
> Copying the init script to /etc/init.d is a start, you will then<br>
> probably need to turn it on by running :<br>
><br>
> To set it up to start on reboot:<br>
><br>
> chkconfig --add pbs_sched<br>
> and then<br>
> chkconfig pbs_sched on<br>
><br>
> To start it use /etc/init.d/pbs_sched start<br>
><br>
><br>
> --Jerry<br>
><br>
><br>
> Samir Gartner wrote:<br>
>> Ok Gus and everyone. Thanks again for your answers.<br>
>><br>
>> There is no pbs_sched on /etc/init.d but it is here:<br>
>><br>
>> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched<br>
>> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched<br>
>> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched<br>
>> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched<br>
>> /opt/pbs/sbin/pbs_sched<br>
>><br>
>> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is it<br>
>> right to do that?<br>
>><br>
>> Sorry about the "manually" word. It is local slang I guess. What I<br>
>> mean is that I went to the /opt/pbs/sbin/ folder and executed ./pbs_sched<br>
>><br>
>> hostname output is:<br>
>><br>
>> rufian.perrera.local<br>
>><br>
>> hosts file contains:<br>
>><br>
>> # Do not remove the following line, or various programs<br>
>> # that require network functionality will fail.<br>
>> #127.0.0.1 localhost.localdomain localhost <--------------------------Is this wrong?<br>
>> ::1 localhost6.localdomain6 localhost6<br>
>> 127.0.0.1 rufian.perrera.local rufian<br>
>> 192.168.2.6 auyin.perrera.local auyin<br>
>> 192.168.2.4 pelusa.perrera.local pelusa<br>
>> 192.168.2.2 lamparita.perrera.local lamparita<br>
>><br>
>><br>
>> network content is:<br>
>><br>
>> NETWORKING=yes<br>
>> HOSTNAME=rufian.perrera.local<br>
>> DOMAINNAME=perrera.local<br>
>><br>
>> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched either<br>
>><br>
>><br>
>> 2009/5/21 Gus Correa <<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>><br>
</div></div><div><div></div><div>>><br>
>> Samir Gartner wrote:<br>
>> > Ok, scheduling wasn't enabled,now it is,<br>
>><br>
>> It happens very often.<br>
>> Fixing it is a good first step.<br>
>><br>
>> > but pbs_sched service was not<br>
>> > found.<br>
>><br>
>> Starting up daemons in YDog may be different from RHEL, CentOS, Fedora,<br>
>> so I am just guessing based on the latter. I am not familiar with YDog.<br>
>> Anyway ...<br>
>><br>
>> Don't know if you got Torque from ClusterResources or other.<br>
>> In any case, there should be a pbs_sched script on /etc/init.d<br>
>> If it is there, do "chkconfig --add pbs_sched" (or YDog equivalent),<br>
>> then do "chkconfig --list pbs_sched" to see which runlevels it will be<br>
>> on, then "service pbs_sched start" to start it, or if YDog doesn't have<br>
>> "service", run it with "/etc/init.d/pbs_sched start".<br>
>><br>
>> If you don't have the pbs_sched script in /etc/init.d, you may find one<br>
>> in the contrib subdirectory of the Torque source tree.<br>
>> Copy it over to /etc/init.d, and do the above.<br>
>> (The location may be other than /etc/init.d in YDog.)<br>
>><br>
>><br>
>> > I didn't install maui, it is a default installation. About hosts<br>
>> > file, it is properly configured as well as nodes and mom's config files.<br>
>> ><br>
>><br>
>> You only need Maui if you want a complex scheduling policy.<br>
>> pbs_sched is FIFO, very simple, but works fine.<br>
>> I've used it for a long time without problems.<br>
>><br>
>> > when I manually start pbs_sched it says<br>
>> ><br>
>> > pbs_sched: addclient, host localhost not found<br>
>> ><br>
>><br>
>> Hmm ... never got this one, not that I remember.<br>
>> Not sure what you mean by "manually start pbs_sched".<br>
>> Anyway, sounds as another, different, problem.<br>
>><br>
>><br>
>> Is it possible that your "hostname" command<br>
>> is not resolving your server name to rufian.perrera.local but to<br>
>> localhost?<br>
>> What is the output of "hostname"?<br>
>> What do you have in /etc/hosts?<br>
>> What do you have in /etc/sysconfig/network?<br>
>><br>
>> Just in case you have /etc/sysconfig/pbs_server and<br>
>> /etc/sysconfig/pbs_sched, what is the contents?<br>
>> (I don't have them.)<br>
>><br>
>> (Again just guessing, YDog may have different files to start up things.)<br>
>><br>
>> I hope this helps,<br>
>> Gus Correa<br>
>> ---------------------------------------------------------------------<br>
>> Gustavo Correa<br>
>> Lamont-Doherty Earth Observatory - Columbia University<br>
>> Palisades, NY, 10964-8000 - USA<br>
>> ---------------------------------------------------------------------<br>
>><br>
>> ><br>
>> > 2009/5/21 Samir Gartner <<a href="mailto:jigzat@gmail.com" target="_blank">jigzat@gmail.com</a>><br>
</div></div><div>
>> ><br>
>> > I think I'm gonna cry.... I love you guys!! No, seriously, it worked,<br>
>> > but only when executed as the root user. Now the question is: what did I<br>
>> > do wrong? Jobs should start automatically, right?<br>
>> ><br>
>> > I was following the Globus Toolkit tutorial at first, but it is kind of<br>
>> > outdated, so I guess I issued some wrong instructions.<br>
>> ><br>
>> > One of the weird things was that the tutorial suggested using the<br>
>> > /opt/pbs prefix when executing configure, and now I have under<br>
>> > /opt/pbs another /opt/pbs folder with repeated bin and sbin folders<br>
>> > and executables. Is this wrong, or is this how it is supposed to be?<br>
>> ><br>
>> > 2009/5/21 Ling C. Ho <<a href="mailto:ling@fnal.gov" target="_blank">ling@fnal.gov</a>><br>
</div><div><div></div><div>>> ><br>
>> > Have you configured a scheduler?<br>
>> ><br>
>> > What if you use qrun. Would any job starts?<br>
>> ><br>
>> > ...<br>
>> > ling<br>
>> ><br>
>> > Samir Gartner wrote:<br>
>> ><br>
>> > Ok, I don't see any file named default_server, but<br>
>> > server_name has the right server name, rufian.perrera.local,<br>
>> > and there is another file with the same content named<br>
>> > server_name.new.<br>
>> ><br>
>> > Right now the PBS server name appears to be correct (after<br>
>> > stopping the server and manually deleting the zombie jobs),<br>
>> > but the jobs still won't start.<br>
>> ><br>
>> ><br>
>> > [samir@rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub<br>
>> > [samir@rufian ~]$ /opt/pbs/bin/qstat -a<br>
>> ><br>
>> > rufian.perrera.local:<br>
>> ><br>
>> > Req'd Req'd Elap<br>
>> > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time<br>
>> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----<br>
>> > 13.rufian.perrer samir batch STDIN -- 1 -- -- 01:00 Q --<br>
>> > [samir@rufian ~]$<br>
>> ><br>
>> ><br>
>> > By the way, is top posting allowed??<br>
>> ><br>
>> > 2009/5/21 Jerry Smith <<a href="mailto:jdsmit@sandia.gov" target="_blank">jdsmit@sandia.gov</a>><br>
</div></div><div><div></div><div>>> ><br>
>> ><br>
>> > Samir,<br>
>> ><br>
>> > What do you have in $PBS_HOME/{server_name,default_server}?<br>
>> ><br>
>> > It should be what resolves as the ethernet address that<br>
>> > pbs should be listening on.<br>
>> ><br>
>> > --Jerry<br>
>> ><br>
>> ><br>
>> ><br>
>> ><br>
>> > Samir Gartner wrote:<br>
>> ><br>
>> > Ok, I finally installed Torque under YellowDog/ppc, but now I have<br>
>> > another problem. I set up my pbs server as rufian.perrera.local,<br>
>> > but when I submit a job it shows up under localhost.localdomain<br>
>> > and stays in the queued state forever. And if I try to qdel the<br>
>> > job, it can't reach the server and the connection times out. Any<br>
>> > ideas of what could be wrong?<br>
>> > I'm not trying to set up anything complicated; it is just one<br>
>> > machine that works as server and client.<br>
>> ><br>
>> > This is the shell output:<br>
>> ><br>
>> > [root@rufian bin]# /opt/pbs/bin/qstat -a<br>
>> ><br>
>> > rufian.perrera.local:<br>
>> ><br>
>> > Req'd Req'd Elap<br>
>> > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time<br>
>> > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----<br>
>> > 7.localhost.loca samir batch STDIN -- 1 -- -- 01:00 Q --<br>
>> > 8.localhost.loca samir batch STDIN -- 1 -- -- 01:00 Q --<br>
>> > 9.localhost.loca samir batch STDIN -- 1 -- -- 01:00 Q --<br>
>> > 10.localhost.loc samir batch STDIN -- 1 -- -- 01:00 Q --<br>
>> > [root@rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain<br>
>> > Connection timed out<br>
>> > qdel: cannot connect to server localhost.localdomain (errno=110)<br>
>> > Connection timed out<br>
>> > You have new mail in /var/spool/mail/root<br>
>> > [root@rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local<br>
>> > qdel: Unknown Job Id 7.rufian.perrera.local<br>
>> > [root@rufian bin]# su - samir<br>
>> > [samir@rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain<br>
>> > Connection timed out<br>
>> > qdel: cannot connect to server localhost.localdomain (errno=110)<br>
>> > Connection timed out<br>
>> > [samir@rufian ~]$<br>
>> ><br>
>> ><br>
>> ><br>
>> ><br>
>> ><br>
>> ------------------------------------------------------------------------<br>
>> ><br>
>> > _______________________________________________<br>
>> > torqueusers mailing list<br>
>> > <a href="mailto:torqueusers@supercluster.org" target="_blank">torqueusers@supercluster.org</a><br>
</div></div><div>
>> > <a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
</div><div><div></div><div>
</div></div></blockquote></div><br>
</div></div></blockquote></div><br>
</div></div></blockquote></div><br>