[torqueusers] wrong pbs server name

Gus Correa gus at ldeo.columbia.edu
Thu May 21 15:56:01 MDT 2009


Hi Samir

As Jerry said, 127.0.0.1 is the IP address of the "loopback interface"
(not a physical Ethernet port) on each computer.
This is not to be confused with the IP address associated with the actual
Ethernet port on the network you want to use for MPI communication.
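
A minimal sketch of how to tell the two kinds of address apart.
The helper name is_loopback is hypothetical (it is not part of Torque);
it relies only on the fact that the whole 127.0.0.0/8 block is reserved
for loopback:

```shell
#!/bin/sh
# is_loopback ADDR: succeed if ADDR is an IPv4 loopback address (127.x.x.x).
# Hypothetical helper for illustration; it only looks at the address text.
is_loopback() {
    case $1 in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, assuming getent is available on your system, something like

    addr=$(getent hosts "$(hostname)" | awk '{ print $1; exit }')
    is_loopback "$addr" && echo "$(hostname) resolves to loopback"

would flag a server name that points at the loopback interface.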


1) Looking at your hosts file and your question ("Is this wrong?"),
I would suggest:

A) Uncomment this line:

#127.0.0.1              localhost.localdomain localhost

i.e., it should be:

127.0.0.1              localhost.localdomain localhost

You need a loopback entry, but it must point to localhost.

B) Change this line:

127.0.0.1 rufian.perrera.local rufian

to something like this:

192.168.2.1 rufian.perrera.local rufian

assuming the IP address 192.168.2.1 is not in use on your (private) net
192.168.2.0 (otherwise use another IP on the same net).

C) Make sure this is consistent with whatever you have in 
/etc/sysconfig/network.
(It seems to be OK, you only have the hostname, not the IP there.)

D) Restart the network, or much easier, just reboot the computer.

E) Make sure your other computers (auyin, pelusa, lamparita)
have correct hosts files too. Each file should list all the computers
on your 192.168.2.0 net in a consistent way, and include the loopback
entry as explained above.

F) Each computer loops back to itself with the same special address
127.0.0.1, as Jerry explained.
The IP 127.0.0.1 cannot be used as a regular IP.
If you read the top lines of the hosts file carefully, you will see
the message: "Do not remove the following line".
Well, commenting it out has the same effect as removing it, and replacing
it with another hostname is even worse, as you may have noticed.

 >> # Do not remove the following line, or various programs
 >> # that require network functionality will fail.
 >> #127.0.0.1              localhost.localdomain localhost
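
The check behind items A and B can be sketched as a small script.
This is only an illustration: check_hosts is a hypothetical helper name,
and it assumes that any name not starting with "localhost" on a
127.x.x.x line is a mistake.

```shell
#!/bin/sh
# check_hosts [FILE]: print every non-localhost name bound to a 127.x.x.x
# address in FILE (default /etc/hosts). Hypothetical helper, not part of
# Torque; prints one "ADDRESS NAME" pair per offending name.
check_hosts() {
    awk '
        /^[ \t]*#/ { next }                  # skip commented-out lines
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i !~ /^localhost/) print $1, $i
            }
        }
    ' "${1:-/etc/hosts}"
}
```

Run against the hosts file you posted, check_hosts should report
"127.0.0.1 rufian.perrera.local" and "127.0.0.1 rufian". After the
changes in A and B it should print nothing.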


***

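2) As Jerry and I told you, copy the pbs_sched init script from the contrib
directory of the Torque source tree
(/usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched in your listing,
not the binary in /opt/pbs/sbin) to /etc/init.d. Then use
"chkconfig --add pbs_sched" so that it comes up when the machine boots,
and start it manually this first time (/etc/init.d/pbs_sched start),
or just reboot again.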
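
Put together, the steps might look like the sketch below. The function
name install_pbs_sched and the RUN switch are hypothetical, and the
chmod line is an assumption (the copied script may already be
executable). By default the commands are only printed; to execute them,
run as root with RUN set to the empty string. It may also be worth
checking that the paths inside the contrib script match your /opt/pbs
prefix.

```shell
#!/bin/sh
# RUN is a hypothetical dry-run switch: "echo" (the default) prints each
# command instead of running it; set RUN= (empty) as root to execute.
RUN=${RUN-echo}

# install_pbs_sched SRC_TREE: install and enable the pbs_sched init script
# found under SRC_TREE/contrib/init.d (e.g. /usr/local/src/torque-2.3.6).
install_pbs_sched() {
    src=$1/contrib/init.d/pbs_sched
    $RUN cp "$src" /etc/init.d/pbs_sched
    $RUN chmod 755 /etc/init.d/pbs_sched   # assumption: make it executable
    $RUN chkconfig --add pbs_sched         # register the service
    $RUN chkconfig pbs_sched on            # enable it at boot
    $RUN /etc/init.d/pbs_sched start       # start it now, without rebooting
}
```

Calling "install_pbs_sched /usr/local/src/torque-2.3.6" with the default
RUN lists the five commands so you can review them first.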


3) Don't worry about not having /etc/sysconfig/pbs_server or 
/etc/sysconfig/pbs_sched.  They seem to be a legacy way to set up 
Torque/PBS.  I don't have them either, and it works.

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------



Jerry Smith wrote:
> 127.0.0.1 is a special address that references localhost.
> http://en.wikipedia.org/wiki/Localhost
> 
> 
> 
> 127.0.0.1  is not what you want for your hostname ( pbs_moms trying to 
> connect to 127.0.0.1 will try to talk to themselves)
> 
> You will want to set up an IP address on your pbs_server/scheduler node 
> that corresponds to the network that your pbs_moms are on.
> And then make sure that the hostname you give it matches that of the 
> file in $PBS_HOME/server
> 
> Copying the init script to /etc/init.d is a start; you will then 
> probably need to turn it on.
> 
> To set it up to start on reboot:
> 
> chkconfig --add pbs_sched
> and then
> chkconfig pbs_sched on
> 
> To start it use /etc/init.d/pbs_sched start
> 
> 
> --Jerry
> 
> 
> Samir Gartner wrote:
>> Ok Gus and everyone. Thanks again for your answers.
>>
>> There is no pbs_sched on /etc/init.d but it is here:
>>
>> /usr/local/src/torque-2.3.6/contrib/init.d/pbs_sched
>> /usr/local/src/torque-2.3.6/tpackages/server/opt/pbs/sbin/pbs_sched
>> /usr/local/src/torque-2.3.6/src/scheduler.cc/.libs/pbs_sched
>> /usr/local/src/torque-2.3.6/src/scheduler.cc/pbs_sched
>> /opt/pbs/sbin/pbs_sched
>>
>> I was thinking of copying /opt/pbs/sbin/pbs_sched to /etc/init.d. Is it 
>> right to do that?
>>
>> Sorry about the "manually" word. It is local slang I guess. What I 
>> mean is that I went to the /opt/pbs/sbin/ folder and executed ./pbs_sched
>>
>> hostname output is:
>>
>> rufian.perrera.local
>>
>> hosts file contains:
>>
>> # Do not remove the following line, or various programs
>> # that require network functionality will fail.
>> #127.0.0.1              localhost.localdomain localhost    
>> <--------------------------Is this wrong?
>> ::1             localhost6.localdomain6 localhost6
>> 127.0.0.1 rufian.perrera.local rufian
>> 192.168.2.6 auyin.perrera.local auyin
>> 192.168.2.4 pelusa.perrera.local pelusa
>> 192.168.2.2 lamparita.perrera.local lamparita
>>
>>
>> network content is:
>>
>> NETWORKING=yes
>> HOSTNAME=rufian.perrera.local
>> DOMAINNAME=perrera.local
>>
>> I don't have /etc/sysconfig/pbs_server or /etc/sysconfig/pbs_sched either
>>
>>
>> 2009/5/21 Gus Correa <gus at ldeo.columbia.edu>
>>
>>     Samir Gartner wrote:
>>     > Ok, scheduling wasn't enabled, now it is,
>>
>>     It happens very often.
>>     Fixing it is a good first step.
>>
>>     > but pbs_sched service was not
>>     > found.
>>
>>     Starting up daemons in YDog may be different from RHEL, CentOS, Fedora,
>>     so I am just guessing based on the latter. I'm not familiar with YDog.
>>     Anyway ...
>>
>>     Don't know if you got Torque from ClusterResources or other.
>>     In any case, there should be a pbs_sched script in /etc/init.d.
>>     If it is there, do "chkconfig --add pbs_sched" (or YDog equivalent),
>>     then do "chkconfig --list pbs_sched" to see which runlevels it will be
>>     on, then "service pbs_sched start" to start it, or if YDog doesn't have
>>     "service", run it with "/etc/init.d/pbs_sched start".
>>
>>     If you don't have the pbs_sched script in /etc/init.d, you may find one
>>     in the contrib subdirectory of the Torque source tree.
>>     Copy it over to /etc/init.d, and do the above.
>>     (The location may be other than /etc/init.d in YDog.)
>>
>>
>>     > I didn't install maui, it is a default installation. About hosts
>>     > file, it is properly configured as well as nodes and mom's config
>>     > files.
>>     >
>>
>>     You only need Maui if you want a complex scheduling policy.
>>     pbs_sched is FIFO, very simple, but works fine.
>>     I've used it for a long time without problems.
>>
>>     > when I manually start pbs_sched it says
>>     >
>>     > pbs_sched: addclient, host localhost not found
>>     >
>>
>>     Hmm ... never got this one, not that I remember.
>>     Not sure what you mean by "manually start pbs_sched".
>>     Anyway, it sounds like another, different problem.
>>
>>
>>     Is it possible that your "hostname" command
>>     is not resolving your server name to rufian.perrera.local but to
>>     localhost?
>>     What is the output of "hostname"?
>>     What do you have in /etc/hosts?
>>     What do you have in /etc/sysconfig/network?
>>
>>     Just in case you have /etc/sysconfig/pbs_server and
>>     /etc/sysconfig/pbs_sched, what are their contents?
>>     (I don't have them.)
>>
>>     (Again just guessing, YDog may have different files to start up things.)
>>
>>     I hope this helps,
>>     Gus Correa
>>     ---------------------------------------------------------------------
>>     Gustavo Correa
>>     Lamont-Doherty Earth Observatory - Columbia University
>>     Palisades, NY, 10964-8000 - USA
>>     ---------------------------------------------------------------------
>>
>>     >
>>     > 2009/5/21 Samir Gartner <jigzat at gmail.com>
>>     >
>>     >     I think I'm gonna cry.... I love you guys!! No, seriously, it worked
>>     >     but only if executed under root user, now the question is what did I
>>     >     do wrong? Jobs should start automatically, right?
>>     >
>>     >     I was following first the Globus Toolkit tutorial but it is kinda
>>     >     outdated so I guess I issued some wrong instructions.
>>     >
>>     >     One of the weird things was that the tutorial suggested using the
>>     >     /opt/pbs prefix when executing configure and now I have under
>>     >     /opt/pbs again a /opt/pbs folder with repeated bin and sbin folders
>>     >     and executables. Is this wrong or is how it is supposed to be?
>>     >
>>     >     2009/5/21 Ling C. Ho <ling at fnal.gov>
>>     >
>>     >         Have you configured a scheduler?
>>     >
>>     >         What if you use qrun? Would any job start?
>>     >
>>     >         ...
>>     >         ling
>>     >
>>     >         Samir Gartner wrote:
>>     >
>>     >             Ok, I don't see any file named default_server but
>>     >             server_name has the right server name rufian.perrera.local
>>     >             and there is another file with the same content named
>>     >             server_name.new.
>>     >
>>     >             Right now the PBS server name appears to be correct (after
>>     >             stopping the server and manually deleting the zombie jobs)
>>     >             but still the jobs won't start.
>>     >
>>     >
>>     >             [samir at rufian ~]$ echo "sleep 30;date" | /opt/pbs/bin/qsub
>>     >             [samir at rufian ~]$ /opt/pbs/bin/qstat -a
>>     >
>>     >             rufian.perrera.local:
>>     >
>>     >                                                                                      Req'd  Req'd   Elap
>>     >             Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>>     >             -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>>     >             13.rufian.perrer     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>     >             [samir at rufian ~]$
>>     >
>>     >
>>     >             by the way, is top posting allowed?
>>     >
>>     >             2009/5/21 Jerry Smith <jdsmit at sandia.gov>
>>     >
>>     >
>>     >                Samir,
>>     >
>>     >                What do you have in $PBS_HOME/{server_name,default_server}?
>>     >
>>     >                It should be what resolves as the ethernet address that
>>     >                pbs should be listening on.
>>     >
>>     >                --Jerry
>>     >
>>     >
>>     >
>>     >
>>     >                Samir Gartner wrote:
>>     >
>>     >                    Ok I finally installed torque under yellowdog/ppc but
>>     >                    now I have another problem. I set up my pbs server as
>>     >                    rufian.perrera.local but when I issue a job it shows
>>     >                    itself in localhost.localdomain and it stays in queued
>>     >                    state forever. And if I try to qdel the job it can't
>>     >                    reach the server and the connection times out. Any
>>     >                    ideas of what could be wrong?
>>     >                    I'm not trying to set up anything complicated, it is
>>     >                    just one machine that works as server and client.
>>     >
>>     >                    this is the shell output
>>     >
>>     >                    [root at rufian bin]# /opt/pbs/bin/qstat -a
>>     >
>>     >                    rufian.perrera.local:
>>     >
>>     >                                                                                             Req'd  Req'd   Elap
>>     >                    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
>>     >                    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
>>     >                    7.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>     >                    8.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>     >                    9.localhost.loca     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>     >                    10.localhost.loc     samir    batch    STDIN               --      1  --    --  01:00 Q   --
>>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.localhost.localdomain
>>     >                    Connection timed out
>>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>>     >                    Connection timed out
>>     >                    You have new mail in /var/spool/mail/root
>>     >                    [root at rufian bin]# /opt/pbs/bin/qdel 7.rufian.perrera.local
>>     >                    qdel: Unknown Job Id 7.rufian.perrera.local
>>     >                    [root at rufian bin]# su - samir
>>     >                    [samir at rufian ~]$ /opt/pbs/bin/qdel 7.localhost.localdomain
>>     >                    Connection timed out
>>     >                    qdel: cannot connect to server localhost.localdomain (errno=110)
>>     >                    Connection timed out
>>     >                    [samir at rufian ~]$
>>     >
>>     >
>>     >
>>     >
>>     >            
>>     ------------------------------------------------------------------------
>>     >
>>     >             _______________________________________________
>>     >             torqueusers mailing list
>>     >             torqueusers at supercluster.org
>>     >             http://www.supercluster.org/mailman/listinfo/torqueusers
>>     >
>>
>>


