[Moabusers] Getting Torque to run on a diskless cluster - Communication with nodes failed

Gelonia L Dent gdent at amnh.org
Wed Mar 19 12:32:31 MDT 2008


Any suggestions or input?

I am trying to get my nodes to communicate with the host server. I have not
been successful, and I've tried just about everything. My latest
question/concern is:

Do I need a torque.cfg file in the /usr/spool/PBS directory?
According to the installation notes, the nodes file should be in
/usr/spool/PBS, but I have it in /var/spool/torque.

Is the default working directory for torque /usr/spool/PBS?

Many thanks,

--
Gelonia Dent, PhD
Manager of Scientific Computing
Invertebrate Zoology
The American Museum of Natural History
(212) 313-7911




> Gelonia,
>
> pbs_server does need to be running while Moab is running.  It does not
> start automatically when Moab is started; you will have to start it
> separately.  Make sure you are starting pbs_server and not pbs_sched.  You
> can verify that pbs_server is running by running pbsnodes -a.  It appears
> that TORQUE is properly passing information to Moab, so we may be closer.
>
> Regards,
> Nick
>
>> I stopped pbs_server and started Moab. Why does the message say that PBS
>> isn't running? Should it be running when Moab is started?
>>
>>
>> demeter:~#  mdiag -R -v
>> diagnosing resource managers
>>
>> RM[demeter]  State: Down
>>   Type:               PBS  ResourceType: COMPUTE
>>   Version:            '2.2.1'
>>   Objects Reported:   Nodes=127 (256 procs)  Jobs=0
>>   Flags:              executionServer
>>   Partition:          demeter
>>   Event Management:   EPORT=15004  (last event: 00:03:59)
>>   DefaultClass:       batch
>>   RM Performance:     AvgTime=0.00s  MaxTime=3.69s  (135933 samples)
>>   RM Languages:       PBS
>>   RM Sub-Languages:   -
>>
>> Message[0] cannot connect to PBS server '' (pbs_server may not be running)
>>
>> RM[demeter] Failures:
>>   clusterquery     (2704 of 35324 failed)
>>       -00:16:27  'cannot connect to PBS server '' (pbs_server may not be running)'
>>       -00:15:56  'cannot connect to PBS server '' (pbs_server may not be running)'
>>       -00:15:25  'cannot connect to PBS server '' (pbs_server may not be running)'
>>       -00:14:54  'cannot connect to PBS server '' (pbs_server may not be running)'
>>       -00:14:23  'cannot connect to PBS server '' (pbs_server may not be running)'
>>       -00:00:53  'End of File'
>>       -00:00:22  'cannot connect to PBS server '' (pbs_server may not be running)'
>>   queuequery       (22 of 32642 failed)
>>       -00:00:53  'cannot get queue info - no data available'
>>
>>
>> NOTE:  use 'mrmctl -f messages <RMID>' to clear stats/failures
>>
>> --
>> Gelonia Dent, PhD
>> Manager of Scientific Computing
>> Invertebrate Zoology
>> The American Museum of Natural History
>> (212) 313-7911
>>
>>
>>
>>
>>> Gelonia,
>>>
>>> If the permissions on the Torque directory are still giving you
>>> problems, I think you should start the TORQUE install from scratch.
>>> Here is a way that might be a little simpler.  Once you have done the
>>> TORQUE install on the headnode, run the torque.setup config script, and
>>> created the nodes file in /var/spool/torque/server_priv, verify that you
>>> can run pbs_server.  Once you verify that pbs_server runs without any
>>> errors, stop the pbs_server process.
>>>
>>> Now the nodes should still be able to see /usr/local, so they will be
>>> able to see the client commands and pbs_mom.  However, each image also
>>> needs its own /var/spool/torque/mom_priv and
>>> /var/spool/torque/mom_logs directories.  You can either create those
>>> directories in the image or have your script do so.  Check on the
>>> headnode to see the permissions of each directory.  In
>>> /var/spool/torque/mom_priv, create a file called config.  It needs to
>>> contain the following:
>>>
>>> $pbsserver demeter
>>>
>>> Now the pbs_mom on each node should start.  If you are having trouble
>>> with LD_LIBRARY_PATH, create a file named environment in /etc in each
>>> node image.  In the environment file add:
>>>
>>> LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
>>>
>>> Now verify that pbs_mom starts up on each node, then start pbs_server
>>> on the head node and run pbsnodes -a to see if all the nodes are up.
>>> If some aren't, try running the momctl -d3 command on a down node.  If
>>> that command doesn't work for some reason, run the command on the
>>> headnode adding -h nodename.  Send me those outputs.
>>>
>>> I noticed you submitted an email to the Moab users list.  I think you
>>> will get better help if you submit it to the TORQUE users list at
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>> I will call you in the morning at 11 am eastern so we can talk more.
>>>
>>> Regards,
>>> Nick
>>>
>>>
>>>
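The node-image steps Nick describes above (the mom_priv and mom_logs
directories plus the one-line config file) could be scripted roughly as
follows. This is only a sketch: prepare_mom_dirs and its arguments are
hypothetical names, the directory modes are assumptions that should be
compared with the headnode, and the demo writes into a scratch directory
rather than a real image.

```shell
#!/bin/sh
# Sketch only: create the per-node pbs_mom directories and config file
# inside a node image. prepare_mom_dirs and its arguments are hypothetical
# names, not part of TORQUE.
set -eu

prepare_mom_dirs() {
    image_root=$1      # root of the node image (hypothetical path)
    server_host=$2     # head node running pbs_server, e.g. demeter
    torque_home="$image_root/var/spool/torque"

    mkdir -p "$torque_home/mom_priv" "$torque_home/mom_logs"

    # Assumed modes; check the same directories on the headnode with
    # "ls -ld" and adjust these to match.
    chmod 755 "$torque_home" "$torque_home/mom_logs"
    chmod 751 "$torque_home/mom_priv"

    # mom_priv/config holds the one line that points pbs_mom at the server.
    printf '$pbsserver %s\n' "$server_host" > "$torque_home/mom_priv/config"
}

# Try it against a scratch directory instead of a real image.
scratch=$(mktemp -d)
prepare_mom_dirs "$scratch" demeter
cat "$scratch/var/spool/torque/mom_priv/config"
```

To use it for real, the first argument would be wherever the diskless image
is unpacked on the headnode, and the same call could be repeated per image.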
>>> Gelonia L Dent wrote:
>>>> demeter:~# pbs_mom
>>>> pbs_mom: Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque" - /var/spool/torque is world writable and not sticky
>>>> pbs_mom: Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque" - /var/spool/torque is world writable and not sticky
>>>>
>>>>
>>>> demeter:/var/spool# ls -l
>>>> total 12
>>>> drwxr-xr-x  5 root        root        4096 2005-03-24 17:33 cron
>>>> drwxr-x---  5 Debian-exim Debian-exim 4096 2006-11-07 06:25 exim4
>>>> lrwxrwxrwx  1 root        root           7 2006-06-09 00:44 mail -> ../mail
>>>> drwxrwxrwt 12 root        root         280 2008-03-13 15:31 torque
>>>> drwxr-xr-x 12 root        root        4096 2008-03-13 15:34 torque-bak
>>>>
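The error above complains about a directory that is "world writable and not
sticky". A small way to test a directory for that condition from its ls -ld
mode string is sketched below. check_dir_mode is a hypothetical helper, not
a TORQUE command, and it only approximates the idea behind the message; it
does not reproduce whatever else chk_file_sec checks.

```shell
#!/bin/sh
# Sketch only: classify a directory by its "other" write bit and sticky bit.
# check_dir_mode is a hypothetical helper name.
set -eu

check_dir_mode() {
    # First 10 characters of ls -ld are the type and mode, e.g. drwxrwxrwt.
    # Character 9 is the world-write bit; character 10 is t or T when the
    # sticky bit is set.
    mode=$(ls -ld "$1" | cut -c1-10)
    case "$mode" in
        d???????w[!tT]) echo "world writable and not sticky" ;;
        d???????w[tT])  echo "world writable but sticky" ;;
        d*)             echo "not world writable" ;;
        *)              echo "not a directory" ;;
    esac
}

# Demonstrate on a scratch directory with three different modes.
demo=$(mktemp -d)
chmod 0777 "$demo"; check_dir_mode "$demo"
chmod 1777 "$demo"; check_dir_mode "$demo"
chmod 0755 "$demo"; check_dir_mode "$demo"
rmdir "$demo"
```

Running check_dir_mode /var/spool/torque on the headnode and on a node
image would show whether the two agree.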
>>>
>>>
>>
>>
>>
>
>
>



