[Moabusers] Getting Torque to run on a diskless cluster -
Communication with nodes failed
Gelonia L Dent
gdent at amnh.org
Wed Mar 19 12:32:31 MDT 2008
Any suggestions or input?
I am trying to get my nodes to communicate with the host server. I have not
been successful, and I've tried just about everything. My latest question is:
do I need a torque.cfg file in the /usr/spool/PBS directory?
According to the installation notes, the nodes file should be in
/usr/spool/PBS, but I have it in /var/spool/torque.
Is the default working directory for Torque /usr/spool/PBS?
Many thanks,
--
Gelonia Dent, PhD
Manager of Scientific Computing
Invertebrate Zoology
The American Museum of Natural History
(212) 313-7911
> Gelonia,
>
> pbs_server does need to be running while Moab is running. It is not started
> automatically when Moab starts, so you will have to start it separately.
> Make sure you are starting pbs_server and not pbs_sched. You can verify
> that pbs_server is running by running pbsnodes -a. It appears that TORQUE
> is passing information to Moab properly, so we may be getting closer.
>
> Regards,
> Nick
>
>> I stopped pbs_server and started Moab. Why does the message say that PBS
>> isn't running? Should it be running while Moab is started?
>>
>>
>> demeter:~# mdiag -R -v
>> diagnosing resource managers
>>
>> RM[demeter] State: Down
>> Type: PBS ResourceType: COMPUTE
>> Version: '2.2.1'
>> Objects Reported: Nodes=127 (256 procs) Jobs=0
>> Flags: executionServer
>> Partition: demeter
>> Event Management: EPORT=15004 (last event: 00:03:59)
>> DefaultClass: batch
>> RM Performance: AvgTime=0.00s MaxTime=3.69s (135933 samples)
>> RM Languages: PBS
>> RM Sub-Languages: -
>>
>> Message[0] cannot connect to PBS server '' (pbs_server may not be running)
>>
>> RM[demeter] Failures:
>> clusterquery (2704 of 35324 failed)
>> -00:16:27 'cannot connect to PBS server '' (pbs_server may not be running)'
>> -00:15:56 'cannot connect to PBS server '' (pbs_server may not be running)'
>> -00:15:25 'cannot connect to PBS server '' (pbs_server may not be running)'
>> -00:14:54 'cannot connect to PBS server '' (pbs_server may not be running)'
>> -00:14:23 'cannot connect to PBS server '' (pbs_server may not be running)'
>> -00:00:53 'End of File'
>> -00:00:22 'cannot connect to PBS server '' (pbs_server may not be running)'
>> queuequery (22 of 32642 failed)
>> -00:00:53 'cannot get queue info - no data available'
>>
>>
>> NOTE: use 'mrmctl -f messages <RMID>' to clear stats/failures
>>
>> --
>> Gelonia Dent, PhD
>> Manager of Scientific Computing
>> Invertebrate Zoology
>> The American Museum of Natural History
>> (212) 313-7911
>>
>>
>>
>>
>>> Gelonia,
>>>
>>> If the permissions on the TORQUE directory are still giving you
>>> problems, I think you should start the TORQUE install from scratch.
>>> Here is a way that might be a little simpler. Once you have done the
>>> TORQUE install on the head node, run the torque.setup config script, and
>>> created the nodes file in /var/spool/torque/server_priv, verify that you
>>> can run pbs_server. Once you have verified that pbs_server runs without
>>> any errors, stop the pbs_server process.
>>>
>>> Now the nodes should still be able to see /usr/local, so they will be
>>> able to see the client commands and pbs_mom. However, each image also
>>> needs its own /var/spool/torque/mom_priv and /var/spool/torque/mom_logs
>>> directories. You can either create those directories in the image or
>>> have your script do so. Check the head node to see what permissions each
>>> directory should have. In /var/spool/torque/mom_priv, create a file
>>> called config. In it you will need the following line:
>>>
>>> $pbsserver demeter
>>>
>>> Now the pbs_mom on each node should start. If you are having trouble
>>> with LD_LIBRARY_PATH, create a file named environment in /etc in each
>>> node image. In the environment file add:
>>>
>>> LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
>>>
>>> Now verify that pbs_mom starts up on each node, then start pbs_server
>>> on the head node and run pbsnodes -a to see if all the nodes are up. If
>>> some aren't, try running the momctl -d3 command on a down node. If
>>> that command doesn't work for some reason, run the command on the
>>> head node, adding -h nodename. Send me those outputs.
>>>
>>> I noticed you submitted an email to the Moab users list. I think you
>>> will get better help if you submit it to the TORQUE users list at
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>> I will call you in the morning at 11 am eastern so we can talk more.
>>>
>>> Regards,
>>> Nick
>>>
>>>
>>>
>>> Gelonia L Dent wrote:
>>>> demeter:~# pbs_mom
>>>> pbs_mom: Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque" - /var/spool/torque is world writable and not sticky
>>>> pbs_mom: Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque" - /var/spool/torque is world writable and not sticky
>>>>
>>>>
>>>> demeter:/var/spool# ls -l
>>>> total 12
>>>> drwxr-xr-x 5 root root 4096 2005-03-24 17:33 cron
>>>> drwxr-x--- 5 Debian-exim Debian-exim 4096 2006-11-07 06:25 exim4
>>>> lrwxrwxrwx 1 root root 7 2006-06-09 00:44 mail -> ../mail
>>>> drwxrwxrwt 12 root root 280 2008-03-13 15:31 torque
>>>> drwxr-xr-x 12 root root 4096 2008-03-13 15:34 torque-bak
>>>>
>>>
>>>
>>
>>
>>
>
>
>
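The per-node setup Nick describes in the quoted message (a mom_priv and a
mom_logs directory in each image, plus a one-line config file naming the
server) could be scripted roughly as follows. This is only a sketch: the
function name prepare_mom_dirs and its two arguments are made up for
illustration and are not part of TORQUE, and the chmod modes are
placeholders that should be replaced with whatever the head node actually
uses.

```shell
#!/bin/sh
# Sketch only. prepare_mom_dirs is a hypothetical helper, not a TORQUE tool.
#   $1  root of the node image (assumed to be a writable directory tree)
#   $2  name of the host running pbs_server ("demeter" in this thread)
prepare_mom_dirs() {
    image_root=$1
    server_name=$2
    torque_home="$image_root/var/spool/torque"

    # Each image needs its own mom_priv and mom_logs directories.
    mkdir -p "$torque_home/mom_priv" "$torque_home/mom_logs" || return 1

    # The error quoted in the thread shows pbs_mom rejecting a directory
    # that is world writable and not sticky. The modes below are
    # assumptions; copy the real ones from the head node.
    chmod 755 "$torque_home" "$torque_home/mom_logs" || return 1
    chmod 751 "$torque_home/mom_priv" || return 1

    # Write the single config line. The single quotes keep the shell from
    # expanding $pbsserver, which is a literal keyword in the file.
    printf '%s %s\n' '$pbsserver' "$server_name" \
        > "$torque_home/mom_priv/config"
}
```

It would be called once per image, for example
prepare_mom_dirs /path/to/node/image demeter, where the image path is a
placeholder. After that, pbs_mom can be started on the node and checked
from the head node with pbsnodes -a, as described in the quoted message.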