<HTML dir=ltr><HEAD><TITLE>Re: [torqueusers] pbs_mom unable to chdir to automounted dirs</TITLE>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.3429" name=GENERATOR></HEAD>
<BODY>
<DIV id=idOWAReplyText74587 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>I don't have a really useful suggestion, just a thought: could it simply be a timing issue, in that pbs_mom is trying to stat a file or the directory before it has been fully mounted? It may take a second to get the directory mounted if it wasn't already, and PBS may be too fast for the automounter, especially if the NFS server is under some sort of load and taking a little longer than normal to respond.</FONT></DIV>
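<DIV dir=ltr><FONT face=Arial size=2>One rough way to test the timing theory on the node would be to poll the directory for a few seconds and see whether it shows up late. This is only a sketch: the function name wait_for_dir and its arguments (directory, number of tries, delay in seconds) are made up for illustration and are not part of TORQUE or autofs. It also assumes that a plain directory test is enough to trigger the automount, which may not hold for every autofs configuration.</FONT></DIV>
<PRE>
```shell
# Hypothetical helper (not part of TORQUE or autofs): poll until a
# directory is visible, or give up after a fixed number of attempts.
#   $1 = directory to check
#   $2 = maximum number of attempts (default 5)
#   $3 = seconds to sleep between attempts (default 1)
# Returns 0 as soon as the directory exists, 1 if it never appears.
wait_for_dir() {
    dir=$1
    tries=${2:-5}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # The directory test is what should prompt the automounter to mount.
        if [ -d "$dir" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```
</PRE>
<DIV dir=ltr><FONT face=Arial size=2>For example, calling wait_for_dir /fs/userB1/mfitzpat 10 1 right after the mount has timed out would show whether the directory appears on a later attempt rather than the first one. If it always succeeds on the first try, timing is probably not the culprit.</FONT></DIV>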
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>--Joe</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> torqueusers-bounces@supercluster.org on behalf of Mary Ellen Fitzpatrick<BR><B>Sent:</B> Wed 10/22/2008 2:53 PM<BR><B>To:</B> Luke Scharf<BR><B>Cc:</B> torqueusers@supercluster.org<BR><B>Subject:</B> Re: [torqueusers] pbs_mom unable to chdir to automounted dirs<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>The node OS is CentOS5, as is the nfs server. The pbs server is running<BR>CentOS4.5. I have rebooted and chanted... :-) :-(<BR><BR>Here is my simple pbs script and it does not have absolute paths. The<BR>script will run only after the nfs dirs are somehow mounted on the<BR>node. I have tried it with absolute path names, and it makes no difference.<BR><BR>pbs script:<BR>#PBS -l nodes=node1048<BR># join stderr and stdout and write them to a file<BR>#PBS -j oe<BR>#PBS -o test3.o<BR> <BR># cd into the working directory<BR>cd $PBS_O_WORKDIR<BR># print out some diagnostic stuff<BR>echo Running on host `hostname`<BR>echo Directory is `pwd`<BR>echo Start time is `date`<BR># run my commands<BR><BR>./dostuff2.pl data.txt > test3.out1<BR><BR># print out some diagnostic stuff<BR>echo Stop time is `date`<BR><BR><BR><BR>Luke Scharf wrote:<BR>> If it works with the shell, however, the problem almost has to be with<BR>> something other than the automounter.<BR>><BR>> Are any absolute paths in the qsub script correct?<BR>><BR>> -Luke<BR>><BR>><BR>> Luke Scharf wrote:<BR>>> That looks happy, too.<BR>>><BR>>> What is the underlying OS running on the node?<BR>>><BR>>> Have you tried just rebooting everything while muttering laments<BR>>> about stray alpha-particles to everyone within earshot?<BR>>><BR>>> -Luke<BR>>><BR>>> Mary Ellen Fitzpatrick wrote:<BR>>>> Yeah, that is why I am stumped... because I can cd to nfs dirs,<BR>>>> seems like autofs is working correctly. But unless the nfs dir is<BR>>>> pre-mounted, pbs_mom cannot find it. 
Very strange...<BR>>>><BR>>>> Yes, getent passwd gives the correct home dir info<BR>>>> [root@node1048 mom_priv]# getent passwd<BR>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash<BR>>>><BR>>>> Luke Scharf wrote:<BR>>>>> Nothing that you mention looks amiss at first glance...<BR>>>>><BR>>>>><BR>>>>> Does the "getent passwd" information for the user have a correct<BR>>>>> home directory on the node?<BR>>>>><BR>>>>> -Luke<BR>>>>><BR>>>>><BR>>>>> Mary Ellen Fitzpatrick wrote:<BR>>>>>> Thanks Luke.<BR>>>>>> Right now, my cluster is one node, with an additional 50 to be<BR>>>>>> brought on-line once I resolve the automount problem. The job I<BR>>>>>> am running is very simple, no nfs load on server.<BR>>>>>><BR>>>>>> my $usecp I believe is correct and works properly after the nfs<BR>>>>>> dir is mounted.<BR>>>>>> $usecp *:/fs/userB1 /fs/userB1<BR>>>>>><BR>>>>>> My auto.home file:<BR>>>>>> userB1 -rw,hard,intr userB:/userB/u1<BR>>>>>><BR>>>>>> auto.master file:<BR>>>>>> #+auto.master<BR>>>>>> /fs /etc/auto.home<BR>>>>>><BR>>>>>> I believe it is an automount issue and I need to tweak a parameter<BR>>>>>> in a config file. Not sure which one it is at this point.<BR>>>>>><BR>>>>>><BR>>>>>> Luke Scharf wrote:<BR>>>>>>> Mary Ellen Fitzpatrick wrote:<BR>>>>>>>> I have my home dirs nfs exported to all of my compute nodes. I<BR>>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.<BR>>>>>>>> When I submit a job to a node and the automounted nfs dirs are<BR>>>>>>>> not mounted (timed out), I get the following error:<BR>>>>>>>><BR>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)<BR>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'<BR>>>>>>>> failed: No such file or directory<BR>>>>>>>><BR>>>>>>>> If I immediately resubmit the job to the same node, it will<BR>>>>>>>> run. It appears that pbs wants the automounted nfs dirs to be<BR>>>>>>>> already mounted, then the job will run. 
If I hard mount the nfs<BR>>>>>>>> home dirs, I have no problem running the jobs, but I do not want<BR>>>>>>>> to do that.<BR>>>>>>>><BR>>>>>>>> Has anyone run into this? Trying to figure out if it is a torque<BR>>>>>>>> issue or automount issue.<BR>>>>>>><BR>>>>>>> How big is your cluster? How capable is the NFS server? A<BR>>>>>>> job-start is likely to create a mountstorm, and generate a lot of<BR>>>>>>> I/O. Some servers can handle it, some can't.<BR>>>>>>><BR>>>>>>> Yay for scaling issues!<BR>>>>>>><BR>>>>>>> -Luke<BR>>>>>>><BR>>>>>>> P.S. I second the suggestion of checking the $usecp value.<BR>>>>>><BR>>>>><BR>>>><BR>>><BR>><BR><BR>--<BR>Thanks<BR>Mary Ellen<BR><BR>_______________________________________________<BR>torqueusers mailing list<BR>torqueusers@supercluster.org<BR><A href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</A><BR></FONT></P></DIV></BODY></HTML>