<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML dir=ltr><HEAD><TITLE>Re: [torqueusers] pbs_mom unable to chdir to automounted dirs --- RESOLVED</TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3354" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=775184219-23102008><FONT face=Arial
color=#0000ff size=2>Another thing to look at is potentially upping your
"retrans" and/or "timeo" options on the nfs clients.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=775184219-23102008><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=775184219-23102008> <FONT
face=Arial color=#0000ff size=2>Stewart</FONT></SPAN></DIV><BR>
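<DIV dir=ltr align=left><FONT face=Arial size=2>For anyone wanting to try this with the auto.home entry quoted further down the thread, here is a minimal sketch. The helper name add_nfs_retry_opts and the values 600 and 5 are illustrative assumptions, not tested recommendations. "timeo" is given in tenths of a second, and "retrans" is the number of retries before the client takes further recovery action.</FONT></DIV>

```shell
#!/bin/sh
# Hypothetical helper (not part of autofs or TORQUE): append timeo/retrans
# to a comma-separated NFS option string unless they are already set.
add_nfs_retry_opts() {
    opts=$1
    timeo=$2
    retrans=$3
    case ",$opts," in
        *,timeo=*) ;;                        # keep an existing timeo
        *) opts="$opts,timeo=$timeo" ;;
    esac
    case ",$opts," in
        *,retrans=*) ;;                      # keep an existing retrans
        *) opts="$opts,retrans=$retrans" ;;
    esac
    printf '%s\n' "$opts"
}

# Build an auto.home style map line from the entry quoted in this thread.
# 600 (60 seconds) and 5 are example values only.
printf 'userB1 -%s userB:/userB/u1\n' "$(add_nfs_retry_opts rw,hard,intr 600 5)"
```

<DIV dir=ltr align=left><FONT face=Arial size=2>The printed line is only a candidate map entry. It would still need to be placed in the map file by hand and the automounter reloaded before it takes effect.</FONT></DIV>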
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> torqueusers-bounces@supercluster.org
[mailto:torqueusers-bounces@supercluster.org] <B>On Behalf Of </B>Greenseid,
Joseph M.<BR><B>Sent:</B> Thursday, October 23, 2008 3:32 PM<BR><B>To:</B> Mary
Ellen Fitzpatrick<BR><B>Cc:</B> torqueusers@supercluster.org<BR><B>Subject:</B>
RE: [torqueusers] pbs_mom unable to chdir to automounted dirs
---RESOLVED<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV id=idOWAReplyText24360 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>mary ellen,</FONT></DIV>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>glad to hear your environment is
working. </FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>another random thought -- </FONT><FONT
face=Arial color=#000000 size=2>when you were using soft mounts, how many nfs
daemons were you running? did you up the number of daemons from the
default to see if that caused the server to respond faster?</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>--Joe</FONT></DIV></DIV>
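<DIV dir=ltr><FONT face=Arial size=2>A rough sketch of how the daemon count could be checked on the NFS server: on Linux servers with the nfsd filesystem mounted, the number of running threads is readable from /proc/fs/nfsd/threads. The function name nfsd_thread_count is hypothetical, and the optional file argument is there only so the function can be tried against a test file.</FONT></DIV>

```shell
#!/bin/sh
# Hypothetical helper: print the number of running nfsd threads.
# Assumes a Linux NFS server exposing /proc/fs/nfsd/threads.
nfsd_thread_count() {
    threads_file=${1:-/proc/fs/nfsd/threads}
    if [ -r "$threads_file" ]; then
        cat "$threads_file"
    else
        # Not an NFS server, or the nfsd filesystem is not mounted.
        echo "unknown"
    fi
}

echo "nfsd threads: $(nfsd_thread_count)"
```

<DIV dir=ltr><FONT face=Arial size=2>On CentOS the count is usually set through RPCNFSDCOUNT in /etc/sysconfig/nfs, followed by a restart of the nfs service.</FONT></DIV>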
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Mary Ellen Fitzpatrick
[mailto:mfitzpat@bu.edu]<BR><B>Sent:</B> Thu 10/23/2008 1:32 PM<BR><B>To:</B>
Garrick<BR><B>Cc:</B> Greenseid, Joseph M.;
torqueusers@supercluster.org<BR><B>Subject:</B> Re: [torqueusers] pbs_mom unable
to chdir to automounted dirs --- RESOLVED<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Good news...hard mounting the nfs dirs on the compute nodes
worked. A<BR>couple glitches along the way, but I got it to work.<BR>had
to turn off autofs on all nodes.<BR><BR>Thanks to everyone for your
input/suggestions.<BR>Mary Ellen<BR><BR>Mary Ellen Fitzpatrick wrote:<BR>>
Thanks. I am testing the hard nfs mounts on the compute nodes now
and<BR>> will hopefully post my successful results.<BR>><BR>> Garrick
wrote:<BR>> <BR>>> You *want* the nodes to hang while nfs
server is rebooting. The<BR>>> alternative is to have all apps
exit.<BR>>><BR>>> HPCC/Linux Systems Admin<BR>>><BR>>>
On Oct 23, 2008, at 7:13 AM, Mary Ellen Fitzpatrick
<mfitzpat@bu.edu><BR>>>
wrote:<BR>>><BR>>> <BR>>>> The
reason I do not want to hard mount, is that if I reboot my nfs<BR>>>>
server then it will hang, as all of the nodes will be hard mounted
to<BR>>>> that server. And I believe it is less load on the nfs server
if<BR>>>> automount serves only the dirs requested instead of all nfs
dirs.<BR>>>><BR>>>> Greenseid, Joseph M.
wrote:<BR>>>> <BR>>>>>
Why don't you want to hard mount NFS directories on the
compute<BR>>>>> nodes? What problem is this going to cause
you?<BR>>>>> --Joe<BR>>>>><BR>>>>>
________________________________<BR>>>>><BR>>>>> From:
Mary Ellen Fitzpatrick [<A
href="mailto:mfitzpat@bu.edu">mailto:mfitzpat@bu.edu</A>]<BR>>>>>
Sent: Wed 10/22/2008 3:20 PM<BR>>>>> To: Greenseid, Joseph
M.<BR>>>>> Cc: Luke Scharf;
torqueusers@supercluster.org<BR>>>>> Subject: Re: [torqueusers]
pbs_mom unable to chdir to automounted
dirs<BR>>>>><BR>>>>><BR>>>>><BR>>>>>
Good thought... How would I slow down pbs_mom, I tried putting a
sleep<BR>>>>> command in my pbs script and as Luke suggested "ls
$HOME", no<BR>>>>> dice. Do<BR>>>>> I need
to edit the pbs_mom daemon?<BR>>>>><BR>>>>> I guess
another hack would be to mount (via: cd nfsdir) the nfs dirs
on<BR>>>>> the compute nodes, but then after the automounter timed
out or<BR>>>>> reboot, I<BR>>>>> would be in the same
situation. Or to hard mount the nfs dirs (do not<BR>>>>> want
to do this!!)<BR>>>>><BR>>>>> Appreciate your
help.<BR>>>>><BR>>>>> Greenseid, Joseph M.
wrote:<BR>>>>><BR>>>>> <BR>>>>>>
I don't have a real useful suggestion, but just a thought --
could<BR>>>>>> it simply be a timing issue in that pbs_mom is
trying to stat a<BR>>>>>> file or the directory before it's been
fully mounted? It may take<BR>>>>>> a second to get the
directory mounted if it wasn't already, and<BR>>>>>> maybe PBS is
too fast for the auto-mounter, esp if the NFS server<BR>>>>>> is
under some sort of load and could be taking a little longer
to<BR>>>>>> respond than
normal?<BR>>>>>><BR>>>>>>
--Joe<BR>>>>>><BR>>>>>>
________________________________<BR>>>>>><BR>>>>>>
From: torqueusers-bounces@supercluster.org on behalf of Mary
Ellen<BR>>>>>> Fitzpatrick<BR>>>>>> Sent: Wed
10/22/2008 2:53 PM<BR>>>>>> To: Luke
Scharf<BR>>>>>> Cc:
torqueusers@supercluster.org<BR>>>>>> Subject: Re: [torqueusers]
pbs_mom unable to chdir to automounted
dirs<BR>>>>>><BR>>>>>><BR>>>>>><BR>>>>>>
The node OS is CentOS5 as is the nfs server. The pbs server
is<BR>>>>>> running<BR>>>>>> CentOS4.5. I
have rebooted and chanted... :-)
:-(<BR>>>>>><BR>>>>>> Here is my simple pbs script
and it does not have absolute paths. The<BR>>>>>> script
will run only after the nfs dirs are somehow mounted on
the<BR>>>>>> node. I have tried it with absolute path
names, and it makes no<BR>>>>>>
difference.<BR>>>>>><BR>>>>>> pbs
script:<BR>>>>>> #PBS -l nodes=node1048<BR>>>>>> #
join stderr and stdout and write them to a file<BR>>>>>>> #PBS -j
oe<BR>>>>>> #PBS -o
test3.o<BR>>>>>> # cd into the
working directory<BR>>>>>> cd
$PBS_O_WORKDIR<BR>>>>>> # print out some diagnostic
stuff<BR>>>>>> echo Running on host
`hostname`<BR>>>>>> echo Directory is
`pwd`<BR>>>>>> echo Start time is `date`<BR>>>>>>
# run my commands<BR>>>>>><BR>>>>>> ./dostuff2.pl
data.txt > test3.out1<BR>>>>>><BR>>>>>> # print
out some diagnostic stuff<BR>>>>>> echo Stop time is
`date`<BR>>>>>><BR>>>>>><BR>>>>>><BR>>>>>>
Luke Scharf
wrote:<BR>>>>>><BR>>>>>> <BR>>>>>>>
If it works with the shell, however, the problem almost has to
be<BR>>>>>>> with<BR>>>>>>> something other
than the automounter.<BR>>>>>>><BR>>>>>>>
Are any absolute paths in the qsub script
correct?<BR>>>>>>><BR>>>>>>>
-Luke<BR>>>>>>><BR>>>>>>><BR>>>>>>>
Luke Scharf
wrote:<BR>>>>>>><BR>>>>>>> <BR>>>>>>>>
That looks happy,
too.<BR>>>>>>>><BR>>>>>>>> What is the
underlying OS running on the
node?<BR>>>>>>>><BR>>>>>>>> Have you
tried just rebooting everything while muttering
laments<BR>>>>>>>> about stray alpha-particles to everyone
within earshot?<BR>>>>>>>><BR>>>>>>>>
-Luke<BR>>>>>>>><BR>>>>>>>> Mary Ellen
Fitzpatrick
wrote:<BR>>>>>>>><BR>>>>>>>> <BR>>>>>>>>>
Yeah, that is why I am stumped... because I can cd to nfs
dirs,<BR>>>>>>>>> seems like autofs is working
correctly. But unless the nfs dir is<BR>>>>>>>>>
pre-mounted, pbs_mom can not find it. Very
strange...<BR>>>>>>>>><BR>>>>>>>>>
Yes, getent passwd gives the correct home dir
info<BR>>>>>>>>> [root@node1048 mom_priv]# getent
passwd<BR>>>>>>>>>
mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash<BR>>>>>>>>><BR>>>>>>>>>
Luke Scharf
wrote:<BR>>>>>>>>><BR>>>>>>>>> <BR>>>>>>>>>>
Nothing that you mention looks amiss at first
glance...<BR>>>>>>>>>><BR>>>>>>>>>><BR>>>>>>>>>>
Does the "getent passwd" information for the user have a
correct<BR>>>>>>>>>> home directory on the
node?<BR>>>>>>>>>><BR>>>>>>>>>>
-Luke<BR>>>>>>>>>><BR>>>>>>>>>><BR>>>>>>>>>>
Mary Ellen Fitzpatrick
wrote:<BR>>>>>>>>>><BR>>>>>>>>>> <BR>>>>>>>>>>>
Thanks Luke.<BR>>>>>>>>>>> Right now, my cluster
is one node, with an additional 50 to
be<BR>>>>>>>>>>> brought on-line once I resolve
the automount problem. The job
I<BR>>>>>>>>>>> am running is very simple, no nfs
load on
server.<BR>>>>>>>>>>><BR>>>>>>>>>>>
my $usecp I believe is correct and works properly after the
nfs<BR>>>>>>>>>>> dir is
mounted.<BR>>>>>>>>>>> $usecp *:/fs/userB1
/fs/userB1<BR>>>>>>>>>>><BR>>>>>>>>>>>
My auto.home file:<BR>>>>>>>>>>> userB1
-rw,hard,intr
userB:/userB/u1<BR>>>>>>>>>>><BR>>>>>>>>>>>
auto.master file:<BR>>>>>>>>>>>
#+auto.master<BR>>>>>>>>>>>
/fs
/etc/auto.home<BR>>>>>>>>>>><BR>>>>>>>>>>>
I believe it is an automount issue and I need to tweak
a<BR>>>>>>>>>>>
parameter<BR>>>>>>>>>>> in a config file.
Not sure which one it is at this
point.<BR>>>>>>>>>>><BR>>>>>>>>>>><BR>>>>>>>>>>>
Luke Scharf
wrote:<BR>>>>>>>>>>><BR>>>>>>>>>>> <BR>>>>>>>>>>>>
Mary Ellen Fitzpatrick
wrote:<BR>>>>>>>>>>>><BR>>>>>>>>>>>> <BR>>>>>>>>>>>>>
I have my home dirs nfs exported to all of my compute nodes.
I<BR>>>>>>>>>>>>> can log into the nodes and
cd to the nfs mounted dirs, no
problem.<BR>>>>>>>>>>>>> When I submit a job
to a node and the automounted nfs dirs
are<BR>>>>>>>>>>>>>> not mounted (timed out), I
get the following
error:<BR>>>>>>>>>>>>><BR>>>>>>>>>>>>>
Oct 21 16:08:14 node1047 pbs_mom: No such file or directory
(2)<BR>>>>>>>>>>>>> in TMomFinalizeChild,
PBS: chdir to
'/fs/userB1/mfitzpat'<BR>>>>>>>>>>>>>
failed: No such file or
directory<BR>>>>>>>>>>>>><BR>>>>>>>>>>>>>
If I immediately resubmit the job to the same node, it
will<BR>>>>>>>>>>>>> run. It appears
that pbs wants the automounted nfs dirs to
be<BR>>>>>>>>>>>>> already mounted, then the
job will run. If I hard mount
the<BR>>>>>>>>>>>>>
nfs<BR>>>>>>>>>>>>> home dirs, I have no
problem running the jobs, but I do
not<BR>>>>>>>>>>>>>
want<BR>>>>>>>>>>>>> to do
that.<BR>>>>>>>>>>>>><BR>>>>>>>>>>>>>
Anyone run into this? Trying to figure out if it is a
torque<BR>>>>>>>>>>>>> issue or automount
issue.<BR>>>>>>>>>>>>><BR>>>>>>>>>>>>> <BR>>>>>>>>>>>>
How big is your cluster? How capable is the NFS server?
A<BR>>>>>>>>>>>> job-start is likely to create
a mountstorm, and generate a<BR>>>>>>>>>>>> lot
of<BR>>>>>>>>>>>> I/O. Some servers can
handle it, some
can't.<BR>>>>>>>>>>>><BR>>>>>>>>>>>>
Yay for scaling
issues!<BR>>>>>>>>>>>><BR>>>>>>>>>>>>
-Luke<BR>>>>>>>>>>>><BR>>>>>>>>>>>>
P.S. I second the suggestion of checking the $usecp
value.<BR>>>>>>>>>>>><BR>>>>>>>>>>>> <BR>>>>>>
--<BR>>>>>> Thanks<BR>>>>>> Mary
Ellen<BR>>>>>><BR>>>>>>
_______________________________________________<BR>>>>>>
torqueusers mailing list<BR>>>>>>
torqueusers@supercluster.org<BR>>>>>> <A
href="http://www.supercluster.org/mailman/listinfo/torqueusers">http://www.supercluster.org/mailman/listinfo/torqueusers</A><BR>>>>>><BR>>>>>><BR>>>>>><BR>>>>>><BR>>>>>> <BR>>>>>
--<BR>>>>> Thanks<BR>>>>> Mary
Ellen<BR>>>>><BR>>>>><BR>>>>><BR>>>>><BR>>>>><BR>>>>> <BR>>>>
--<BR>>>> Thanks<BR>>>> Mary
Ellen<BR>>>><BR>>>> <BR>><BR>> <BR><BR>--<BR>Thanks<BR>Mary
Ellen<BR><BR></FONT></P></DIV></BODY></HTML>