Hi Gus,<br>By default, I can submit jobs from the nodes.<br>However, I still get the same error as below when my PBS script tries to resubmit jobs:<br>/var/spool/torque/mom_priv/jobs/127.master.SC: line 13: qsub: command not found<br>
<br>It seems that "qsub" is not recognized inside the PBS script. However, if I use /usr/local/bin/qsub, my script works.<br><br>So how can I let the PBS script know the path to qsub?<br><br><br>Cheers,<br>Shibo Kuang<br>
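One way to make a job script independent of the PATH it inherits is to look up qsub in a few known directories and call it by its full path. The sketch below is only an illustration: the helper name find_in_dirs and the list of candidate directories are assumptions, not part of Torque.<br>

```shell
#!/bin/bash
# Hypothetical helper (not a Torque command): print the full path of the
# command named $1 from the first listed directory that contains it as an
# executable file; return non-zero if no directory has it.
find_in_dirs() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Possible use inside the job script (directories are assumptions):
#   QSUB=$(find_in_dirs qsub /usr/local/bin /usr/bin) || exit 1
#   "$QSUB" case
```

Calling qsub through an absolute path matches the observation above that /usr/local/bin/qsub works from inside the script.<br>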
<br><div class="gmail_quote">On Thu, Mar 11, 2010 at 10:12 AM, Gus Correa <span dir="ltr"><<a href="mailto:gus@ldeo.columbia.edu">gus@ldeo.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi Shibo<br>
<br>
Sorry, I forgot this important step.<br>
On your master node do this (you may need to do this as root,<br>
or using "su" or "sudo", unless the user shibo is also a Torque<br>
administrator):<br>
<br>
qmgr -c "set server allow_node_submit = True"<br>
<br>
to allow jobs to be submitted from all nodes,<br>
not only from the master.<br>
<br>
To confirm that the server configuration changed,<br>
do:<br>
<br>
qmgr -c "print server"<br>
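<br>
If you want to check the setting from a script rather than by eye, a small filter over the "print server" output could be used. This is a sketch only: the helper name node_submit_enabled is made up, and it assumes the attribute is printed on its own line in the form "set server allow_node_submit = True".<br>

```shell
# Hypothetical helper: succeed if the text passed as $1 (the output of
# qmgr -c "print server") has allow_node_submit set to True.
node_submit_enabled() {
    printf '%s\n' "$1" |
        grep -Eiq '^[[:space:]]*set server allow_node_submit[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}

# Possible use on the master:
#   node_submit_enabled "$(qmgr -c 'print server')" && echo "node submission enabled"
```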
<br>
<br>
Also:<br>
<br>
1) From what you say, it looks like your qsub is in /usr/local/bin/qsub,<br>
not in /var/spool/torque/bin (my wrong guess).<br>
2) There are no torque.sh and torque.csh files in /etc/profile.d.<br>
You would need to *create* them.<br>
However, this may not be necessary, as your Torque qsub command is<br>
installed in /usr/local/bin, which is likely to be in your PATH already.<div class="im"><br>
<br>
I hope this helps.<br>
Gus Correa<br>
---------------------------------------------------------------------<br>
Gustavo Correa<br>
Lamont-Doherty Earth Observatory - Columbia University<br>
Palisades, NY, 10964-8000 - USA<br>
---------------------------------------------------------------------<br>
<br>
<br>
shibo kuang wrote:<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">
Dear Gus,<br>
Thanks for your reply.<br>
I am moving from Windows to Linux to do simulations, and am thus not familiar with Linux.<br>
Resubmission does not work on either the master or the node, although the first submission works on both.<br>
When I run "which qsub" on the master and on the node, both give "/usr/local/bin/qsub".<br>
Using export to set the path (export PATH=/usr/local/bin:${PATH}) does not work. I cannot find torque.sh, so I cannot test the second method you suggested. There is no "/var/spool/torque/bin" folder. Interestingly, /var/spool/torque/pbs_environment contains "PATH=/bin:/usr/bin".<br>
Thanks again; your further suggestions would be greatly appreciated.<br>
Cheers,<br>
Shibo kuang<br>
<br></div><div class="im">
On Thu, Mar 11, 2010 at 4:18 AM, Gus Correa <<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>> wrote:<br>
<br>
Hi Shibo<br>
<br></div><div><div></div><div class="h5">
Glad that your Torque/PBS is now working.<br>
<br>
I would guess the problem you have now with job resubmission<br>
is related to your PATH environment variable.<br>
Somehow Linux cannot find qsub, and I suppose this happens in the<br>
slave node.<br>
<br>
Does it happen in the master node also?<br>
What do you get if you login to the slave node and do "which qsub",<br>
or just "qsub"?<br>
<br>
Again, this is not a Torque problem, more of a Sys Admin issue.<br>
A possible fix may depend a bit on where you installed Torque.<br>
Assuming it is installed in /var/spool/torque/,<br>
add /var/spool/torque/bin to your PATH<br>
in your shell initialization script:<br>
<br>
For csh/tcsh, in your .cshrc/.tcshrc<br>
<br>
setenv PATH /var/spool/torque/bin:${PATH}<br>
<br>
For sh/bash in .profile or maybe .bashrc<br>
<br>
export PATH=/var/spool/torque/bin:${PATH}<br>
<br>
An alternative is to add a torque.sh and a torque.csh file<br>
to the /etc/profile.d directory *on every node* with the<br>
contents above.<br>
(This may depend a bit on which Linux distribution you use.<br>
It works for Fedora, RedHat, and CentOS, may work for others too.)<br>
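<br>
A possible torque.sh for /etc/profile.d is sketched below. It only prepends the directory if it is not already in PATH, so sourcing it twice does no harm. The variable TORQUE_BIN and the function prepend_path are illustrative names, and the default directory is the assumption made above; use whatever directory "which qsub" reports on your system.<br>

```shell
# /etc/profile.d/torque.sh (sketch)
# TORQUE_BIN is an assumption: set it to the directory that holds qsub.
TORQUE_BIN=${TORQUE_BIN:-/var/spool/torque/bin}

# Prepend directory $1 to PATH unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present: leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path "$TORQUE_BIN"
export PATH
```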
<br>
<br>
I hope this helps.<br>
<br>
Gus Correa<br>
<br>
shibo kuang wrote:<br>
<br>
Hi All,<br>
My PBS server now works, with the help of Gus Correa. My<br>
problem was that I had not mounted my master folder on the<br>
nodes. Now I have another problem: automatically restarting<br>
a job.<br>
Below is my script:<br>
#!/bin/bash<br>
#PBS -N inc90<br>
#PBS -q short<br>
#PBS -l walltime=00:08:00<br>
cd $PBS_O_WORKDIR<br>
./nspff >out<br>
if [ -f jobfinished ]; then<br>
rm -f jobfinished<br>
exit 0<br>
fi<br>
sleep 10<br>
qsub case<br>
My code stops at 7 min and is supposed to restart<br>
automatically after 10 s, but it failed with the following error:<br>
/var/spool/torque/mom_priv/jobs/120.master.SC: line 13: qsub: command not found<br></div></div>
<div><div></div><div class="h5"><br>
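The restart step of the script above could be written as a small function that calls qsub by an absolute path. This is a sketch under assumptions: the function name resubmit_unless_finished and the variables QSUB and RESUBMIT_DELAY are made up for illustration, and the default path should be replaced by whatever "which qsub" reports on your cluster.<br>

```shell
#!/bin/bash
# Sketch of the restart logic with an absolute qsub path.
# QSUB is an assumption: set it to the output of "which qsub".
QSUB=${QSUB:-/usr/local/bin/qsub}

# Resubmit the job script $2 unless the marker file $1 exists.
resubmit_unless_finished() {
    if [ -f "$1" ]; then
        rm -f "$1"                     # run is complete: clean up, no resubmit
        return 0
    fi
    sleep "${RESUBMIT_DELAY:-10}"      # same 10 s pause as the original script
    "$QSUB" "$2"
}

# Possible use after "./nspff >out" in the job script:
#   resubmit_unless_finished jobfinished case
```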
<br>
Your help would be greatly appreciated.<br>
Regards,<br>
Shibo Kuang<br>
<br>
On Wed, Mar 10, 2010 at 2:57 AM, Gus Correa<br>
<<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>> wrote:<br>
<br>
Hi Shibo<br>
<br>
Somehow your "slave" computer<br>
doesn't see /home/kuang/sharpbend/s1/r8,<br>
although it can be seen by the "master" computer.<br>
It may be one of several things,<br>
it is hard to tell exactly with the information you gave,<br>
but here are some guesses.<br>
<br>
Do you really have a separate /home/kuang/sharpbend/s1/r8<br>
on your "slave" computer, or is it only present in the<br>
"master"?<br>
You can log in to the "slave" and check this directly<br>
("ls /home/kuang/sharpbend/s1/r8").<br>
If the directory is not there,<br>
this is not really a Torque or MPI problem,<br>
but a Sys Admin problem with exporting and mounting<br>
directories.<br>
<br>
If that directory exists only on the master side,<br>
you can either create an identical copy on the "slave" side<br>
(painful),<br>
or use NFS to export it from the "master" computer to the<br>
"slave" (easier).<br>
<br>
For the second approach, you need to export the /home or<br>
/home/kuang<br>
on the "master" computer, and automount it on the "slave"<br>
computer.<br>
The files you need to edit are /etc/exports (master side),<br>
and /etc/auto.master plus perhaps /etc/auto.home (slave<br>
side).<br>
<br>
A slightly different approach (not using the automounter)<br>
is simply to hard mount /home or /home/kuang<br>
on the "slave" side by adding it to /etc/fstab.<br>
<br>
You also need to turn on the NFS daemon on the "master"<br>
node with<br>
"chkconfig", if it is not yet turned on.<br>
<br>
Read the man pages!<br>
At least read "man exportfs", "man mountd", "man fstab",<br>
and "man chkconfig".<br>
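<br>
The export and hard-mount entries described above can be sketched as follows. The two helper functions are hypothetical and only print the lines; nothing is written to /etc. The host names "master" and "par1" and the options "rw,sync" and "defaults" are assumptions for illustration.<br>

```shell
# Hypothetical helpers that print the lines to add by hand.
exports_line() {
    # $1 = directory to export, $2 = client host allowed to mount it
    printf '%s %s(rw,sync)\n' "$1" "$2"
}

fstab_line() {
    # $1 = NFS server, $2 = directory (same path on both machines)
    printf '%s:%s %s nfs defaults 0 0\n' "$1" "$2" "$2"
}

exports_line /home par1      # line for /etc/exports on the master
fstab_line master /home      # line for /etc/fstab on the node
```

After editing the files, "exportfs -ra" on the master and "mount -a" on the node should apply the changes without a reboot, provided the NFS service is running.<br>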
<br>
You may need to reboot the computers for this to take effect.<br>
Then log in to the "slave" and try<br>
"ls /home/kuang/sharpbend/s1/r8" again.<br>
<br>
I hope this helps.<br>
Gus Correa<br>
<br>
shibo kuang wrote:<br>
<br>
"/home/kuang/sharpbend/s1/r8: No such file or directory."<br>
my node does not have the directory, but my master<br>
has it.<br>
On Sun, Mar 7, 2010 at 1:09 AM, shibo kuang<br>
<<a href="mailto:s.b.kuang@gmail.com" target="_blank">s.b.kuang@gmail.com</a>> wrote:<br>
<br>
Hi,<br>
I just fixed the problem by setting up password-free login<br>
between the computing node and the master.<br>
But now I have another problem:<br>
r8.e19 says<br>
/home/kuang/sharpbend/s1/r8: No such file or directory.<br>
If only one computer is used, the server works normally.<br>
What did I miss when I installed Torque?<br>
Your help would be greatly appreciated.<br>
Cheers,<br>
Shibo Kuang<br>
<br>
<br>
On Sun, Mar 7, 2010 at 12:46 AM, shibo kuang<br>
<<a href="mailto:s.b.kuang@gmail.com" target="_blank">s.b.kuang@gmail.com</a>> wrote:<br></div></div><div><div></div><div class="h5">
<br>
Hi all,<br>
I tried to install a PBS server for my two CentOS Linux<br>
computers (each has 8 cores), but failed.<br>
Here is my problem:<br>
If I use one computer both as the master running pbs_server and<br>
as a computing node, I can submit jobs using a script without any<br>
problem, and all jobs give the expected results. However, when one<br>
computer is the master and the other is a computing node, jobs are<br>
never submitted successfully.<br>
I would appreciate your hints and suggestions regarding the<br>
following messages I got.<br>
Regards,<br>
Shibo Kuang<br>
Return-Path: <<a href="mailto:adm@master" target="_blank">adm@master</a>><br>
Received: from master (localhost [127.0.0.1])<br>
by master (8.13.1/8.13.1) with ESMTP id o26DwKF9006310<br>
for <<a href="mailto:kuang@master" target="_blank">kuang@master</a>>; Sun, 7 Mar 2010 00:28:20 +1030<br>
Received: (from <a href="mailto:root@localhost" target="_blank">root@localhost</a>)<br>
by master (8.13.1/8.13.1/Submit) id o26DwKpZ006293<br>
for <a href="mailto:kuang@master" target="_blank">kuang@master</a>; Sun, 7 Mar 2010 00:28:20 +1030<br>
Date: Sun, 7 Mar 2010 00:28:20 +1030<br>
From: adm <<a href="mailto:adm@master" target="_blank">adm@master</a>><br>
Message-Id: <<a href="mailto:201003061358.o26DwKpZ006293@master" target="_blank">201003061358.o26DwKpZ006293@master</a>><br>
To: <a href="mailto:kuang@master" target="_blank">kuang@master</a><br>
Subject: PBS JOB 18.master<br>
Precedence: bulk<br>
PBS Job Id: 18.master<br>
Job Name: r8<br>
Exec host: par1/0<br>
An error has occurred processing your job, see below.<br>
Post job file processing error; job 18.master on host par1/0<br>
<br>
Unable to copy file /var/spool/torque/spool/18.master.OU to<br>
kuang@master:/home/kuang/sharpbend/s1/r8/r8.o18<br>
*** error from copy<br>
Permission denied (publickey,gssapi-with-mic,password).<br>
lost connection<br>
*** end error output<br>
Output retained on that host in:<br>
/var/spool/torque/undelivered/18.master.OU<br>
<br>
Unable to copy file /var/spool/torque/spool/18.master.ER to<br>
kuang@master:/home/kuang/sharpbend/s1/r8/r8.e18<br>
*** error from copy<br>
Permission denied (publickey,gssapi-with-mic,password).<br>
lost connection<br>
*** end error output<br>
Output retained on that host in:<br>
/var/spool/torque/undelivered/18.master.ER<br>
<br>
<br>
<br>
<br>
------------------------------------------------------------------------<br>
<br>
<br>
<br>
_______________________________________________<br>
torqueusers mailing list<br>
<a href="mailto:torqueusers@supercluster.org" target="_blank">torqueusers@supercluster.org</a><br></div></div>
<div class="im"><br>
<br>
<a href="http://www.supercluster.org/mailman/listinfo/torqueusers" target="_blank">http://www.supercluster.org/mailman/listinfo/torqueusers</a><br>
<br>
<br>
<br>
<br>
<br>
<br>
</div></blockquote>
<br>
</blockquote></div><br>