Check whether pbs_sched is running. Also check that iptables is turned off on the pbs_server host.<br><br><div class="gmail_quote">On Mon, Dec 15, 2008 at 10:43 PM, Adrian Sevcenco <span dir="ltr"><Adrian.Sevcenco@cern.ch></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">Greenseid, Joseph M. wrote:<br>
> what scheduler are you using? are you using torque's scheduler, or<br>
> maui, or something else?<br>
</div>Hi,<br>
I am using Maui. Do you think the problem could be there?<br>
Now I see that when I try to restart maui I get:<br>
ERROR: lost connection to server<br>
ERROR: cannot request service (status)<br>
This is my maui.cfg:<br>
[root@grid01 maui]# cat maui.cfg<br>
# MAUI configuration example<br>
<br>
SERVERHOST <a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a><br>
ADMIN1 root<br>
ADMIN3 edginfo rgma edguser<br>
ADMINHOSTS <a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a><br>
RMCFG[base] TYPE=PBS<br>
SERVERPORT 40559<br>
SERVERMODE NORMAL<br>
<br>
# Set PBS server polling interval. If you have short queues and/or<br>
# jobs it is worth setting a short interval. (10 seconds)<br>
<br>
RMPOLLINTERVAL 00:00:10<br>
<br>
# a max. 10 MByte log file in a logical location<br>
<br>
LOGFILE /var/log/maui.log<br>
LOGFILEMAXSIZE 10000000<br>
LOGLEVEL 1<br>
<br>
# Set the delay to 1 minute before Maui tries to run a job again,<br>
# in case it failed to run the first time.<br>
# The default value is 1 hour.<br>
<br>
DEFERTIME 00:01:00<br>
<br>
# Necessary for MPI grid jobs<br>
ENABLEMULTIREQJOBS TRUE<br>
<br>
Any ideas, anyone?<br>
Thanks,<br>
<font color="#888888">Adrian<br>
</font><div><div></div><div class="Wj3C7c"><br>
<br>
> --Joe<br>
><br>
> ------------------------------------------------------------------------<br>
> *From:* <a href="mailto:torqueusers-bounces@supercluster.org">torqueusers-bounces@supercluster.org</a> on behalf of Adrian Sevcenco<br>
> *Sent:* Mon 12/15/2008 10:40 AM<br>
> *To:* <a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>
> *Subject:* [torqueusers] jobs stuck in Q<br>
><br>
> Hi,<br>
> I have a server on which jobs are stuck in the queue. I get this output<br>
> from qstat -f:<br>
> Job Id: <a href="http://2.grid01.spacescience.ro" target="_blank">2.grid01.spacescience.ro</a><br>
> Job_Name = STDIN<br>
> Job_Owner = <a href="mailto:alice001@grid01.spacescience.ro">alice001@grid01.spacescience.ro</a><br>
> job_state = Q<br>
> queue = alice<br>
> server = <a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a><br>
> Checkpoint = u<br>
> ctime = Mon Dec 15 17:19:39 2008<br>
> Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2<br>
> Hold_Types = n<br>
> Join_Path = n<br>
> Keep_Files = n<br>
> Mail_Points = a<br>
> mtime = Mon Dec 15 17:20:22 2008<br>
> Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2<br>
> Priority = 0<br>
> qtime = Mon Dec 15 17:20:49 2008<br>
> Rerunable = True<br>
> Resource_List.cput = 48:00:00<br>
> Resource_List.walltime = 72:00:00<br>
> Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,<br>
> PBS_O_LOGNAME=alice001,<br>
><br>
> PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin<br>
> :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,<br>
> PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,<br>
> PBS_SERVER=<a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a>,PBS_O_HOST=<a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a>,<br>
> PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice<br>
> etime = Mon Dec 15 17:20:49 2008<br>
> submit_args = -q alice<br>
><br>
> and momctl on a worker node gives me this:<br>
> [root@grid01 ~]# momctl -d 3 -h wn01<br>
><br>
> Host: <a href="http://wn01.spacescience.ro/wn01.spacescience.ro" target="_blank">wn01.spacescience.ro/wn01.spacescience.ro</a> Version:<br>
> 2.3.0-snap.200801151629 PID: 7248<br>
> Server[0]: <a href="http://grid01.spacescience.ro" target="_blank">grid01.spacescience.ro</a> (<a href="http://172.16.0.254" target="_blank">172.16.0.254</a>)<br>
> Init Msgs Received: 0 hellos/1 cluster-addrs<br>
> Init Msgs Sent: 1 hellos<br>
> Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)<br>
> Last Msg To Server: 21 seconds<br>
> HomeDirectory: /var/spool/pbs/mom_priv<br>
> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks<br>
> available)<br>
> NOTE: syslog enabled<br>
> MOM active: 284244 seconds<br>
> Server Update Interval: 45 seconds<br>
> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)<br>
> Communication Model: RPP<br>
> MemLocked: TRUE (mlock)<br>
> TCP Timeout: 20 seconds<br>
> Prolog: /var/spool/pbs/mom_priv/prologue (disabled)<br>
> Alarm Time: 0 of 10 seconds<br>
> Trusted Client List:<br>
> <a href="http://172.16.0.5" target="_blank">172.16.0.5</a>,<a href="http://172.16.0.4" target="_blank">172.16.0.4</a>,<a href="http://172.16.0.3" target="_blank">172.16.0.3</a>,<a href="http://172.16.0.2" target="_blank">172.16.0.2</a>,<a href="http://172.16.0.254" target="_blank">172.16.0.254</a>,<a href="http://172.16.0.1" target="_blank">172.16.0.1</a>,<a href="http://127.0.0.1" target="_blank">127.0.0.1</a><br>
> Copy Command: /usr/bin/scp -rpB<br>
> NOTE: no local jobs detected<br>
><br>
> diagnostics complete<br>
><br>
> What could be wrong, and where should I look?<br>
> Thanks for any help,<br>
> Adrian<br>
><br>
</div></div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>Regards--<br>Rishi Pathak<br>Pune-Maharastra<br>
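A rough way to test the firewall suggestion from the machine where the Maui client commands fail is to read SERVERHOST and SERVERPORT from maui.cfg and try a plain TCP connection. The sketch below assumes that SERVERPORT is the TCP port the Maui daemon listens on for its client commands; the helper names `parse_maui_cfg` and `port_open` are made up for illustration and are not part of Maui or TORQUE.

```python
import socket


def parse_maui_cfg(text):
    """Return a dict of parameter name -> value string from maui.cfg-style text.

    Assumes the simple "NAME value" layout shown in the quoted config;
    everything after a '#' on a line is treated as a comment.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)  # name, then the rest of the line
        params[parts[0]] = parts[1] if len(parts) > 1 else ""
    return params


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, filtered, or name lookup failed
        return False


# A few lines copied from the maui.cfg quoted in this thread.
SAMPLE = """\
# MAUI configuration example
SERVERHOST grid01.spacescience.ro
RMCFG[base] TYPE=PBS
SERVERPORT 40559
"""

cfg = parse_maui_cfg(SAMPLE)
print(cfg["SERVERHOST"], cfg["SERVERPORT"])
```

Calling `port_open(cfg["SERVERHOST"], int(cfg["SERVERPORT"]))` then reports whether the port is reachable. A False result does not distinguish between a firewall rule and a daemon that is simply not running, so it is worth checking both.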