<HTML dir=ltr><HEAD><TITLE>Re: [torqueusers] jobs stuck in Q</TITLE>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.3429" name=GENERATOR></HEAD>
<BODY>
<DIV id=idOWAReplyText55449 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>Sorry, I didn't even think; this won't work with the server not running...</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> torqueusers-bounces@supercluster.org on behalf of Greenseid, Joseph M.<BR><B>Sent:</B> Mon 12/15/2008 1:58 PM<BR><B>To:</B> Adrian Sevcenco<BR><B>Cc:</B> torqueusers@supercluster.org<BR><B>Subject:</B> RE: [torqueusers] jobs stuck in Q<BR></FONT><BR></DIV>
<DIV dir=ltr>
<DIV id=idOWAReplyText60523 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>What does `checkjob 2` show you (where 2 is the job ID, as taken from your first email)?</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>--Joe</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> torqueusers-bounces@supercluster.org on behalf of Adrian Sevcenco<BR><B>Sent:</B> Mon 12/15/2008 12:13 PM<BR><B>To:</B> Greenseid, Joseph M.<BR><B>Cc:</B> torqueusers@supercluster.org<BR><B>Subject:</B> Re: [torqueusers] jobs stuck in Q<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Greenseid, Joseph M. wrote:<BR>> what scheduler are you using? are you using torque's scheduler, or<BR>> maui, or something else?<BR>Hi,<BR>I am using Maui. Do you think the problem could be there?<BR>Now I see that when I try to restart maui.cfg I get:<BR>ERROR: lost connection to server<BR>ERROR: cannot request service (status)<BR>This is my maui.cfg:<BR>[root@grid01 maui]# cat maui.cfg<BR># MAUI configuration example<BR><BR>SERVERHOST grid01.spacescience.ro<BR>ADMIN1 root<BR>ADMIN3 edginfo rgma edguser<BR>ADMINHOSTS grid01.spacescience.ro<BR>RMCFG[base] TYPE=PBS<BR>SERVERPORT 40559<BR>SERVERMODE NORMAL<BR><BR># Set PBS server polling interval. If you have short<BR># queues or/and jobs it is worth to set a short interval. (10 seconds)<BR><BR>RMPOLLINTERVAL 00:00:10<BR><BR># a max. 10 MByte log file in a logical location<BR><BR>LOGFILE /var/log/maui.log<BR>LOGFILEMAXSIZE 10000000<BR>LOGLEVEL 1<BR><BR># Set the delay to 1 minute before Maui tries to run a job again,<BR># in case it failed to run the first time.<BR># The default value is 1 hour.<BR><BR>DEFERTIME 00:01:00<BR><BR># Necessary for MPI grid jobs<BR>ENABLEMULTIREQJOBS TRUE<BR><BR>Any ideas, anyone?<BR>Thanks,<BR>Adrian<BR><BR><BR>> --Joe<BR>><BR>> ------------------------------------------------------------------------<BR>> *From:* torqueusers-bounces@supercluster.org on behalf of Adrian Sevcenco<BR>> *Sent:* Mon 12/15/2008 10:40 AM<BR>> *To:* torqueusers@supercluster.org<BR>> *Subject:* [torqueusers] jobs stuck in Q<BR>><BR>> Hi,<BR>> I have a server on which jobs are stuck in the queue.
i have this output<BR>> from qstat -f :<BR>> Job Id: 2.grid01.spacescience.ro<BR>> Job_Name = STDIN<BR>> Job_Owner = alice001@grid01.spacescience.ro<BR>> job_state = Q<BR>> queue = alice<BR>> server = grid01.spacescience.ro<BR>> Checkpoint = u<BR>> ctime = Mon Dec 15 17:19:39 2008<BR>> Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2<BR>> Hold_Types = n<BR>> Join_Path = n<BR>> Keep_Files = n<BR>> Mail_Points = a<BR>> mtime = Mon Dec 15 17:20:22 2008<BR>> Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2<BR>> Priority = 0<BR>> qtime = Mon Dec 15 17:20:49 2008<BR>> Rerunable = True<BR>> Resource_List.cput = 48:00:00<BR>> Resource_List.walltime = 72:00:00<BR>> Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,<BR>> PBS_O_LOGNAME=alice001,<BR>><BR>> PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin<BR>> :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,<BR>> PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,<BR>> PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,<BR>> PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice<BR>> etime = Mon Dec 15 17:20:49 2008<BR>> submit_args = -q alice<BR>><BR>> and a momctl on a wn gives me this :<BR>> [root@grid01 ~]# momctl -d 3 -h wn01<BR>><BR>> Host: wn01.spacescience.ro/wn01.spacescience.ro Version:<BR>> 2.3.0-snap.200801151629 PID: 7248<BR>> Server[0]: grid01.spacescience.ro (172.16.0.254)<BR>> Init Msgs Received: 0 hellos/1 cluster-addrs<BR>> Init Msgs Sent: 1 hellos<BR>> Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)<BR>> Last Msg To Server: 21 seconds<BR>> HomeDirectory: /var/spool/pbs/mom_priv<BR>> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks<BR>> available)<BR>> NOTE: syslog enabled<BR>> MOM active: 284244 seconds<BR>> Server Update Interval: 45 seconds<BR>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)<BR>> Communication Model: RPP<BR>> MemLocked: TRUE (mlock)<BR>> TCP Timeout: 20 seconds<BR>> Prolog: 
/var/spool/pbs/mom_priv/prologue (disabled)<BR>> Alarm Time: 0 of 10 seconds<BR>> Trusted Client List:<BR>> 172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1<BR>> Copy Command: /usr/bin/scp -rpB<BR>> NOTE: no local jobs detected<BR>><BR>> diagnostics complete<BR>><BR>> What could be wrong, and where should I look?<BR>> Thanks for any help,<BR>> Adrian<BR>><BR></FONT></P></DIV></DIV></BODY></HTML>