[torqueusers] Jobs stay in Q state w/ all nodes free; server logs show connect() failed [Torque 4.2.3.1]

David Beer dbeer at adaptivecomputing.com
Tue Jul 23 09:47:57 MDT 2013


If the job runs fine when you qrun it but the scheduler never starts it,
you probably want to look into why the scheduler isn't scheduling it.
I don't know how to debug pbs_sched (or even where to begin). Are you
planning to go into production with Maui? If you are, I would try that
out now and see if it gives you any problems.
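
Separately, since your server log shows connect() failing against
10.10.1.8:15003, it may be worth ruling out plain network reachability
from the pbs_server host. Below is a minimal Python sketch, not anything
that ships with Torque: the function names are made up for illustration,
and the host and port are simply the ones from your log. It only attempts
a blocking TCP connect, so a success means the port accepts connections,
not that the mom is healthy.

```python
import socket


def mom_port_reachable(host, port=15003, timeout=3.0):
    """Return True if a blocking TCP connect to host:port succeeds.

    The default port is the one that appears in the server log above;
    the function name itself is hypothetical, not a Torque API.
    """
    try:
        # create_connection resolves the name and connects, raising
        # OSError (refused, timed out, unresolvable host) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, port=15003, timeout=3.0):
    """Map each host name to True/False depending on reachability."""
    return {host: mom_port_reachable(host, port, timeout) for host in hosts}
```

Run on the server host, something like check_nodes(["node08", "10.10.1.8"])
would show which nodes accept the connection. If a node comes back False,
I would check firewalls and whether pbs_mom is actually listening there.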

David


On Tue, Jul 23, 2013 at 8:15 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi David
>
> As I said, just to test functionality for now I am using pbs_sched.
> I will install Maui later, once Torque gets to work right.
>
> Yesterday's scheduler log is below.
> I haven't submitted a job today.
>
> If there is a simple solution, please let me know.
> Otherwise, I may need to try an older Torque version.
> This is a machine waiting to enter production.
>
> Thank you,
> Gus Correa
>
> 07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;Log;Log opened
> 07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;TokenAct;Account file /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
> 07/22/2013 18:48:38;0002; pbs_sched.11811;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched startup pid 11811
> 07/22/2013 18:48:39;0080; pbs_sched.11811;Svr;main;brk point 29609984
> 07/22/2013 18:49:45;0080; pbs_sched.11811;Svr;main;brk point 29872128
> 07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;die;caught signal 15
> 07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;Log;Log closed
> 07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;Log;Log opened
> 07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;TokenAct;Account file /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
> 07/22/2013 19:02:27;0002; pbs_sched.13315;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched startup pid 13315
> 07/22/2013 19:03:01;0080; pbs_sched.13315;Svr;main;brk point 40480768
> 07/22/2013 19:04:07;0080; pbs_sched.13315;Svr;main;brk point 40742912
>
>
> On 07/23/2013 01:00 AM, David Beer wrote:
> > Gus,
> >
> > What scheduler are you using? What do your scheduler logs say?
> >
> > David
> >
> >
> > On Mon, Jul 22, 2013 at 5:53 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >
> >     Sorry. The subject line should read Torque 4.2.3.1, of course.
> >
> >     On 07/22/2013 07:50 PM, Gus Correa wrote:
> >      > Hello Torque experts
> >      >
> >      > I am trying Torque 4.2.3.1,
> >      > just with pbs_sched for the initial testing.
> >      > pbsnodes shows all nodes "free".
> >      > However, if I submit a job (simple, serial, hostname only),
> >      > the job stays in Q state forever, and only runs with qrun.
> >      > The server log shows messages like these:
> >      >
> >      > ******************************************************************
> >      > 07/22/2013 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;LOG_ERROR::Operation now in progress (115) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.10.1.8:15003]
> >      >
> >      > 07/22/2013 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host node08:15003
> >      > ******************************************************************
> >      >
> >      > ... and goes on and on for the various nodes.
> >      >
> >      > I already restarted the server, the moms, and the scheduler several
> >      > times, but yanking them doesn't seem to do the trick.
> >      >
> >      > I found similar error reports in the mailing list,
> >      > but no clear solution.
> >      > Is there any?
> >      > Would it be better to use an older version of Torque?
> >      > Which one is free from this error?
> >      >
> >      > Thank you for your help,
> >      > Gus Correa
> >      > _______________________________________________
> >      > torqueusers mailing list
> >      > torqueusers at supercluster.org
> >      > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
> >
> >
>
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing