[torqueusers] Jobs stay in Q state w/ all nodes free; server logs show connect() failed [Torque 4.2.3.1]

Gus Correa gus at ldeo.columbia.edu
Tue Jul 23 11:48:39 MDT 2013


Thank you, David.

I followed your suggestion, built and installed maui 3.3.1,
stopped pbs_sched, started maui.

Now simple serial "hostname"-type jobs run when submitted,
with no need for qrun.
I still need to try parallel jobs.
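
For anyone who wants to reproduce the test, something along these lines
should do it (the Maui path below is just the default install prefix
from its configure script; adjust to your layout):

   # stop the old scheduler and start Maui instead
   kill $(pidof pbs_sched)
   /usr/local/maui/sbin/maui

   # submit a trivial serial job; it should go Q -> R -> C on its own
   echo hostname | qsub -l nodes=1
   qstat -a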

A number of questions remain, at least to me:

1) Has pbs_sched been phased out in the Torque 4.X series?

At least in 4.2.3.1 it doesn't seem to work, as per everything
I reported in this thread.
However, it is (or used to be) good enough for small clusters.
I used it for quite a while in small production clusters.
It is (was?) also a good tool to test if Torque is working.
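
As an aside, one thing worth ruling out before blaming pbs_sched itself
is the server's "scheduling" attribute; if it is False, the server never
asks any scheduler for a cycle and jobs just sit in Q:

   qmgr -c 'print server' | grep scheduling
   # if it reports "scheduling = False", turn it on:
   qmgr -c 'set server scheduling = True'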

2) Why am I getting these
"undefined symbol: _Z14read_ac_socketiPvl"
errors in the secure logs on the compute nodes?

*****************
Jul 22 19:10:51 node08 sshd[6845]: PAM unable to
dlopen(/lib64/security/pam_pbssimpleauth.so):
/lib64/security/pam_pbssimpleauth.so: undefined symbol:
_Z14read_ac_socketiPvl
Jul 22 19:10:51 node08 sshd[6845]: PAM adding faulty module:
/lib64/security/pam_pbssimpleauth.so
******************

Is the pam_pbssimpleauth build somehow broken in Torque 4.2.3.1?
I want to use it, so this is concerning.
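
For what it's worth, the missing symbol demangles to a plain Torque
function, which suggests the PAM module was linked without whatever
object or library provides it. The standard binutils checks show this:

   c++filt _Z14read_ac_socketiPvl
   # -> read_ac_socket(int, void*, long)

   # list the module's unresolved symbols and its recorded dependencies
   nm -D /lib64/security/pam_pbssimpleauth.so | grep ' U '
   ldd /lib64/security/pam_pbssimpleauth.so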

**

3) Why does the self-extracting torque-pam package install
pam_pbssimpleauth.* in /lib/security, instead of /lib64/security,
on an x86_64 system?

[Odd, because "make install" puts the PAM libraries in the right place,
/lib64/security.]
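
Until the package is fixed, a possible workaround (assuming the files
that land in /lib/security are otherwise intact) is simply to move them
to the 64-bit directory:

   ls -l /lib/security/pam_pbssimpleauth.*
   mv /lib/security/pam_pbssimpleauth.* /lib64/security/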

**

I still have to build OpenMPI on top of Torque and try parallel jobs.
However, I remain a bit worried about using the 4.2.3.1 version,
given the errors I reported.
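
In case it is useful to others: OpenMPI picks up Torque's TM launcher
at configure time via --with-tm, e.g. (the prefix below is from my
install and will differ elsewhere):

   ./configure --with-tm=/opt/torque/4.2.3.1/gnu-4.4.7 ...
   # mpiexec inside a Torque job then gets its node list from the moms,
   # so no machinefile is needed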

Thank you,
Gus Correa

On 07/23/2013 11:47 AM, David Beer wrote:
> I'm just thinking that if you say the job runs fine when a qrun is
> executed but the scheduler doesn't start them, you probably want to look
> into why the scheduler isn't scheduling them. I don't know how to debug
> pbs_sched (or even how to begin). Are you planning to go into production
> with Maui? If you are, I would try that out and see if it gives you any
> problems.
>
> David
>
>
> On Tue, Jul 23, 2013 at 8:15 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     Hi David
>
>     As I said, for now I am just using pbs_sched to test functionality.
>     I will install Maui later, once Torque gets to work right.
>
>     Yesterday's scheduler log is below.
>     I haven't submitted a job today.
>
>     If there is a simple solution, please let me know.
>     Otherwise, I may need to try an older Torque version.
>     This is a machine waiting to enter production.
>
>     Thank you,
>     Gus Correa
>
>     07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;Log;Log opened
>     07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;TokenAct;Account file
>     /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
>     07/22/2013 18:48:38;0002;
>     pbs_sched.11811;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched
>     startup pid 11811
>     07/22/2013 18:48:39;0080; pbs_sched.11811;Svr;main;brk point 29609984
>     07/22/2013 18:49:45;0080; pbs_sched.11811;Svr;main;brk point 29872128
>     07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;die;caught signal 15
>     07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;Log;Log closed
>     07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;Log;Log opened
>     07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;TokenAct;Account file
>     /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
>     07/22/2013 19:02:27;0002;
>     pbs_sched.13315;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched
>     startup pid 13315
>     07/22/2013 19:03:01;0080; pbs_sched.13315;Svr;main;brk point 40480768
>     07/22/2013 19:04:07;0080; pbs_sched.13315;Svr;main;brk point 40742912
>
>
>     On 07/23/2013 01:00 AM, David Beer wrote:
>      > Gus,
>      >
>      > What scheduler are you using? What do your scheduler logs say?
>      >
>      > David
>      >
>      >
>      > On Mon, Jul 22, 2013 at 5:53 PM, Gus Correa
>      > <gus at ldeo.columbia.edu> wrote:
>      >
>      >     Sorry. The subject line should read Torque 4.2.3.1, of course.
>      >
>      >     On 07/22/2013 07:50 PM, Gus Correa wrote:
>      > > Hello Torque experts
>      > >
>      > > I am trying Torque 4.2.3.1,
>      > > just with pbs_sched for the initial testing.
>      > > pbsnodes shows all nodes "free".
>      > > However, if I submit a job (simple, serial, hostname only),
>      > > the job stays in Q state forever, and only runs with qrun.
>      > > The server log shows messages like these:
>      > >
>      > > ******************************************************************
>      > > 07/22/2013 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;
>      > > LOG_ERROR::Operation now in progress (115) in tcp_connect_sockaddr,
>      > > Failed when trying to open tcp connection - connect() failed
>      > > [rc = 15096] [addr = 10.10.1.8:15003]
>      > >
>      > > 07/22/2013 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;
>      > > LOG_ERROR::send_hierarchy, Could not send mom hierarchy to
>      > > host node08:15003
>      > > ******************************************************************
>      > >
>      > > ... and goes on and on for the various nodes.
>      > >
>      > > I already restarted the server, the moms, and the scheduler several
>      > > times, but yanking them doesn't seem to do the trick.
>      > >
>      > > I found similar error reports in the mailing list,
>      > > but no clear solution.
>      > > Is there any?
>      > > Better use an older version of Torque?
>      > > Which one is free from this error?
>      > >
>      > > Thank you for your help,
>      > > Gus Correa
>      >
>      > --
>      > David Beer | Senior Software Engineer
>      > Adaptive Computing
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


