[torqueusers] pam_pbssimpleauth fails on Torque 4.2.3.1

Gus Correa gus at ldeo.columbia.edu
Tue Jul 23 14:21:40 MDT 2013


Dear Torque experts

As I reported in a related thread,
pam_pbssimpleauth doesn't seem to work on Torque 4.2.3.1.

Regular users are able to ssh to compute nodes where
they do NOT have jobs, even though I am using pam_pbssimpleauth
to block their use of sshd.
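
For reference, the module is wired into sshd's PAM stack on each node
with an "account" line; mine follows the stock recipe from the Torque
docs (adjust the path to wherever the module lives on your distro):

```
# /etc/pam.d/sshd on each compute node (sketch, per the Torque docs):
# deny interactive ssh sessions to users with no active job on this node
account    required    pam_pbssimpleauth.so
```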

The secure logs on the nodes report that sshd cannot dlopen
pam_pbssimpleauth because of the undefined symbol
_Z14read_ac_socketiPvl.

[The function read_ac_socket is called exactly once in pam_pbssimpleauth.c]

Here is a snippet of the secure log showing the failure
to load pam_pbssimpleauth and the unwarranted
login/ssh session of a regular user (me) being opened:

***************************************************************
Jul 23 16:07:34 node34 sshd[10567]: PAM unable to 
dlopen(/lib64/security/pam_pbssimpleauth.so): 
/lib64/security/pam_pbssimpleauth.so: undefined symbol: 
_Z14read_ac_socketiPvl
Jul 23 16:07:34 node34 sshd[10567]: PAM adding faulty module: 
/lib64/security/pam_pbssimpleauth.so
Jul 23 16:07:34 node34 sshd[10567]: Accepted hostbased for guscorrea 
from 10.10.1.100 port 53017 ssh2
Jul 23 16:07:34 node34 sshd[10567]: pam_unix(sshd:session): session 
opened for user guscorrea by (uid=0)
***************************************************************
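
Incidentally, _Z14read_ac_socketiPvl is an Itanium-ABI C++ mangled
name, which suggests the module was compiled as C++ against a
declaration of read_ac_socket but shipped without the object that
defines it.  The mangling is easy to verify with binutils:

```shell
# Demangle the symbol that sshd's PAM loader cannot resolve
# (c++filt ships with binutils):
c++filt _Z14read_ac_socketiPvl
# -> read_ac_socket(int, void*, long)
```

On an affected node, "nm -D /lib64/security/pam_pbssimpleauth.so |
grep read_ac_socket" should then show the symbol flagged U
(undefined), confirming it was never linked in.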

Moreover, the self-extracting Torque PAM package installs
the libraries in /lib/security, not /lib64/security, as it
should on an x86_64 system.  I moved the libraries manually
to /lib64/security.

Oddly, "make install" on the Torque server
installs the libraries correctly in /lib64/security.
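
If you rebuild the packages yourself, one possible workaround
(assuming the standard configure switch behaves the same for the
self-extracting packages; check ./configure --help on your tree)
is to pass the 64-bit PAM directory explicitly:

```shell
# Hypothetical rebuild sketch: force the PAM module into /lib64/security
# before regenerating the self-extracting package scripts.
./configure --prefix=/opt/torque/4.2.3.1/gnu-4.4.7 --with-pam=/lib64/security
make
make packages   # rebuilds the torque-package-*.sh installers
```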

I also checked the Torque 4.2.1 code,
but its pam_pbssimpleauth.c is identical to 4.2.3.1's (no diff).
Hence, I am not hopeful that rolling back to 4.2.1 will solve
this problem.

I should add that I am using Linux CentOS 6.4 with stock
gcc, g++ 4.4.7 on x86_64 AMD boxes.

I found this old thread that reported problems with
"undefined symbol" on pam_pbssimpleauth back in
Torque 2.5.8.
The solution back then was to roll back to 2.5.7.

http://www.supercluster.org/pipermail/torqueusers/2011-October/013485.html


1) Any suggestions regarding the PAM module in the Torque 4.X series?
2) Is there a specific 4.X release where PAM works right?
3) Can the problem be fixed to make pam_pbssimpleauth work?
4) Or should I roll back to the 3.X or 2.X series?

BTW, I have two clusters running Torque 2.4.11 where
pam_pbssimpleauth works nicely.
5) Should I roll back that much?


Many thanks,
Gus Correa


On 07/23/2013 01:48 PM, Gus Correa wrote:
> Thank you, David.
>
> I followed your suggestion, built and installed maui 3.3.1,
> stopped pbs_sched, started maui.
>
> Now the simple serial "hostname" type of jobs run when submitted,
> no need for qrun.
> I still need to try parallel jobs.
>
> There remain a bunch of questions, to me at least:
>
> 1) Is pbs_sched phased out in Torque 4.X series?
>
> At least in 4.2.3.1 it doesn't seem to work, as per all that
> I reported on this thread.
> However, it is (or used to be) good enough for small clusters.
> I used it for quite a while in small production clusters.
> It is (was?) also a good tool to test if Torque is working.
>
> 2) Why am I getting these
> "undefined symbol: _Z14read_ac_socketiPvl"
> errors in the secure logs in the compute nodes?
>
> *****************
> Jul 22 19:10:51 node08 sshd[6845]: PAM unable to
> dlopen(/lib64/security/pam_pbssimpleauth.so):
> /lib64/security/pam_pbssimpleauth.so: undefined symbol:
> _Z14read_ac_socketiPvl
> Jul 22 19:10:51 node08 sshd[6845]: PAM adding faulty module:
> /lib64/security/pam_pbssimpleauth.so
> ******************
>
> Is pam_pbssimpleauth somehow mis-built in Torque 4.2.3.1?
> I want to use it, so this is concerning.
>
> **
>
> 3) Why does the self-extracting torque-pam package install
> pam_pbssimpleauth.* in /lib/security, instead of /lib64/security,
> on an x86_64 system?
>
> [Odd, because "make install" puts the pam libraries in the right place,
> /lib64/security.]
>
> **
>
> I still have to build OpenMPI on top of Torque and try parallel jobs.
> However, I remain a bit worried about using the 4.2.3.1 version,
> given the errors I reported.
>
> Thank you,
> Gus Correa
>
> On 07/23/2013 11:47 AM, David Beer wrote:
>> I'm just thinking that if you say the job runs fine when a qrun is
>> executed but the scheduler doesn't start them, you probably want to look
>> into why the scheduler isn't scheduling them. I don't know how to debug
>> pbs_sched (or even how to begin). Are you planning to go into production
>> with Maui? If you are, I would try that out and see if it gives you any
>> problems.
>>
>> David
>>
>>
>> On Tue, Jul 23, 2013 at 8:15 AM, Gus Correa
>> <gus at ldeo.columbia.edu> wrote:
>>
>> Hi David
>>
>> As I said, just to test functionality for now I am using pbs_sched.
>> I will install Maui later, once Torque gets to work right.
>>
>> Yesterday's scheduler log is below.
>> I haven't submitted a job today.
>>
>> If there is a simple solution, please let me know.
>> Otherwise, I may need to try an older Torque version.
>> This is a machine waiting to enter production.
>>
>> Thank you,
>> Gus Correa
>>
>> 07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;Log;Log opened
>> 07/22/2013 18:48:38;0002; pbs_sched.11810;Svr;TokenAct;Account file
>> /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
>> 07/22/2013 18:48:38;0002;
>> pbs_sched.11811;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched
>> startup pid 11811
>> 07/22/2013 18:48:39;0080; pbs_sched.11811;Svr;main;brk point 29609984
>> 07/22/2013 18:49:45;0080; pbs_sched.11811;Svr;main;brk point 29872128
>> 07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;die;caught signal 15
>> 07/22/2013 19:02:27;0002; pbs_sched.11811;Svr;Log;Log closed
>> 07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;Log;Log opened
>> 07/22/2013 19:02:27;0002; pbs_sched.13314;Svr;TokenAct;Account file
>> /opt/torque/4.2.3.1/gnu-4.4.7/sched_priv/accounting/20130722 opened
>> 07/22/2013 19:02:27;0002;
>> pbs_sched.13315;Svr;main;/opt/torque/4.2.3.1/gnu-4.4.7/sbin/pbs_sched
>> startup pid 13315
>> 07/22/2013 19:03:01;0080; pbs_sched.13315;Svr;main;brk point 40480768
>> 07/22/2013 19:04:07;0080; pbs_sched.13315;Svr;main;brk point 40742912
>>
>>
>> On 07/23/2013 01:00 AM, David Beer wrote:
>> > Gus,
>> >
>> > What scheduler are you using? What do your scheduler logs say?
>> >
>> > David
>> >
>> >
>> > On Mon, Jul 22, 2013 at 5:53 PM, Gus Correa
>> > <gus at ldeo.columbia.edu> wrote:
>> >
>> > Sorry. The subject line should read Torque 4.2.3.1, of course.
>> >
>> > On 07/22/2013 07:50 PM, Gus Correa wrote:
>> > > Hello Torque experts
>> > >
>> > > I am trying Torque 4.2.3.1,
>> > > just with pbs_sched for the initial testing.
>> > > pbsnodes shows all nodes "free".
>> > > However, if I submit a job (simple, serial, hostname only),
>> > > the job stays in Q state forever, and only runs with qrun.
>> > > The server log shows messages like these:
>> > >
>> > > ******************************************************************
>> > > 07/22/2013
>> > > 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;LOG_ERROR::Operation now
>> > > in progress (115) in tcp_connect_sockaddr, Failed when trying to open
>> > > tcp connection - connect() failed [rc = 15096] [addr = 10.10.1.8:15003]
>> > >
>> > > 07/22/2013
>> > > 19:37:40;0001;PBS_Server.13236;Svr;PBS_Server;LOG_ERROR::send_hierarchy,
>> > > Could not send mom hierarchy to host node08:15003
>> > > ******************************************************************
>> > >
>> > > ... and goes on and on for the various nodes.
>> > >
>> > > I already restarted the server, the moms, and the scheduler several
>> > > times, but that doesn't seem to do the trick.
>> > >
>> > > I found similar error reports in the mailing list,
>> > > but no clear solution.
>> > > Is there any?
>> > > Better use an older version of Torque?
>> > > Which one is free from this error?
>> > >
>> > > Thank you for your help,
>> > > Gus Correa
>> > > _______________________________________________
>> > > torqueusers mailing list
>> > > torqueusers at supercluster.org
>> > > http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>> >
>> >
>> >
>> >
>> > --
>> > David Beer | Senior Software Engineer
>> > Adaptive Computing
>> >
>> >
>>
>>
>>
>>
>>
>>
>>
>


