<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">This might be way off, but nscd (the name service cache daemon) on the nodes can sometimes cause this.<div><br></div><div>Another thing to try is restarting MOM on the node. I have seen the Torque server do the wrong thing when the name services changed after it was started.</div><div><br></div><div>Hope that helps,</div><div>Prakash</div><div><br><div><div>On Feb 27, 2009, at 6:38 PM, Jim Turner wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><p>I'm trying to submit a job on a cluster where users are authenticated using LDAP to a server external to the cluster. I can log in and ssh (without password) to any node in the cluster. But when I try to submit a job the MOM log says - cannot find user in password file...<br> <br> 02/27/2009 18:16:48;0001;   pbs_mom;Svr;pbs_mom;start_exec, no password entry for user crctst01<br> 02/27/2009 18:16:48;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters<br> 02/27/2009 18:16:48;0001;   pbs_mom;Svr;pbs_mom;exec_bail, exec_bail: sent 0 ABORT requests, should be 3<br> 02/27/2009 18:16:48;0008;   pbs_mom;Job;4.queuesrv1;Job Modified at request of <a href="mailto:PBS_Server@queuesrv1.hpc.louisville.edu">PBS_Server@queuesrv1.hpc.louisville.edu</a><br> 02/27/2009 18:16:48;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply<br> 02/27/2009 18:16:48;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop<br> 02/27/2009 18:16:48;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat<br> 02/27/2009 18:16:48;0008;   pbs_mom;Job;4.queuesrv1;checking job post-processing routine<br> 02/27/2009 18:16:48;0080;   pbs_mom;Job;4.queuesrv1;obit sent to server<br> 02/27/2009 18:16:48;0001;   pbs_mom;Svr;pbs_mom;Success (0) in fork_to_user, cannot find user 'crctst01' in password file<br> 02/27/2009 18:16:48;0080;   pbs_mom;Req;req_reject;Reject reply 
code=15023(Bad UID for job execution REJHOST=node312.hpc.louisville.edu MSG=cannot find user 'crctst01' in password file), aux=0, type=CopyFiles, from <a href="mailto:PBS_Server@queuesrv1.hpc.louisville.edu">PBS_Server@queuesrv1.hpc.louisville.edu</a><br> <br> This is that user on the node:<br> <br> crctst01@node312$ getent passwd crctst01<br> crctst01:*:100003:100001:crctst01:/home/crctst01:/bin/bash<br> crctst01@node312$ <br> <br> And if I read the code correctly, I think I'm getting rejected by this fragment in src/resmom/start_exec.c<br> <br> pwdp = getpwnam(ptr);<br> <br>  if (pwdp == NULL)<br>    {<br>    /* FAILURE */<br> <br>    sprintf(log_buffer, "no password entry for user %s",<br>      ptr);<br> <br>    return(NULL);<br>    }<br> <br> My own test case using getpwnam returns the correct value on that node. Anybody got an idea on how to debug this?<br> <br> <b><font face="Verdana">Jim Turner</font></b><font size="2" color="#808080" face="Verdana"><br> Cluster Enablement Team (CET) Senior Engineer<br> phone: 919-543-2505 / mobile: 919-381-8739<br> <a href="mailto:tjim@us.ibm.com">tjim@us.ibm.com</a></font><u><font size="2" color="#0000FF" face="Verdana"><br> </font></u><a href="http://www.ibm.com/systems/services/labservices"><u><font size="2" color="#0000FF" face="Verdana">ibm.com/systems/services/labservices</font></u></a><font size="2" color="#808080" face="Verdana"> <br> </font><u><font size="2" color="#0000FF" face="Verdana"><br> </font></u><font size="2" color="#808080" face="Verdana"><br> </font></p></div> _______________________________________________<br>torqueusers mailing list<br><a href="mailto:torqueusers@supercluster.org">torqueusers@supercluster.org</a><br>http://www.supercluster.org/mailman/listinfo/torqueusers<br></blockquote></div><br></div></body></html>