Hi there,<br><br>my setup is:<br><br>pbs_server: Itanium with Debian Linux<br>pbs_moms: Itanium with HP-UX 11.11<br><br>One of the mom systems is more or less up to date; the other two are behind. I managed to compile TORQUE on all of them with gcc
3.3.3. The up-to-date mom runs fine, whereas the other two aren't accepting any jobs. In particular, qsub -I .. doesn't give me a shell.<br><br>The mom_logs give me this:<br><br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type QueueJob from host xxx received
<br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type QueueJob from host xxx allowed<br>02/27/2007 09:53:20;0008; pbs_mom;Job;dispatch_request;dispatching request QueueJob on sd=9<br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type ReadyToCommit from host xxx received
<br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type ReadyToCommit from host xxx allowed<br>02/27/2007 09:53:20;0008; pbs_mom;Job;dispatch_request;dispatching request ReadyToCommit on sd=9<br>02/27/2007 09:53:20;0008; pbs_mom;Job;167.xxx;ready to commit job
<br>02/27/2007 09:53:20;0008; pbs_mom;Job;167.xxx;ready to commit job completed<br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type Commit from host xxx received<br>02/27/2007 09:53:20;0008; pbs_mom;Job;process_request;request type Commit from host xxx allowed
<br>02/27/2007 09:53:20;0008; pbs_mom;Job;dispatch_request;dispatching request Commit on sd=9<br>02/27/2007 09:53:20;0008; pbs_mom;Job;167.xxx;committing job<br>02/27/2007 09:53:20;0008; pbs_mom;Job;167.xxx;starting job execution
<br>02/27/2007 09:53:20;0001; pbs_mom;Job;job_nodes;0: aaa1/0<br>02/27/2007 09:53:20;0001; pbs_mom;Job;job_nodes;job: 167.xxx numnodes=1 numvnod=1<br>02/27/2007 09:53:20;0001; pbs_mom;Job;167.xxx;phase 2 of job launch successfully completed
<br>02/27/2007 09:53:25;0001; pbs_mom;Job;167.xxx;job not ready after 5 second timeout, MOM will recheck<br>02/27/2007 09:53:25;0008; pbs_mom;Job;167.xxx;job execution started<br>02/27/2007 09:53:26;0001; pbs_mom;Job;167.xxx;job
167.xxx child not started, will check later<br>02/27/2007 09:53:26;0001; pbs_mom;Svr;pbs_mom;pbs_mom, wait_request failed<br>02/27/2007 09:53:27;0001; pbs_mom;Job;167.xxx;job 167.xxx child not started, will check later
<br><br>The last message is repeated several times.<br><br>gdb pbs_mom gives this:<br><br>HP gdb 2.1<br>Copyright 1986 - 1999 Free Software Foundation, Inc.<br>Hewlett-Packard Wildebeest 2.1 (based on GDB 5.0-hpwdb-20000630)<br>Wildebeest is free software, covered by the GNU General Public License, and
<br>you are welcome to change it and/or distribute copies of it under certain<br>conditions. Type "show copying" to see the conditions. There is<br>absolutely no warranty for Wildebeest. Type "show warranty" for details.
<br>Wildebeest was built for PA-RISC 1.1 or 2.0 (narrow), HP-UX 11.00.<br>..<br>(gdb) run<br>Starting program: /opt/torque/sbin/pbs_mom<br>MOM is up<br>do_rpp: got a resource monitor request<br>do_rpp: got a resource monitor request
<br>saving extra job info stdout=0 stderr=0 taskid=1 nodeid=0<br>===== MD5 FFB6F44242AB30CD43FA2A743616A6B4<br>mom_do_poll: entered<br>warning: reading `r3' register: No data <----- This seems buggy
<br>Detaching after fork from process 16158<br>mom_close_poll: entered<br>saving extra job info stdout=-1 stderr=-1 taskid=2 nodeid=0<br>pbs_mom: pbs_mom, wait_request failed <--- this looks bad too
<br>mom_get_sample: entered<br>sessions[0]: pid 878 sid 1528<br>sessions[1]: pid 872 sid 1528<br>sessions[1]: pid 871 sid 1528<br>sessions[1]: pid 879 sid 1528<br>sessions[1]: pid 880 sid 1528<br>sessions[1]: pid 876 sid 1528
<br>sessions[1]: pid 875 sid 1528<br>sessions[1]: pid 877 sid 1528<br>sessions[1]: pid 873 sid 1528<br>sessions[1]: pid 874 sid 1528<br>sessions[0]: pid 878 sid 1528<br>sessions[1]: pid 872 sid 1528<br>sessions[1]: pid 871 sid 1528
<br>sessions[1]: pid 879 sid 1528<br>sessions[1]: pid 880 sid 1528<br>sessions[1]: pid 876 sid 1528<br>sessions[1]: pid 875 sid 1528<br>sessions[1]: pid 877 sid 1528<br>sessions[1]: pid 873 sid 1528<br>sessions[1]: pid 874 sid 1528
<br>nusers[0]: pid 878 uid 30<br>nusers[1]: pid 872 uid 30<br>nusers[1]: pid 871 uid 30<br>nusers[1]: pid 879 uid 30<br>nusers[1]: pid 880 uid 30<br>nusers[1]: pid 876 uid 30<br>nusers[1]: pid 875 uid 30<br>nusers[1]: pid 877 uid 30
<br>nusers[1]: pid 873 uid 30<br>nusers[1]: pid 874 uid 30<br>mom_get_sample: entered<br><br>pbs_mom seems to be unable to start a child process. What could be causing this?<br><br>Wilhelm<br><br>