Hi all,<br><br>I'm having a problem with a newly installed torque/maui
system: a lot of jobs fail to run. They get assigned to a node but
then go to the Waiting state in torque.<br><br>I searched through the maui, torque server and mom logs and found the following (this is one of many failing jobs):<br>
<br>showq:<br>BLOCKED JOBS----------------<br>JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME<br><br>1159111 samgrid Hold 1 3:00:00:00 Wed Feb 3 22:29:25<br><br>[root@torque ~]# tracejob 1159111<br>
/var/spool/torque/mom_logs/<div id=":10d" class="ii gt">20100203: No such file or directory<br>/var/spool/torque/sched_logs/20100203: No such file or directory<br><br>Job: <a href="http://1159111.torque.farm.particle.cz/" target="_blank">1159111.torque.farm.particle.cz</a><br>
<br>02/03/2010 22:29:25 S enqueuing into d0prod, state 1 hop 1<br>02/03/2010 22:29:25 S Job Queued at request of <a href="mailto:samgrid@sam2.farm.particle.cz" target="_blank">samgrid@sam2.farm.particle.cz</a>, owner = <a href="mailto:samgrid@sam2.farm.particle.cz" target="_blank">samgrid@sam2.farm.particle.cz</a>, job name = Z063015370, queue = d0prod<br>
02/03/2010 22:29:25 A queue=d0prod<br>02/03/2010 22:30:00 S Job Modified at request of <a href="mailto:root@torque.farm.particle.cz" target="_blank">root@torque.farm.particle.cz</a><br>02/03/2010 22:30:00 S Job Run at request of <a href="mailto:root@torque.farm.particle.cz" target="_blank">root@torque.farm.particle.cz</a><br>
02/03/2010 22:30:00 S Job Modified at request of <a href="mailto:root@torque.farm.particle.cz" target="_blank">root@torque.farm.particle.cz</a><br>02/03/2010 22:30:00 S post_modify_req: PBSE_UNKJOBID for job <a href="http://1159111.torque.farm.particle.cz/" target="_blank">1159111.torque.farm.particle.cz</a> in state RUNNING-STAGEGO, dest = salix37<br>
<br>[root@torque ~]# grep 1159111 /usr/local/maui/log/maui.log<br>02/03 22:57:30 MJobFind('1159111',J,0)<br>02/03 22:57:30 MRMJobPreUpdate(1159111)<br>02/03 22:57:30 MPBSJobUpdate(1159111,<a href="http://1159111.torque.farm.particle.cz/" target="_blank">1159111.torque.farm.particle.cz</a>,TaskList,0)<br>
02/03 22:57:30 __MPBSGetTaskList(1159111,1,TaskList,0)<br>
02/03 22:57:30 INFO: job 1159111 starttime: 1265232592 (00:27:31) presenttime: 1265234243 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold<br>
02/03 22:57:30 MRMJobPostUpdate(1159111,TaskList,Hold,base)<br>02/03 22:57:30 INFO: job '1159111' Priority: 1<br>02/03 22:57:30 INFO: job '1159111' priority: 1.00<br>02/03 22:57:31 INFO: job '1159111' Priority: 1<br>
02/03 22:57:31 INFO: job '1159111' priority: 1.00<br>02/03 22:58:02 INFO: line: ' 1159111 samgrid 1265232592 1265232565 1 259200 - 6 1<br>02/03 22:58:39 MJobFind('1159111',J,0)<br>
02/03 22:58:39 MRMJobPreUpdate(1159111)<br>02/03 22:58:39 MPBSJobUpdate(1159111,<a href="http://1159111.torque.farm.particle.cz/" target="_blank">1159111.torque.farm.particle.cz</a>,TaskList,0)<br>02/03 22:58:39 __MPBSGetTaskList(1159111,1,TaskList,0)<br>
02/03 22:58:39 INFO: job 1159111 starttime: 1265232592 (00:28:40) presenttime: 1265234312 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold<br>
02/03 22:58:39 MRMJobPostUpdate(1159111,TaskList,Hold,base)<br>
02/03 22:58:40 INFO: job '1159111' Priority: 1<br>02/03 22:58:40 INFO: job '1159111' priority: 1.00<br>02/03 22:58:40 INFO: job '1159111' Priority: 1<br>02/03 22:58:40 INFO: job '1159111' priority: 1.00<br>
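<br>A side note on reading the maui INFO lines above: starttime and presenttime look like Unix epoch seconds, and the value in parentheses appears to be the time elapsed since the start attempt. The sketch below is only a rough check of that reading. It assumes the `name: value` layout seen in this one log line, and the helper names `parse_fields` and `hms` are made up for illustration, not part of maui or torque.<br>

```python
import re

# One INFO line copied from the maui log above, joined onto a single line.
LINE = (
    "02/03 22:57:30 INFO: job 1159111 starttime: 1265232592 (00:27:31) "
    "presenttime: 1265234243 wclimit: 259200 mtime: 1265232600 etime: 0 "
    "walltime: 0 state: Hold"
)


def parse_fields(line):
    """Collect the integer 'name: value' pairs found in the line (hypothetical helper)."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)\b", line)}


def hms(seconds):
    """Format a number of seconds as HH:MM:SS (hours are not wrapped at 24)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


fields = parse_fields(LINE)
elapsed = fields["presenttime"] - fields["starttime"]

print("elapsed since starttime:", hms(elapsed))
print("wclimit:", hms(fields["wclimit"]))
```

If that reading is right, the elapsed value matches the (00:27:31) printed by maui, and the wclimit of 259200 seconds corresponds to the 3:00:00:00 shown by showq, so maui still counts the job as started at about 22:30 while walltime stays at 0.<br>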
<br>[root@torque ~]# ssh salix37 "grep 1159111 /var/spool/torque/mom_logs/*"<br>/var/spool/torque/mom_logs/20100203:02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=<a href="http://salix37.farm.particle.cz/" target="_blank">salix37.farm.particle.cz</a> MSG=modify job failed, unknown job <a href="http://1159111.torque.farm.particle.cz/" target="_blank">1159111.torque.farm.particle.cz</a>), aux=0, type=ModifyJob, from <a href="mailto:PBS_Server@torque.farm.particle.cz" target="_blank">PBS_Server@torque.farm.particle.cz</a><br>
<br>I think the problem is somehow connected with the PBSE_UNKJOBID
error, but I haven't found any solution. It seems strange to me that
pbs_mom is staging in files but doesn't know the job...<br><br>Thank you for any help.<br>
<br>Best regards,<br><font color="#888888">Jan Svec<br>Institute of Physics AS CR</font></div>