[torqueusers] Unable to get a simple job unqueued...
skip at pobox.com
Mon Oct 11 10:40:46 MDT 2010
I'm having trouble getting a new Torque installation running on a different
subnet here. So far I have pbs_server, pbs_mom and maui all running on the
same host, known locally as druserver16.wackerbcp. It has a private IP
address, 192.168.66.214, which resolves in both directions:
% host druserver16
druserver16.wackerbcp.<TLD> is an alias for druserver16.vlan100.wackerbcp.<TLD>.
druserver16.vlan100.wackerbcp.<TLD> has address 192.168.66.214
druserver16.vlan100.wackerbcp.<TLD> mail is handled by 5 mailhost.wackerbcp.<TLD>.
druserver16.vlan100.wackerbcp.<TLD> mail is handled by 10 druserver16.vlan100.wackerbcp.<TLD>.
% host druserver16.wackerbcp
druserver16.wackerbcp.<TLD> is an alias for druserver16.vlan100.wackerbcp.<TLD>.
druserver16.vlan100.wackerbcp.<TLD> has address 192.168.66.214
druserver16.vlan100.wackerbcp.<TLD> mail is handled by 10 druserver16.vlan100.wackerbcp.<TLD>.
druserver16.vlan100.wackerbcp.<TLD> mail is handled by 5 mailhost.wackerbcp.<TLD>.
% host 192.168.66.214
214.66.168.192.in-addr.arpa domain name pointer druserver16.vlan100.wackerbcp.<TLD>.
("<TLD>" is our top-level domain.)
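For anyone who wants to repeat the same forward/reverse check from a script, here is a minimal Python sketch. It assumes only the standard socket module; the function name check_round_trip and its forward/reverse parameters are made up for illustration and are not part of Torque. It reports whether the name you asked about is among the names the address maps back to. A mismatch is only something to look at, not a confirmed cause of anything.

```python
import socket


def check_round_trip(name, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve name -> address -> canonical name and compare the results.

    `forward` and `reverse` default to the system resolver; they are
    parameters only so the check can be exercised without a network.
    """
    address = forward(name)                  # e.g. "192.168.66.214"
    canonical, aliases, _ = reverse(address)  # (hostname, aliaslist, ipaddrlist)

    def norm(host):
        # Compare case-insensitively and ignore a trailing root dot.
        return host.lower().rstrip(".")

    known = {norm(canonical)} | {norm(alias) for alias in aliases}
    return {
        "queried": name,
        "address": address,
        "canonical": canonical,
        "matches": norm(name) in known,
    }

# Example (needs a working resolver):
#   print(check_round_trip("druserver16.wackerbcp"))
```

With the output shown above, the short name and the reverse-lookup name differ (the latter includes "vlan100"), so "matches" would presumably come back False unless the resolver lists the short name as an alias.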
I successfully submitted a simple job:
echo 'echo hi' | qsub
but that job remains queued and won't run:
% qstat -1n
druserver16:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
2.druserver16        skipm    batch    STDIN            --     --    --  --     --    Q --    --
Looking in server_logs/YYYYMMDD I see this warning:
10/11/2010 11:28:29;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node druserver16.wackerbcp
but there is no further explanation of why the contact attempt failed. The
mom_logs/YYYYMMDD file shows:
10/11/2010 11:28:26;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server druserver16
10/11/2010 11:32:35;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.8, loglevel = 0
which looks okay to me.
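Since both logs use the same semicolon-separated layout, a small Python sketch like the one below can pull matching entries out of server_logs and mom_logs side by side. The field names (timestamp, code, daemon, obj_class, obj_name, message) are my own labels based on the lines quoted above, not official Torque terminology, and parse_line and find_messages are hypothetical helpers.

```python
from collections import namedtuple

# Labels are guesses from the sample lines, not official Torque field names.
LogRecord = namedtuple(
    "LogRecord", "timestamp code daemon obj_class obj_name message"
)


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    # Some fields are padded with spaces (e.g. "   pbs_mom"), so strip them.
    return LogRecord(*(part.strip() for part in parts))


def find_messages(lines, needle):
    """Return parsed records whose message contains `needle` (case-insensitive)."""
    needle = needle.lower()
    matches = []
    for line in lines:
        record = parse_line(line)
        if record is not None and needle in record.message.lower():
            matches.append(record)
    return matches
```

For example, passing the open server log file and the string "unable to contact" to find_messages would return the ALERT entry quoted above, with its timestamp available for comparison against the mom log.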
I don't know enough about a queued job to know if Maui has done its work at
that point or not, but I do see these warnings in the maui.log file:
10/11 11:28:41 WARNING: no resources detected
10/11 11:28:41 WARNING: no workload detected
Any suggestions about where to look for the barrier to execution?
Thanks,
--
Skip Montanaro - skip at pobox.com - http://www.smontanaro.net/