[torquedev] communication problem on Leopard
Glen Beane
glen.beane at gmail.com
Wed Mar 12 19:06:08 MDT 2008
Hey fellow torquedevs,
I've been investigating some communication problems that are popping up on
Leopard (Mac OS X 10.5) clusters all over the place.
The symptom: a job finishes on a node and pbs_mom sends the obit to
pbs_server. pbs_server logs that the obit has been received, and the job
goes into the E state. Then things just hang there until 900 seconds pass
and pbs_mom times out:
03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 9
to host 2474513430 has timed out
out after 900 seconds - closing stale connection
03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;[continued]
03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 10
to host 2474513430 has timed ou
t out after 900 seconds - closing stale connection
03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 11
to host 2474513430 has timed ou
t out after 900 seconds - closing stale connection
03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 12
to host 2474513430 has timed ou
t out after 900 seconds - closing stale connection
Any ideas what may be going wrong here?
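
To help identify which machine the stale connections point at, the host value in the log lines above can be decoded. This is a minimal sketch, assuming the number 2474513430 is an IPv4 address printed as an unsigned 32-bit integer; the byte order is also an assumption, so both readings are shown. The function names are hypothetical helpers and are not part of TORQUE.

```python
import ipaddress


def decode_host(value: int) -> str:
    """Read a 32-bit integer as an IPv4 address, most significant byte first."""
    return str(ipaddress.IPv4Address(value))


def decode_host_swapped(value: int) -> str:
    """Same integer, but with the four bytes reversed (opposite byte order)."""
    swapped = int.from_bytes(value.to_bytes(4, "big"), "little")
    return str(ipaddress.IPv4Address(swapped))


if __name__ == "__main__":
    host = 2474513430  # value copied from the pbs_mom log above
    print("most significant byte first:", decode_host(host))
    print("bytes reversed:             ", decode_host_swapped(host))
```

Whichever of the two results matches the pbs_server address (or another known node) would show which peer pbs_mom is waiting on when the 900-second timeout fires.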