[torquedev] Re: communication problem on Leopard
Glen Beane
glen.beane at gmail.com
Wed Mar 12 19:19:32 MDT 2008
On Wed, Mar 12, 2008 at 9:06 PM, Glen Beane <glen.beane at gmail.com> wrote:
> hey fellow torquedevs,
>
> I've been investigating communication problems that are showing up on
> Leopard (Mac OS X 10.5) clusters at a number of sites.
>
> The symptom: a job finishes on a node and pbs_mom sends the obit to
> pbs_server. pbs_server logs that the obit has been received and the job
> goes into the E state. Then everything hangs until 900 seconds pass
> and pbs_mom times out the connection:
>
> 03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 9 to host 2474513430 has timed out
> out after 900 seconds - closing stale connection
> 03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 10 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
> 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 11 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
> 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection 12 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
>
> any ideas what may be going wrong here?
>
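
When reading the timeout lines, it can help to know which machine the
numeric host refers to. A minimal sketch, assuming the number pbs_mom
prints is the peer's IPv4 address as an unsigned 32-bit integer with the
most significant octet first (that is an assumption, not something
confirmed in this thread); host_to_dotted_quad is a hypothetical helper
name, not part of TORQUE:

```python
import ipaddress


def host_to_dotted_quad(host: int) -> str:
    """Render a 32-bit host number as a dotted-quad IPv4 address.

    Assumption: the integer holds the address with the most significant
    octet first. If the log value was written in the opposite byte order,
    the octets will come out reversed.
    """
    return str(ipaddress.IPv4Address(host))


if __name__ == "__main__":
    # Loopback is used here only as a well-known illustrative value.
    print(host_to_dotted_quad(2130706433))
```

If the result does not match any node or the server, the byte-order
assumption is the first thing to doubt.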
Also note that in some cases we get this error:
03/12/2008 19:26:42;0080; pbs_mom;Job;569.host_removed_for_privacy;task 1 terminated
03/12/2008 19:26:42;0008; pbs_mom;Job;569.host_removed_for_privacy;job was terminated
03/12/2008 19:26:42;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
03/12/2008 19:26:42;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/12/2008 19:26:42;0001; pbs_mom;Job;569.host_removed_for_privacy;preobit_reply, unknown on server, deleting locally
In other cases we don't:
03/12/2008 19:30:35;0001; pbs_mom;Job;TMomFinalizeJob3;job 599.brown.chem.luc.edu started, pid = 9229
03/12/2008 19:30:35;0080; pbs_mom;Job;599.host_removed_for_privacy;task 1 terminated
03/12/2008 19:30:35;0008; pbs_mom;Job;599.host_removed_for_privacy;job was terminated
03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/12/2008 19:30:35;0080; pbs_mom;Job;599.host_removed_for_privacy;obit sent to server
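
To compare many jobs across a mom log, the two traces can be told apart
mechanically. What follows is a rough sketch, not anything from the
TORQUE source: it assumes each record is six semicolon-separated fields
(timestamp, event code, daemon, class, object name, message), which is
only inferred from the excerpts in this message, and the names
LogRecord, parse_record and preobit_outcome are made up for
illustration.

```python
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split one log line into fields; return None if it does not fit.

    Assumption: six ';'-separated fields. The split is capped at five
    separators so that semicolons inside the message are preserved.
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*(p.strip() for p in parts))


def preobit_outcome(lines: Iterable[str]) -> str:
    """Classify a job trace by the first decisive message it contains."""
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue  # wrapped fragments or unrelated text
        if "unknown on server" in rec.message:
            return "unknown_on_server"
        if rec.message.startswith("obit sent to server"):
            return "obit_sent"
    return "undetermined"


if __name__ == "__main__":
    sample = [
        "03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply",
        "03/12/2008 19:30:35;0080; pbs_mom;Job;599.host_removed_for_privacy;obit sent to server",
    ]
    print(preobit_outcome(sample))
```

The parser expects each record on a single line, so log text that has
been wrapped by a mail client needs to be rejoined first.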