[torquedev] Re: communication problem on Leopard

Glen Beane glen.beane at gmail.com
Wed Mar 12 19:19:32 MDT 2008


On Wed, Mar 12, 2008 at 9:06 PM, Glen Beane <glen.beane at gmail.com> wrote:

> hey fellow torquedevs,
>
> I've been investigating some communication problems that are showing up on
> Leopard (Mac OS X 10.5) clusters at a number of sites.
>
> The symptom: a job finishes on a node and pbs_mom sends the obit to
> pbs_server.  pbs_server logs that the obit has been received and the job
> goes into the E state.  Then everything hangs until 900 seconds pass
> and pbs_mom times out the connection:
>
> 03/12/2008 19:34:37;0001;   pbs_mom;Svr;pbs_mom;wait_request, connection 9
> to host 2474513430 has timed out
>  out after 900 seconds - closing stale connection
> 03/12/2008 19:34:37;0001;   pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001;   pbs_mom;Svr;pbs_mom;wait_request, connection
> 10 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
> 03/12/2008 19:36:17;0001;   pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001;   pbs_mom;Svr;pbs_mom;wait_request, connection
> 11 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
> 03/12/2008 19:36:17;0001;   pbs_mom;Svr;pbs_mom;[continued]
> 03/12/2008 19:36:17;0001;   pbs_mom;Svr;pbs_mom;wait_request, connection
> 12 to host 2474513430 has timed ou
> t out after 900 seconds - closing stale connection
>
>
>
>
> any ideas what may be going wrong here?
>
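
The "host 2474513430" in the timeout messages above appears to be an IPv4
address printed as a plain unsigned integer rather than in dotted form. A
minimal sketch for decoding it, assuming the value is the address read as a
32-bit big-endian number (host_int_to_dotted is a hypothetical helper name,
not part of TORQUE):

```python
import ipaddress


def host_int_to_dotted(host):
    """Convert an integer host value to dotted-quad notation.

    Assumption: the integer is an IPv4 address interpreted as a
    32-bit big-endian number, which is how ipaddress treats ints.
    """
    return str(ipaddress.IPv4Address(host))


if __name__ == "__main__":
    # Sanity check with a well-known address (0x7F000001).
    print(host_int_to_dotted(2130706433))  # 127.0.0.1
    # The value taken from the pbs_mom log lines quoted above.
    print(host_int_to_dotted(2474513430))
```

Decoding the value this way makes it possible to check whether the stale
connections are to pbs_server or to some other host.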


Also note that in some cases pbs_mom logs this error (the final "unknown on
server, deleting locally" record):

03/12/2008 19:26:42;0080;   pbs_mom;Job;569.host_removed_for_privacy;task 1
terminated
03/12/2008 19:26:42;0008;   pbs_mom;Job;569.host_removed_for_privacy;job was
terminated
03/12/2008 19:26:42;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/12/2008 19:26:42;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
while loop
03/12/2008 19:26:42;0001;
pbs_mom;Job;569.host_removed_for_privacy;preobit_reply, unknown on server,
deleting locally


In other cases it doesn't, and the obit is sent normally:

03/12/2008 19:30:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job
599.brown.chem.luc.edu started, pid = 9229
03/12/2008 19:30:35;0080;   pbs_mom;Job;599.host_removed_for_privacy;task 1
terminated
03/12/2008 19:30:35;0008;   pbs_mom;Job;599.host_removed_for_privacy;job was
terminated
03/12/2008 19:30:35;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/12/2008 19:30:35;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
while loop
03/12/2008 19:30:35;0080;   pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
03/12/2008 19:30:35;0080;   pbs_mom;Job;599.host_removed_for_privacy;obit
sent to server
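
Because the mail client wrapped the log records, the two traces are hard to
compare by eye. The sketch below rejoins wrapped lines and splits each record
on its semicolons. The field names are my own labels for the six
semicolon-separated fields seen in the excerpts, not official TORQUE names,
and parse_mom_log is a hypothetical helper.

```python
import re

# A record starts with "MM/DD/YYYY HH:MM:SS;"; any other non-empty line
# is treated as the wrapped continuation of the previous record.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Assumed labels for the six semicolon-separated fields.
FIELDS = ("timestamp", "event", "daemon", "obj_class", "obj_name", "message")


def parse_mom_log(text):
    """Rejoin wrapped lines, then split each record into named fields."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            records[-1] += " " + line
    parsed = []
    for rec in records:
        # Split at most 5 times so semicolons in the message survive.
        parts = [p.strip() for p in rec.split(";", 5)]
        if len(parts) == len(FIELDS):
            parsed.append(dict(zip(FIELDS, parts)))
    return parsed


SAMPLE = """\
03/12/2008 19:26:42;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/12/2008 19:26:42;0001;
pbs_mom;Job;569.host_removed_for_privacy;preobit_reply, unknown on server,
deleting locally
"""

if __name__ == "__main__":
    for rec in parse_mom_log(SAMPLE):
        print(rec["timestamp"], rec["obj_name"], "|", rec["message"])
```

Filtering the parsed records on obj_name then makes it easy to see which
preobit_reply path a given job took.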