[torquedev] Re: communication problem on Leopard
Glen Beane
glen.beane at gmail.com
Wed Mar 12 19:26:56 MDT 2008
On Wed, Mar 12, 2008 at 9:19 PM, Glen Beane <glen.beane at gmail.com> wrote:
>
>
> On Wed, Mar 12, 2008 at 9:06 PM, Glen Beane <glen.beane at gmail.com> wrote:
>
> > hey fellow torquedevs,
> >
> > I've been investigating some communication problems that have been
> > showing up on a number of Leopard (Mac OS X 10.5) clusters.
> >
> > The symptom: a job finishes on a node and pbs_mom sends the obit to
> > pbs_server. pbs_server logs that the obit has been received and the job
> > goes into the E state. Then everything hangs until 900 seconds pass
> > and pbs_mom times out the connection:
> >
> > 03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;wait_request, connection
> > 9 to host 2474513430 has timed out
> > out after 900 seconds - closing stale connection
> > 03/12/2008 19:34:37;0001; pbs_mom;Svr;pbs_mom;[continued]
> > 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection
> > 10 to host 2474513430 has timed ou
> > t out after 900 seconds - closing stale connection
> > 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
> > 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection
> > 11 to host 2474513430 has timed ou
> > t out after 900 seconds - closing stale connection
> > 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;[continued]
> > 03/12/2008 19:36:17;0001; pbs_mom;Svr;pbs_mom;wait_request, connection
> > 12 to host 2474513430 has timed ou
> > t out after 900 seconds - closing stale connection
> >
> >
> >
> >
> > any ideas what may be going wrong here?
> >
>
>
> Also note that in some cases we get this error:
>
> 03/12/2008 19:26:42;0080; pbs_mom;Job;569.host_removed_for_privacy;task
> 1 terminated
> 03/12/2008 19:26:42;0008; pbs_mom;Job;569.host_removed_for_privacy;job
> was terminated
> 03/12/2008 19:26:42;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/12/2008 19:26:42;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 03/12/2008 19:26:42;0001;
> pbs_mom;Job;569.host_removed_for_privacy;preobit_reply, unknown on server,
> deleting locally
>
>
> In other cases we don't, and the obit is sent to the server:
>
> 03/12/2008 19:30:35;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 599.brown.chem.luc.edu started, pid = 9229
> 03/12/2008 19:30:35;0080; pbs_mom;Job;599.host_removed_for_privacy;task
> 1 terminated
> 03/12/2008 19:30:35;0008; pbs_mom;Job;599.host_removed_for_privacy;job
> was terminated
> 03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/12/2008 19:30:35;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 03/12/2008 19:30:35;0080; pbs_mom;Svr;preobit_reply;in while loop, no
> error from job stat
> 03/12/2008 19:30:35;0080; pbs_mom;Job;599.host_removed_for_privacy;obit
> sent to server
>
Also, I'd like to add that the stdout/stderr files for the jobs stuck in the
E state are being delivered.
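
In case it helps anyone comparing logs from other Leopard clusters, here is a
minimal Python sketch (not part of TORQUE) that pulls the stale-connection
timeouts out of a pbs_mom log. The record layout and the names RECORD, TIMEOUT
and stale_connections are assumptions inferred from the excerpts above. It
expects each record on a single line, as in the log file itself rather than as
wrapped by the mail client.

```python
import re

# Assumed record layout, inferred from the excerpts in this thread:
#   date time;event code;daemon;class;object;message
RECORD = re.compile(
    r"(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r"(?P<code>[0-9a-fA-F]{4});\s*(?P<daemon>[^;]+);(?P<cls>[^;]+);"
    r"(?P<obj>[^;]+);(?P<msg>.*)"
)

# Matches the start of the timeout message; the logger may split the tail
# of a long message onto a following "[continued]" record.
TIMEOUT = re.compile(r"connection\s+(\d+)\s+to host\s+(\d+)\s+has timed")


def stale_connections(lines):
    """Yield (timestamp, connection number, host field) per timeout record."""
    for line in lines:
        record = RECORD.match(line.strip())
        if record is None:
            continue  # not a log record (blank line, wrapped tail, etc.)
        hit = TIMEOUT.search(record.group("msg"))
        if hit:
            # The host field is kept as the raw string printed by pbs_mom.
            yield record.group("stamp"), int(hit.group(1)), hit.group(2)
```

Feeding it an open log file, e.g. `list(stale_connections(open(path)))` with
`path` pointing at a mom log, gives one tuple per timed-out connection, which
makes it easy to line the timeouts up against the times the obits were sent.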