[torquedev] Bug in post_epilogue()
Dave Jackson
jacksond at clusterresources.com
Tue Aug 28 10:38:38 MDT 2007
Garrick,
I don't see a fix in trunk. In fact, the bug was first detected and
reported on a recent trunk-based distribution. Is it possible that a
fix did not get committed? Does the fix involve checking the mompost
return code and retrying from within scan_for_terminated()?
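
To make the question concrete, here is a minimal sketch of the kind of
return code check and retry I am asking about. It is not the actual
TORQUE source: only ji_mompost, post_epilogue() and scan_for_terminated()
come from this thread, while job_sketch, post_epilogue_sketch(),
scan_pass_sketch() and the two counter fields are hypothetical stand-ins.
It assumes the post routine returns 0 once the obit is delivered and
nonzero when the send fails.

```c
#include <stddef.h>

/* Simplified stand-in for the MOM job structure. Only the ji_mompost
 * member name comes from the thread; the other fields are hypothetical. */
struct job_sketch {
    int (*ji_mompost)(struct job_sketch *); /* pending post-processing step */
    int send_failures_left;                 /* hypothetical: simulated failed sends */
    int obit_sent;                          /* hypothetical: set once the obit is delivered */
};

/* Hypothetical stand-in for post_epilogue(): returns 0 when the obit is
 * delivered and nonzero when the send to the server fails. */
static int post_epilogue_sketch(struct job_sketch *pjob)
{
    if (pjob->send_failures_left > 0) {
        pjob->send_failures_left--;
        return -1;                          /* send failed; caller should retry */
    }
    pjob->obit_sent = 1;
    return 0;
}

/* Hypothetical stand-in for one pass of the scan loop. The behavior
 * reported in this thread corresponds to clearing ji_mompost
 * unconditionally; here it is cleared only on success, so a failed
 * send is attempted again on the next pass. */
static void scan_pass_sketch(struct job_sketch *pjob)
{
    if (pjob->ji_mompost == NULL)
        return;                             /* nothing pending for this job */

    if (pjob->ji_mompost(pjob) == 0)
        pjob->ji_mompost = NULL;            /* obit delivered; stop retrying */
    /* on failure, ji_mompost stays set and the next pass retries */
}
```

With two simulated send failures, the third pass delivers the obit and
only then clears the pointer. A real change would presumably also need a
retry limit or backoff so that an unreachable server does not cause the
routine to be called on every pass indefinitely.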
Dave
On Mon, 2007-08-27 at 20:40 -0700, Garrick Staples wrote:
> On Mon, Aug 27, 2007 at 09:36:46PM -0600, David B Jackson alleged:
> > post_epilogue() appears to have an issue: if the pbs_mom daemon fails
> > to send an obit message to the server on its first attempt, it does
> > not retry. From the server's point of view, jobs then appear to hang
> > for an extended period of time and cannot be killed. It appears this
> > routine has some code borrowed from scan_for_exiting(), which is
> > retried, but it does not have the required recall points to allow
> > the same approach to work.
> >
> > Basically, the question is: if post_epilogue()->client_to_svr() fails,
> > how does mom know to re-call post_epilogue()? scan_for_terminated()
> > will execute pjob->ji_mompost, which pushes the obit, but it does not
> > check the routine's return code and NULLs out pjob->ji_mompost in all
> > cases, preventing post_epilogue() from ever being run again.
> >
> > Are there suggestions for caching this request and making certain that
> > the obit makes it back to the server?
>
> Isn't that fixed in trunk?