[torquedev] Re: work around for jobs getting stuck in the E state
(for OS X Leopard)
Glen Beane
glen.beane at gmail.com
Wed Mar 26 19:33:07 MDT 2008
On Wed, Mar 26, 2008 at 8:23 PM, Glen Beane <glen.beane at gmail.com> wrote:
> I modified the subject to note this is for OS X Leopard users.
>
> On Wed, Mar 26, 2008 at 8:22 PM, Glen Beane <glen.beane at gmail.com> wrote:
>
> > After running configure, open src/include/pbs_config.h and search
> > for HAVE_WORDEXP. Comment out the #define HAVE_WORDEXP 1 (actually, to be
> > extra safe, I change this to #undef HAVE_WORDEXP).
> >
> > Run make and make install.
> >
> >
> > You can now run jobs on Leopard without them getting stuck in the E
> > state, and your stderr and stdout files should be delivered properly.
> >
> >
> >
> > I will be looking into whether this is a bug in wordexp on
> > Leopard. The code inside HAVE_WORDEXP seems to work on every other OS that
> > has wordexp.
> >
> >
> >
So I came to the conclusion that the problem was in the wordexp code by
noticing rcperr.xxx files in the torque/spool directory with "no such file
or directory" errors and garbage for the source file name, which made me
suspect something was getting corrupted. Then I noticed that one of the
pbs_mom children was hanging around while a job was stuck in the E state.
This was after the stdout file had been successfully copied. I attached a
debugger to the pbs_mom that was running as my user (it forks to the user
of the job to do the stdout/stderr copy), and saw that it was inside the
wordexp() function. I would run "continue" in gdb, wait a little while,
then interrupt the program and run "where", and it would still be in
wordexp().

Once I disabled all the HAVE_WORDEXP code, this problem went away completely.

So, for anyone experienced with the wordexp() code: can you think of some
test cases we could program up in a simple .c file that would tell us
whether this problem is in wordexp() on Leopard, so I can file a bug report
with Apple?

There is also an unresolved issue with Open Directory (jobs exiting
immediately after a getpwnam() failure in pbs_mom) that Apple is looking
at. This problem also appears to affect SGE, judging from a few posts I have
seen on various mailing lists. The strange thing is that I can't replicate
the getpwnam() failure with a simple C program I wrote, even on a node that
gives me that error every time I try to run a job on it.