[torquedev] job stuck in E on Leopard
Glen Beane
glen.beane at gmail.com
Fri Mar 14 22:58:51 MDT 2008
I am on a Leopard cluster that is experiencing the issue where jobs get
stuck in the E state for 15 minutes.
From what I can tell, the problem is due to some kind of failure when the
mom responds to the server's copy file request to copy the stdout and
stderr files back.
In almost every case the stdout file makes it back OK, but the stderr file
doesn't. From what I can tell, there is a for loop in req_cpyfile that
loops over all the files it should copy. req_cpyfile forks a child that
runs as the user, so this loop runs in that child.
To prove to myself that we should be copying both the .o and .e files in
this for loop, I copied the for statement directly above itself and created
a simple for loop with one line that just printed the contents of
pair->fp_rmt.
It printed the remote location for both the .o and .e files, so I know the
big for loop should be looping over both of these files, and for each of
them it should call sys_copy. sys_copy will fork, and the child will do an
execl to run the cp command.
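
For reference, the fork-and-execl pattern described here can be sketched
roughly as follows. This is not TORQUE's actual sys_copy or req_cpyfile;
copy_one_file and copy_all are hypothetical names, and the /bin/cp path is
an assumption. The point is only that a loop like this is expected to run
once per file, whether or not an individual copy succeeds.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not TORQUE's sys_copy): copy src to dst by running
 * cp in a forked child. Returns cp's exit code, or -1 if the fork or wait
 * failed or the child did not exit normally. */
int copy_one_file(const char *src, const char *dst)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace this process with cp (path is an assumption). */
        execl("/bin/cp", "cp", src, dst, (char *)NULL);
        _exit(127); /* only reached if execl itself failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical loop over all file pairs. Logs every iteration so it is
 * visible how many times the loop body ran. Returns the number of files
 * copied successfully. */
int copy_all(const char *src[], const char *dst[], int n)
{
    int copied = 0;
    for (int i = 0; i < n; i++) {
        fprintf(stderr, "copy %d of %d: %s -> %s\n", i + 1, n, src[i], dst[i]);
        if (copy_one_file(src[i], dst[i]) == 0)
            copied++;
    }
    return copied;
}
```

With two entries (one for the .o file and one for the .e file), copy_all
should log two iterations even if one of the copies fails.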
I added a print statement to the top of the big for loop that should be
calling sys_copy for the .o and .e files, and the loop body only gets
executed once. I also had sys_copy create a file in /tmp, uniquely named
based on its pid, containing the actual cp command being executed. I
could see that sys_copy was being called for my .o file, but never for my
.e file. It's like the child forked in req_cpyfile was exiting before it
could loop on to the .e file, but as I said, I confirmed that the same for
loop containing just a simple print statement would loop over both the .o
and .e files.
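
The per-pid trace file mentioned above, plus a check on how a forked child
actually ended, could look something like the sketch below. Both helpers
are hypothetical (trace_line and describe_status are not TORQUE functions,
and the /tmp/copytrace.<pid> name is made up for illustration). The status
check assumes the parent gets the child's status from waitpid.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical debugging helper: append one line to a file in /tmp named
 * after the calling process's pid. Returns 0 on success, -1 on failure. */
int trace_line(const char *msg)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/copytrace.%ld", (long)getpid());

    FILE *fp = fopen(path, "a");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%s\n", msg);
    return fclose(fp) == 0 ? 0 : -1;
}

/* Hypothetical helper: turn a status value from waitpid() into text, so
 * the parent can record whether the child exited or was killed. */
void describe_status(int status, char *buf, size_t len)
{
    if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else
        snprintf(buf, len, "unrecognized status 0x%x", (unsigned)status);
}
```

Calling trace_line at the top of each loop iteration in the child, and
describe_status in the parent after waitpid, would help distinguish a
child that exits early on its own from one that is killed by a signal.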