[torqueusers] torque/maui hanging bug(?)

Will Nolan will at headlandstech.com
Mon Aug 9 13:14:09 MDT 2010


Using torque-2.6.0-snap.201008061539 and maui-3.3, I encountered some strange behavior when scheduling jobs where the maui scheduler would get "hung up" on communication with the server.  I finally tracked it down to this message in the maui log file:

INFO:     starting iteration 50
MRMGetInfo()
MClusterClearUsage()
MRMClusterQuery()
MPBSClusterQuery(abc.xyz.com,RCount,SC)
ERROR:    cannot get node info: NULL

The observed behavior is that after several minutes maui is finally able to continue and begins scheduling jobs again.
I should mention that nscd is running on both machines, which had solved an earlier problem.  From previous Google searches I noticed that a few others had encountered this problem, but my guess is that it usually goes unnoticed: anyone with relatively long-running jobs would have no idea that the scheduler had gotten hung up.  We only noticed it because we were testing a fairly intensive set of short-running jobs that we expected to finish quickly.

I was able to reproduce this problem fairly regularly, so I attached to maui with gdb and found some code that I believe is responsible.  It turns out this code is in torque's src/lib/Libifl/pbsD_connect.c, around line 900:

  if ((encode_DIS_ReqHdr(sock, PBS_BATCH_Disconnect, pbs_current_user) == 0) &&
      (DIS_tcp_wflush(sock) == 0))
    {
    int atime;

    struct sigaction act;

    struct sigaction oldact;

    /* set alarm to break out of potentially infinite read */

    act.sa_handler = SIG_IGN;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGALRM, &act, &oldact);

    atime = alarm(pbs_tcp_timeout);

    /* NOTE:  alarm will break out of blocking read even with sigaction ignored */

    while (1)
      {
      /* wait for server to close connection */

      /* NOTE:  if read of 'sock' is blocking, request below may hang forever */

      if (read(sock, &x, sizeof(x)) < 1)
        break;
      }

    alarm(atime);

    sigaction(SIGALRM, &oldact, NULL);
    }

close(sock);

My understanding is that the client is, for some reason, trying to disconnect from the server.  To do so, it loops on read() from the (blocking) socket until the call returns less than 1, i.e. it expects the server to close the connection from its end (read returns 0) or the read to fail (read returns -1).  It sets the SIGALRM disposition and arms an alarm to effect a timeout on the read.  pbs_tcp_timeout was set to 9 (seconds) when I was attached.

However, the comment claiming that the alarm will still interrupt the blocking read when SIGALRM is set to SIG_IGN is incorrect.  An ignored signal is simply discarded, so the read never returns with EINTR.  I believe this may be implementation-specific, but it definitely is not the case on our version of Linux (fc12).  I also don't see why one would ever expect an ignored signal to interrupt a system call.  A simple test program proves the point:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

void handler(int signo)
{
  fprintf(stderr, "Caught signal #%d\n", signo);
}


int main(int argc_, char **argv_)
{
  struct sigaction act;
  struct sigaction oldact;

//  act.sa_handler = SIG_IGN;
  act.sa_handler = handler;
  sigemptyset(&act.sa_mask);
  act.sa_flags = 0;
  sigaction(SIGALRM, &act, &oldact);
  int atime = alarm(10);

  char buf[10];
  ssize_t br = read(0, buf, 10);

  fprintf(stderr, "Broke out of read with br = %ld\n", (long)br);
}

Run this program as-is and the read from stdin will get interrupted after 10 seconds, and the read will return -1.  However, switch the comment line to use SIG_IGN and the read will block indefinitely.

I don't understand pbs_server well enough to know why it takes so long to disconnect a client, but it is not unreasonable for there to be a very long delay there, as it is not a high-priority action.  However, I believe the code as written is incorrect, and causes schedulers like maui, which use torque's client libraries, to hang for an unreasonably long time.  Perhaps this is also the case for pbs_sched.

I made a change to our local copy of the source where I installed an empty signal handler (i.e. "void foo(int signo) {}", and set act.sa_handler = foo), along with some debugging printouts.  I recompiled torque and maui, and I was able to verify from the maui logs that the timeout now gets properly handled, and maui was able to continue gracefully.

In any case, I'd like to solicit some feedback:


- Do the developers agree with my assessment of the problem?
- If so, are there other spots in the code that need to be fixed as well?

Many thanks,
William
