[torquedev] Disappearance of /dev/null

Eygene Ryabinkin rea+maui at grid.kiae.ru
Thu Aug 5 11:50:40 MDT 2010


Me again.

Today I ran into a problem: on the majority of our nodes /dev/null
had become an empty regular file (instead of a character device).
This is a known problem. It hits our cluster from time to time, and
others are experiencing it as well: [1], [2].  There is even
a bug [3] in the Torque Bugzilla.
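
The symptom itself is easy to test for with stat().  Below is a minimal
sketch of such a check; the helper name is_char_device() is made up for
this illustration and is not part of Torque.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of Torque): returns 1 if path exists
 * and is a character device, 0 if it exists but is something else
 * (for example, an empty regular file), and -1 if stat() fails.
 */
int is_char_device(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    return S_ISCHR(sb.st_mode) ? 1 : 0;
}
```

On a healthy node is_char_device("/dev/null") should return 1; on an
affected node, where /dev/null is a regular file, it would return 0.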

I briefly examined the sources of pbs_mom and found that the function
preobit_reply() contains the following code, which is executed if the
variable deletejob is set to 1:
{{{
    if (!(pjob->ji_wattr[(int)JOB_ATR_interactive].at_flags & ATR_VFLAG_SET) ||
        (pjob->ji_wattr[(int)JOB_ATR_interactive].at_val.at_long == 0))
      {
      int x; /* dummy */

      /* do this if not interactive */
      unlink(std_file_name(pjob, StdOut, &x));
      unlink(std_file_name(pjob, StdErr, &x));
      unlink(std_file_name(pjob, Checkpoint, &x));
      }
}}}

The thing is that std_file_name() can return "/dev/null" if
the condition
{{{
(pjob->ji_wattr[(int)JOB_ATR_keep].at_flags & ATR_VFLAG_SET) &&
(strchr(pjob->ji_wattr[(int)JOB_ATR_keep].at_val.at_str, key))
}}}
evaluates to false.

I cannot tell for sure whether these two conditions exclude each other,
but it seems to me that they do not, so it may well happen that
std_file_name() really returns "/dev/null" and unlink() is called
on it.

By the way, the pbs_mom log in [3] shows that the unlink happens
after the message "top of preobit_reply" and after the message
"unknown on server, deleting locally".  And deletejob is set to
1 right when pbs_mom emits that last message.  So my scenario does
not look so improbable.

I propose a simple fix for this: route all unlink calls through a new
routine, pbs_unlink(), that will check that we are not deleting
"/dev/null" (or the like) and, if we are, will write a log message
(preferably with a stack trace; it looks like Linux supports this,
http://www.gnu.org/software/libc/manual/html_node/Backtraces.html)
that will be sent to syslog and the pbs_mom log.

That's not a long-term solution, but it will allow one to
- avoid these _very_ harmful errors: the node becomes a job sucker
  and destroyer, since every job ends with rc=-9 after such an accident;
- catch the cases where /dev/null is about to be destroyed and
  supply the developers with additional useful information.
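
To make the pbs_unlink() idea more concrete, here is a rough sketch of
what such a wrapper might look like.  It is not a patch against Torque:
the list of protected paths, the helper names is_protected() and
log_refusal(), and the choice of EPERM as the error code are all my
assumptions.  It logs to syslog only, since hooking into the pbs_mom
log depends on Torque internals, and the stack trace part is compiled
only with glibc.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <syslog.h>
#include <unistd.h>

#ifdef __GLIBC__
#include <execinfo.h>   /* backtrace(), backtrace_symbols(): glibc only */
#endif

/* Assumed list of paths that must never be removed; extend as needed. */
static const char *const protected_paths[] = {
    "/dev/null", "/dev/zero", "/dev/tty", NULL
};

/* Returns 1 if path must not be unlinked, 0 otherwise. */
static int is_protected(const char *path)
{
    struct stat sb;
    size_t i;

    /* Exact name match: also catches /dev/null that is already a file. */
    for (i = 0; protected_paths[i] != NULL; i++)
        if (strcmp(path, protected_paths[i]) == 0)
            return 1;

    /* "Or the like": refuse to remove any device node.  lstat() is
     * used so that a symlink pointing to a device can still be removed. */
    if (lstat(path, &sb) == 0 &&
        (S_ISCHR(sb.st_mode) || S_ISBLK(sb.st_mode)))
        return 1;

    return 0;
}

/* Log the refused call and, on glibc, the call stack that led to it. */
static void log_refusal(const char *path)
{
    syslog(LOG_ERR, "pbs_unlink: refusing to unlink %s", path);
#ifdef __GLIBC__
    {
        void *frames[32];
        int n = backtrace(frames, 32);
        char **names = backtrace_symbols(frames, n);
        int i;

        for (i = 0; names != NULL && i < n; i++)
            syslog(LOG_ERR, "pbs_unlink:   %s", names[i]);
        free(names);    /* one free() releases the whole array */
    }
#endif
}

/* Drop-in replacement for unlink(): same return convention. */
int pbs_unlink(const char *path)
{
    if (path == NULL) {
        errno = EINVAL;
        return -1;
    }
    if (is_protected(path)) {
        log_refusal(path);
        errno = EPERM;
        return -1;
    }
    return unlink(path);
}
```

With this in place, the three unlink() calls in preobit_reply() would
become pbs_unlink() calls.  Note that the name comparison is literal, so
a spelling such as "/dev//null" is caught only by the device-node check,
that is, only while /dev/null is still a character device.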

I will try to come up with a patch, but maybe only after the weekend,
since I am currently in no mood to leave my cluster in an
experimental state over Saturday and Sunday ;))

Thanks for your time.

[1] http://permalink.gmane.org/gmane.comp.clustering.torque.user/7844
[2] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1007&L=LCG-ROLLOUT&F=&S=&P=100149
[3] http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

