[torquedev] Disappearance of /dev/null
rea+maui at grid.kiae.ru
Thu Aug 5 11:50:40 MDT 2010
Today I ran into a problem: on the majority of our nodes /dev/null
had become an empty regular file (rather than a character device).
This is a known problem: it hits our cluster from time to time,
and others are experiencing it as well. There is even
a bug in the Torque Bugzilla.
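For spotting affected nodes, a check along these lines might help. This is only a minimal sketch: is_char_device() is a hypothetical helper name, not something from the Torque sources, and it only tests the file type, not the device numbers.

```c
#define _POSIX_C_SOURCE 200809L  /* for lstat() under strict -std=c99/c11 */
#include <sys/stat.h>

/* Hypothetical helper (not part of Torque): returns 1 if path is a
 * character device, 0 if it is anything else (for example an empty
 * regular file), and -1 if the path cannot be examined at all. */
int is_char_device(const char *path)
{
    struct stat st;

    /* lstat() so that a symlink put in place of the device is not followed */
    if (lstat(path, &st) != 0)
        return -1;

    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

On a healthy node is_char_device("/dev/null") should return 1; on a node hit by this problem it would return 0.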
I briefly examined the pbs_mom sources and found that the function
preobit_reply() contains the following code, which is executed if the
variable deletejob is set to 1:
    if (!(pjob->ji_wattr[(int)JOB_ATR_interactive].at_flags & ATR_VFLAG_SET) ||
        (pjob->ji_wattr[(int)JOB_ATR_interactive].at_val.at_long == 0))
      {
      int x; /* dummy */

      /* do this if not interactive */
      unlink(std_file_name(pjob, StdOut, &x));
      unlink(std_file_name(pjob, StdErr, &x));
      unlink(std_file_name(pjob, Checkpoint, &x));
      }
The thing is that std_file_name() can return "/dev/null" if
(pjob->ji_wattr[(int)JOB_ATR_keep].at_flags & ATR_VFLAG_SET) &&
evaluates to false.
I cannot judge whether these two conditions are mutually exclusive,
but it seems to me that they are not, so it may well happen that
std_file_name() really returns "/dev/null" and unlink() is
called on it.
By the way, the pbs_mom log from one of the reports mentioned above
says that the unlink happens after the message "top of preobit_reply"
and after the message "unknown on server, deleting locally". And
deletejob is set to 1 at the point where pbs_mom emits that last
message. So my scenario does not look so improbable.
I propose a simple fix for this: route all unlink calls through a new
routine, pbs_unlink(), that will check that we are not deleting
"/dev/null" (or the like) and otherwise will write a log message
(preferably with a stack trace; it looks like Linux supports this)
that will be sent to both syslog and the pbs_mom log.
That's not a long-term solution, but it will allow one to
- avoid these _very_ harmful errors: the node becomes a job sucker
and destroyer, since every job ends with rc=-9 after such an accident;
- catch the cases where /dev/null is about to be destroyed and
supply the developers with additional useful information.
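To make the proposal concrete, here is a rough sketch of what pbs_unlink() could look like. It is not an actual patch. The list of protected paths and the helper names is_protected_path() and log_stack_trace() are my own assumptions. The stack trace relies on glibc's backtrace() from <execinfo.h>, so it is compiled in only when glibc is detected. Forwarding the message to the pbs_mom log is left out, and only syslog is used.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

#ifdef __GLIBC__
#include <execinfo.h>  /* backtrace(), backtrace_symbols(): glibc extension */
#endif

/* Assumed list of paths that must never be removed; adjust as needed. */
static const char *const protected_paths[] = {
    "/dev/null", "/dev/zero", "/dev/tty", NULL
};

/* Returns 1 if path literally matches a protected path, 0 otherwise. */
static int is_protected_path(const char *path)
{
    size_t i;

    for (i = 0; protected_paths[i] != NULL; i++)
        if (strcmp(path, protected_paths[i]) == 0)
            return 1;
    return 0;
}

/* Writes the current call stack to syslog (no-op without glibc). */
static void log_stack_trace(void)
{
#ifdef __GLIBC__
    void *frames[32];
    int n = backtrace(frames, 32);
    char **names = backtrace_symbols(frames, n);
    int i;

    if (names == NULL)
        return;
    for (i = 0; i < n; i++)
        syslog(LOG_ERR, "pbs_unlink:   %s", names[i]);
    free(names);
#endif
}

/* Drop-in replacement for unlink(): refuses protected paths with
 * errno = EPERM and logs the attempt; otherwise calls unlink(). */
int pbs_unlink(const char *path)
{
    if (path == NULL) {
        errno = EINVAL;
        return -1;
    }
    if (is_protected_path(path)) {
        syslog(LOG_ERR, "pbs_unlink: refusing to unlink %s", path);
        log_stack_trace();
        errno = EPERM;
        return -1;
    }
    return unlink(path);
}
```

The calls in preobit_reply() would then become pbs_unlink(std_file_name(pjob, StdOut, &x)) and so on. Note that the comparison is purely textual, so an alias such as "/dev/../dev/null" would slip through; a stricter version could compare device and inode numbers instead.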
I will try to come up with a patch, but maybe after the weekend,
since I am currently not in the mood to leave my cluster in an
experimental state over Saturday and Sunday ;))
Thanks for your time.
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"