<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Hi all,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I’m using a cluster that uses Torque as the batch system. About half of the time, checkpointing with DMTCP fails while copying the temporary output buffer/file:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>cp -f /var/spool/torque/spool/<i>jobid</i>.<i>myserver</i>.OU /<i>checkpoint_dir</i>/ckpt_<i>myprog</i>_52b886013bb1c112-27763-51060104_files/<i>jobid</i>.<i>myserver</i>.OU_99001<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I’m using dmtcp_checkpoint (v1.2.6) with the --checkpoint-open-files option. All I know is that the copy command fails, not why (though I know the destination directory exists and it does work about half the time). Can anyone explain why the OU file might not exist at the time of checkpointing, or what else might be the cause of the failure?<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Thanks,<o:p></o:p></p><p class=MsoNormal>Kit<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p></div></body></html>