Hi Danny,<br> Is there a need to checkpoint the mpirun/mpiexec processes at all? (Please correct me if I am wrong.) They only spawn the MPI program on the defined nodes. To restart a checkpointed MPI program, a fresh instance of mpirun, mpiexec or pbsdsh can be used.<br>
<br><div class="gmail_quote">On Mon, Jul 5, 2010 at 8:09 PM, Danny Sternkopf <span dir="ltr"><<a href="mailto:dsternkopf@hpce.nec.com">dsternkopf@hpce.nec.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi,<br>
<br>
thank you Peter!<br>
<br>
I have the impression that PBS Mom calls the checkpoint and restart<br>
scripts only once. Therefore the scripts must take care of the total<br>
batch job and all processes belonging to it, right?<br>
<br>
Do these scripts really work for you?<br>
<br>
I see that the MPI processes are checkpointed, but mpirun and the<br>
batch script keep running.<br>
<br>
Btw. mpiexec and mpirun are synonyms for orterun. They should<br>
be matched in the pgrep as well.<br>
<br>
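A minimal sketch of that matching, written in Python for illustration only (the attached scripts are not shown in this thread). It walks up a process table and reports whether any ancestor is one of the three launcher names. The table layout and the name has_mpi_launcher_ancestor are assumptions made for this example.<br>

```python
# Hypothetical helper: the function name and the table layout are
# assumptions for illustration, not taken from the attached scripts.
LAUNCHER_NAMES = {"orterun", "mpirun", "mpiexec"}


def has_mpi_launcher_ancestor(pid, table):
    """Return True if any ancestor of pid is an Open MPI launcher.

    table maps pid -> (parent pid, command name), for example built
    from the output of `ps -e -o pid=,ppid=,comm=`.
    """
    seen = set()  # guards against a malformed table containing a cycle
    current = pid
    while current in table and current not in seen:
        seen.add(current)
        parent, _name = table[current]
        if parent in table and table[parent][1] in LAUNCHER_NAMES:
            return True
        current = parent
    return False
```

With all three names in the set, a job started through mpiexec or mpirun is treated the same way as one started through orterun.<br>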
Best regards,<br>
<font color="#888888"><br>
Danny<br>
</font><div><div></div><div class="h5"><br>
On 7/1/2010 1:29 PM, Peter Kruse wrote:<br>
> Hello,<br>
><br>
> I attach the two scripts that we use. They are based on the scripts<br>
> found on<br>
> <a href="http://www.clusterresources.com/products/torque/docs/2.6jobcheckpoint.shtml" target="_blank">http://www.clusterresources.com/products/torque/docs/2.6jobcheckpoint.shtml</a><br>
> But with these additions:<br>
><br>
> blcr_checkpoint_script:<br>
><br>
> 1. support ompi-checkpoint and cr_checkpoint<br>
> it checks if orterun is a parent process, if so uses ompi-checkpoint<br>
> otherwise uses cr_checkpoint<br>
> 2. for ompi-checkpoint the checkpoint directory cannot be given on<br>
> commandline, orterun uses the parameter snapc_base_global_snapshot_dir<br>
> which is already set. Therefore ignore the $checkpointDir/$checkpointName.<br>
> Instead store a mapping of $JOBID:$Snapref in a file (where $Snapref is<br>
> returned by the ompi-checkpoint command). Additionally store the<br>
> node geometry which is used in a script that restarts the job.<br>
><br>
> blcr_restart_script:<br>
><br>
> 3. if the given $jobid is found in the jobid2ompi_snap_ref file then<br>
> use "ompi-restart $ref" otherwise use cr_restart with the given<br>
> checkpointFile.<br>
><br>
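The mapping and the restart decision in points 2 and 3 could look roughly like the sketch below. It is in Python for illustration only and is not the attached script. The function names are hypothetical, and taking the last matching line as the current snapshot is an assumption.<br>

```python
import os


def record_snapshot(map_file, job_id, snap_ref):
    """Append one 'JOBID:Snapref' line to the mapping file."""
    with open(map_file, "a") as handle:
        handle.write(f"{job_id}:{snap_ref}\n")


def lookup_snapshot(map_file, job_id):
    """Return the snapshot reference stored for job_id, or None.

    Assumption: if a job was checkpointed several times, the last
    matching line is the one to restart from.
    """
    if not os.path.exists(map_file):
        return None
    found = None
    with open(map_file) as handle:
        for line in handle:
            key, sep, ref = line.rstrip("\n").partition(":")
            if sep and key == job_id:
                found = ref
    return found


def restart_command(map_file, job_id, checkpoint_file):
    """Choose ompi-restart when a mapping exists, else cr_restart."""
    ref = lookup_snapshot(map_file, job_id)
    if ref is not None:
        return ["ompi-restart", ref]
    return ["cr_restart", checkpoint_file]
```

The returned list is only the command and its main argument; any further options the real scripts pass are not shown here.<br>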
> ql-restart-torque-ompi-job:<br>
><br>
> this script is meant to be run in a Torque Job, so that<br>
> $PBS_NODEFILE is set. If given the JobID to restart<br>
> it will first check if the node geometry matches the one<br>
> of that job, if it matches then calls ompi-restart with<br>
> the snapshot reference.<br>
><br>
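One way to picture the node geometry check is sketched below, in Python for illustration only. It assumes that "geometry" means the number of slots per host as listed in $PBS_NODEFILE (one line per slot), and that a match requires the same hosts with the same counts. The real script may compare something looser, such as the counts alone. All function names are hypothetical.<br>

```python
from collections import Counter


def read_geometry(nodefile_path):
    """Count slots per host from a node file with one host name per line."""
    with open(nodefile_path) as handle:
        return Counter(line.strip() for line in handle if line.strip())


def format_geometry(geometry):
    """Serialise a geometry as 'host:count,...' sorted by host name."""
    return ",".join(f"{host}:{count}" for host, count in sorted(geometry.items()))


def geometry_matches(saved_text, nodefile_path):
    """Compare the geometry saved at checkpoint time with the current job."""
    return saved_text == format_geometry(read_geometry(nodefile_path))
```

In a running job the second argument would be the path held in $PBS_NODEFILE, and the saved text would come from whatever file the checkpoint script wrote at checkpoint time.<br>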
> I hope they may be useful for you.<br>
><br>
> Regards,<br>
><br>
> Peter<br>
><br>
</div></div><div><div></div><div class="h5">
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Regards--<br>Rishi Pathak<br>National PARAM Supercomputing Facility<br>Center for Development of Advanced Computing(C-DAC)<br>Pune University Campus,Ganesh Khind Road<br>
Pune-Maharastra<br>