Several checkpointing libraries have been tried, with rather little success. In the end, I have verified that the following work:
The Condor libraries
The ckpt package by Victor C. Zandy
Both work with RH7.3, but not with RH9.
Information copied from http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html
NOTE: Everything seems to work correctly on RedHat 7.3, but NOT on RH9. The current Condor package for RH9 only supports the "vanilla" universe, so it does not include the checkpointing libraries.
User-Level Checkpointing using Condor libraries
Overview
Sun Grid Engine provides general support for checkpointing, which can be categorized into three types:
Application level (checkpointing is hard coded in the application)
User level (using checkpointing libraries)
Kernel level (OS provided checkpointing)
This application note will relate to item 2 above. The User level checkpointing library selected here is the Condor standalone library from the University of Wisconsin's Condor system project at http://www.cs.wisc.edu/condor. The web site contains further information about Condor.
NOTE: The reader is warned here about the set of restrictions that come with user-level checkpointing libraries. The following web page outlines the current restrictions of the Condor checkpointing libraries:
http://www.cs.wisc.edu/condor/manual/v6.2/1_4Current_Limitations.html
Condor Checkpointing library
Condor is a full job management system that includes a user level checkpointing library. The static Condor checkpointing library, when linked, provides a layer around the application to be checkpointed. The library intercepts the checkpointing signal and attempts to save the state of the application, together with system information, in a checkpoint file whose location is determined by the user at configuration time. The Condor checkpointing library can be used either as an integral part of the Condor system or standalone with a separate resource management system such as the Sun Grid Engine product.
Standalone checkpointing library setup
This document (or How To) will only cover the standalone scenario that will be used with the checkpointing facility of the Sun Grid Engine software. In this case, there is no need to install the whole Condor software because we only care about the following:
a) the entire Condor "lib" subdirectory
b) the condor_compile command (from the bin subdirectory)
The condor_compile shell script file needs to be modified at the following line:
CONDOR_LIBDIR=`condor_config_val LIB`
to:
CONDOR_LIBDIR="install_path_of_condor_lib"
where "install_path_of_condor_lib" is the path to the entire contents of the Condor "lib" subdirectory. The above setup allows sequential applications to be checkpointed using the user level checkpointing Condor libraries.
Checkpointed Application Preparation
A regular application that needs to be checkpointed does not require any source level modifications. It only needs to be re-linked with the Condor checkpointing libraries to take advantage of the Checkpointing and Remote System Calls. An easy mechanism which is provided by Condor helps to perform the relink operation by using the condor_compile command as follows:
condor_compile -condor_standalone command [options/files....]
where command is any of cc, f77, f90, ld, etc., and [options/files....] are the normal arguments used by the compiler/linker.
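For instance, Fortran sources or already compiled objects can be re-linked the same way (the file names below are placeholders chosen for illustration):

condor_compile -condor_standalone f77 -o mysim_ckpt mysim.f
condor_compile -condor_standalone cc -o myapp_ckpt myapp.o -lm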
Configuring SGE's checkpointing environment
Configure host queue to support checkpointing
The application can be set up to checkpoint when the sge_execd daemon is shut down and when the job is suspended.
Configure the job to be rescheduled in case it is suspended.
The checkpoint signal was set to SIGUSR2, which Condor uses to checkpoint the application and let it continue its normal execution. Condor also recognizes SIGTSTP, which checkpoints the application and then exits.
Finally, 'userdefined' checkpointing was set. Userdefined checkpointing means that the application periodically writes checkpoints without any interference by SGE. At restart time the application will continue at the last checkpoint.
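Before wiring this into SGE, the signal behaviour described above can be checked by hand on a binary linked with the standalone library. This is only a sketch of a typical manual test, not part of the original HOWTO (example_ckpt is the binary built in the example further below):

./example_ckpt &     # start the standalone-linked binary
kill -USR2 $!        # write a checkpoint; the program keeps running
kill -TSTP $!        # write a checkpoint and exit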
Submitting a user level checkpointing job
The submission of a checkpointing job in a SGE environment is similar to the submission of a regular job with the addition of the following options to the qsub command:
-ckpt ckpt_name (the name of the checkpointing environment)
-c [m|s|n|x] (the occasion on which the job should be checkpointed).
EXAMPLE
Assumptions:
I assume that the Condor libraries and the condor_compile command are installed on the system.
The steps described below all have equivalent functions from the qmon(1) GUI. For the sake of simplicity the example is illustrated using the command line.
I used a cluster of 2 nodes named hpc0 and hpc1.
example.sh is the job script that contains the checkpointed application example_ckpt (source: example.c)
# cat example.sh
#!/bin/sh
cd /home/omar/wd
./example_ckpt
exit
What follows is a series of steps to run our example:
Step 1: configure SGE for checkpointing.
1a. Create a checkpointing environment
From the command line type the following:
Type: qconf -ackpt condor_ckpt
An editing session will come up that looks like this:
ckpt_name          condor_ckpt
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /home/omar/tmp
queue_list         NONE
signal             none
when               sx
Edit the above to make them look like the following:
ckpt_name          condor_ckpt
interface          USERDEFINED
ckpt_command       /home/omar/wd/SGE/checkpoint.sh
migr_command       /home/omar/wd/SGE/migrate.sh
restart_command    /home/omar/wd/SGE/restart.sh
clean_command      /home/omar/wd/SGE/clean.sh
ckpt_dir           /home/omar/tmp
queue_list         hpc0.q hpc1.q
signal             USR2
when               sx
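After saving the editing session, the new environment should be visible to qconf:

qconf -sckptl              # should now list condor_ckpt
qconf -sckpt condor_ckpt   # prints the definition entered above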
1b. Configure the queues hpc0.q and hpc1.q for checkpointing
qconf -mattr queue qtype "CHECKPOINTING BATCH INTERACTIVE" hpc0.q qconf -mattr queue qtype "CHECKPOINTING BATCH INTERACTIVE" hpc1.q
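The new queue type can be verified for each queue (the exact output format may vary between SGE versions):

qconf -sq hpc0.q | grep qtype
qconf -sq hpc1.q | grep qtype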
NOTE: The contents of checkpoint.sh, migrate.sh and clean.sh are:
# cat checkpoint.sh
#!/bin/sh
#$ -S /bin/sh
# The application spills its pid in the pid file.
PID=`cat /home/omar/wd/SGE/pid`
kill -USR2 $PID
exit

# cat migrate.sh
#!/bin/sh
#$ -S /bin/sh
cd /home/omar/wd/SGE
./example_ckpt -_condor_restart /home/omar/wd/SGE/ss_condor.ckpt
exit

# cat clean.sh
#!/bin/sh
#$ -S /bin/sh
cd /home/omar/wd
# Remove the checkpoint saving file
/usr/bin/rm -f /home/omar/wd/SGE/.ckpt
# Remove the error log file
/usr/bin/rm -f /home/omar/wd/SGE/submit_ss.e*
# Remove the output file
/usr/bin/rm -f /home/omar/wd/SGE/submit_ss.o*
exit
NOTE: The restart.sh is the same as migrate.sh in our case. The clean.sh should be tailored to what the user wants to remove because some of the files will be needed for debugging.
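Note that checkpoint.sh reads the application's pid from /home/omar/wd/SGE/pid, so something has to create that file. The original example.c is not shown in the HOWTO; one possible way to populate the pid file (an assumption for illustration, not the author's method) is a variant of example.sh that starts the binary in the background and records its pid:

#!/bin/sh
cd /home/omar/wd
./example_ckpt &
PID=$!
echo $PID > /home/omar/wd/SGE/pid    # pid read later by checkpoint.sh (kill -USR2)
wait $PID
exit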
Step 2: Prepare the application for checkpointing
2a. Modify the line in the condor_compile script from:
CONDOR_LIBDIR=`condor_config_val LIB`
to
CONDOR_LIBDIR=/condor/lib
"/condor/lib" is the install path to the entire contents of the Condor "lib" subdirectory.
2b. Execute the following command:
condor_compile -condor_standalone cc -o example_ckpt example.c
Step 3: Submit the checkpointed application.
3a. Execute the following command:
qsub -cwd -v TERM -ckpt condor_ckpt -c x example.sh
I used the option "-c x" because I wanted the job to be checkpointed only when it is suspended. In this example, the job is suspended by suspending the queue on which it is running (see steps 4 and 5). The -c option accepts other occasion specifiers; the reader is urged to consult the qsub(1) manual page.
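For example, to checkpoint only when sge_execd is shut down, the same job could be submitted with (same file names as above):

qsub -cwd -v TERM -ckpt condor_ckpt -c s example.sh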
Step 4: Suspend hpc0.q queue on which the checkpointing job is running.
4a. Execute the following command to suspend hpc0.q:
qmod -s hpc0.q
The job will migrate to the other queue hpc1.q.
Use "qstat -f" to check the new status of jobs/queues.
4b. Execute the following command to unsuspend hpc0.q:
qmod -us hpc0.q
Step 5: Suspend hpc1.q queue on which the checkpointing job is running.
5a. Execute the following command to suspend hpc1.q:
qmod -s hpc1.q
The job will migrate to the other queue, hpc0.q.
Use "qstat -f" to check the new status of jobs/queues.
5b. Execute the following command to unsuspend hpc1.q:
qmod -us hpc1.q
NOTE: Again, the above command line steps can all be substituted by functions executed from the qmon(1) GUI tool.
The ckpt package by Victor C. Zandy is available at www.cs.wisc.edu/~zandy/ckpt.
NOTE: Everything seems to work correctly on RedHat 7.3, but NOT on RH9. I have asked the author for help.