Checkpointing

Several checkpointing libraries have been tried, with very little success. In the end, I have verified that the following work:

The Condor libraries
The ckpt package by Victor C. Zandy

Both work on RH7.3, but not on RH9.

SGE with Condor

Information copied from http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html

NOTE: Everything seems to work correctly on RedHat 7.3, but NOT on RH9. The current Condor package for RH9 only uses the "vanilla" universe, so it does not include the checkpointing libraries.

User-Level Checkpointing using Condor libraries

Overview

Sun Grid Engine provides general support for checkpointing, which can be categorized into three types:

  1. Application level (checkpointing is hard coded in the application)

  2. User level (using checkpointing libraries)

  3. Kernel level (OS provided checkpointing)

This application note covers item 2 above. The user-level checkpointing library selected here is the Condor standalone library from the University of Wisconsin's Condor project at http://www.cs.wisc.edu/condor. The web site contains further information about Condor.

NOTE: User-level checkpointing libraries come with a set of restrictions. The following web page outlines the current restrictions of the Condor checkpointing libraries:

http://www.cs.wisc.edu/condor/manual/v6.2/1_4Current_Limitations.html

Condor Checkpointing library

Condor is a full job management system that includes a user-level checkpointing library. The static Condor checkpointing library, when linked, provides a layer around the application to be checkpointed. The library intercepts the checkpointing signal and attempts to save the state of the application, together with system information, in a checkpoint file whose location the user sets at configuration time. The library can be used either as an integral part of the Condor system or standalone with a separate resource management system such as Sun Grid Engine.

Standalone checkpointing library setup

This how-to only covers the standalone scenario, used together with the checkpointing facility of the Sun Grid Engine software. In this case there is no need to install the whole Condor software, because only the following are needed:

a) the entire Condor "lib" subdirectory

b) the condor_compile command (from the bin subdirectory)

The condor_compile shell script file needs to be modified at the following line:

CONDOR_LIBDIR=`condor_config_val LIB`

to:

CONDOR_LIBDIR="install_path_of_condor_lib" 

where "install_path_of_condor_lib" is the path to the entire contents of the Condor "lib" subdirectory. The above setup allows sequential applications to be checkpointed using the user level checkpointing Condor libraries.
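
The edit described above can also be scripted. The following is a minimal sketch, assuming a POSIX shell with sed and assuming that the assignment starts at the beginning of a line in condor_compile. The function name patch_condor_libdir and the .orig backup suffix are illustrative and not part of Condor. The install path must not contain the characters '|', '&' or '\', which sed would treat specially.

```shell
#!/bin/sh
# patch_condor_libdir SCRIPT LIBDIR   (hypothetical helper, not shipped with Condor)
# Rewrites the CONDOR_LIBDIR= assignment in a copy of condor_compile so that
# it points at a fixed directory instead of calling condor_config_val.
patch_condor_libdir() {
    script=$1
    libdir=$2
    # Keep the unmodified script next to the patched one.
    cp "$script" "$script.orig" || return 1
    # Replace the whole assignment line; '|' is used as the sed delimiter
    # so that the slashes in the path need no escaping.
    sed "s|^CONDOR_LIBDIR=.*|CONDOR_LIBDIR=\"$libdir\"|" "$script.orig" > "$script"
}
```

For example, `patch_condor_libdir ./condor_compile /condor/lib` would patch a local copy of the script and leave the original in ./condor_compile.orig.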

Checkpointed Application Preparation

A regular application that needs to be checkpointed does not require any source-level modifications. It only needs to be re-linked with the Condor checkpointing libraries to take advantage of checkpointing and remote system calls. Condor provides the condor_compile command to perform the relink, as follows:

condor_compile -condor_standalone command [options/files....] 

where command is any of cc, f77, f90, ld, etc., and [options/files....] are the normal arguments used by the compiler/linker.

Configuring SGE's checkpointing environment

  1. Configure the host queues to support checkpointing.

  2. Set up the application to checkpoint when the sge_execd daemon is shut down and when the job is suspended.

  3. Configure the job to be rescheduled in case it is suspended.

  4. The checkpoint signal should be set to SIGTSTP, because Condor reacts to it by checkpointing the application and exiting. Condor also handles SIGUSR2, on which it checkpoints the application and lets it continue its normal execution; the example below uses USR2.

  5. Finally, set 'userdefined' checkpointing. Userdefined checkpointing means that the application periodically writes checkpoints without any interference by SGE. At restart time the application will continue at the last checkpoint.
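
Item 4 distinguishes two Condor signals. A minimal sketch of a helper that chooses between them follows. The function names and the mode words "vacate" and "continue" are made up for illustration; only the two signals and their effects come from the text above.

```shell
#!/bin/sh
# condor_ckpt_signal MODE   (hypothetical helper)
# Prints the signal name for the requested behaviour:
#   vacate   -> TSTP  (checkpoint, then exit)
#   continue -> USR2  (checkpoint, then keep running)
condor_ckpt_signal() {
    case $1 in
        vacate)   echo TSTP ;;
        continue) echo USR2 ;;
        *)        echo "unknown mode: $1" >&2; return 2 ;;
    esac
}

# condor_ckpt PID MODE   (hypothetical helper)
# Sends the matching signal to a process linked with the Condor library.
condor_ckpt() {
    sig=$(condor_ckpt_signal "$2") || return 2
    kill -s "$sig" "$1"
}
```

A checkpoint script could then call, for instance, `condor_ckpt "$PID" continue` instead of hard-coding the signal name.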

Submitting a user level checkpointing job

Submitting a checkpointing job in an SGE environment is similar to submitting a regular job, with the addition of the following options to the qsub command:

  1. -ckpt ckpt_name (the name of the checkpointing environment to use)

  2. -c [m|s|n|x]
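
To show how the two options combine on a command line, here is a minimal dry-run sketch. The function name ckpt_qsub_cmd is hypothetical, and it only prints the qsub command rather than running it. The meanings given for the letters in the comments reflect my reading of qsub(1) and should be checked against the manual page of the installed SGE version.

```shell
#!/bin/sh
# ckpt_qsub_cmd CKPT_ENV OCCASION SCRIPT   (hypothetical helper)
# Prints a qsub command line for a checkpointing job without submitting it.
# OCCASION is one of the letters accepted by -c (as understood from qsub(1)):
#   m  checkpoint at the minimum CPU interval
#   s  checkpoint when sge_execd is shut down
#   n  do not checkpoint
#   x  checkpoint when the job is suspended
ckpt_qsub_cmd() {
    case $2 in
        m|s|n|x) ;;
        *) echo "unsupported -c value: $2" >&2; return 2 ;;
    esac
    echo "qsub -ckpt $1 -c $2 $3"
}
```

Calling `ckpt_qsub_cmd condor_ckpt x example.sh` prints a command line that can be reviewed before it is run by hand.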

EXAMPLE

Assumptions:

  1. I assume that the Condor libraries and the condor_compile command are installed on the system.

  2. The steps described below all have equivalent functions from the qmon(1) GUI. For the sake of simplicity the example is illustrated using the command line.

  3. I used a cluster of 2 nodes named hpc0 and hpc1.

  4. example.sh is the job script that contains the checkpointed application example_ckpt (source: example.c)

   # cat example.sh
   #!/bin/sh 
   cd /home/omar/wd
   ./example_ckpt
   exit 

What follows is a series of steps to run our example:

Step 1: configure SGE for checkpointing.

1a. Create a checkpointing environment
From the command line, type:
qconf -ackpt condor_ckpt
An editing session will come up that looks like this:

        ckpt_name          condor_ckpt
        interface          userdefined
        ckpt_command       none
        migr_command       none
        restart_command    none
        clean_command      none
        ckpt_dir           /home/omar/tmp
        queue_list         NONE
        signal             none
        when               sx


Edit the entries so that they look like the following:

        ckpt_name          condor_ckpt
        interface          USERDEFINED
        ckpt_command       /home/omar/wd/SGE/checkpoint.sh
        migr_command       /home/omar/wd/SGE/migrate.sh
        restart_command    /home/omar/wd/SGE/restart.sh
        clean_command      /home/omar/wd/SGE/clean.sh
        ckpt_dir           /home/omar/tmp
        queue_list         hpc0.q hpc1.q
        signal             USR2
        when               sx
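
The same environment can probably be created without the interactive editor. This sketch assumes that the installed qconf supports the -Ackpt option, which adds a checkpointing environment from a file. The function name write_ckpt_env and the file name condor_ckpt.conf are illustrative; the values are the ones shown above.

```shell
#!/bin/sh
# write_ckpt_env FILE   (hypothetical helper)
# Writes the checkpointing environment from this example to FILE, in the
# same "attribute value" layout that the qconf editing session shows.
write_ckpt_env() {
    cat > "$1" <<'EOF'
ckpt_name          condor_ckpt
interface          USERDEFINED
ckpt_command       /home/omar/wd/SGE/checkpoint.sh
migr_command       /home/omar/wd/SGE/migrate.sh
restart_command    /home/omar/wd/SGE/restart.sh
clean_command      /home/omar/wd/SGE/clean.sh
ckpt_dir           /home/omar/tmp
queue_list         hpc0.q hpc1.q
signal             USR2
when               sx
EOF
}

# Usage (needs an SGE installation, so it is left commented out):
#   write_ckpt_env condor_ckpt.conf
#   qconf -Ackpt condor_ckpt.conf
```

Keeping the definition in a file also makes it easy to recreate the environment on another cluster.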


1b. Configure the queues hpc0.q and hpc1.q for checkpointing

qconf -mattr queue qtype "CHECKPOINTING BATCH INTERACTIVE" hpc0.q 
qconf -mattr queue qtype "CHECKPOINTING BATCH INTERACTIVE" hpc1.q

NOTE: The contents of checkpoint.sh, migrate.sh and clean.sh are:

        # cat checkpoint.sh
        #!/bin/sh
        #$ -S /bin/sh
        # The application writes its pid to the pid file.
        PID=`cat /home/omar/wd/SGE/pid`
        kill -USR2 $PID
        exit
        
        # cat migrate.sh
        #!/bin/sh 
        #$ -S /bin/sh
        cd /home/omar/wd/SGE
        ./example_ckpt -_condor_restart /home/omar/wd/SGE/ss_condor.ckpt
        exit

        # cat clean.sh
        #!/bin/sh 
        #$ -S /bin/sh
        cd /home/omar/wd
        # Remove the checkpoint saving file
        /usr/bin/rm -f /home/omar/wd/SGE/.ckpt
        # Remove the error log file 
        /usr/bin/rm -f /home/omar/wd/SGE/submit_ss.e*
        # Remove the output file 
        /usr/bin/rm -f /home/omar/wd/SGE/submit_ss.o*
        exit 
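
The checkpoint script above reads the application's pid from a pid file that the application writes itself. Where the application cannot be changed to do that, a wrapper in the job script could record the pid instead. This is only a sketch under that assumption; run_with_pidfile is a hypothetical name and the wrapper is not part of the original how-to.

```shell
#!/bin/sh
# run_with_pidfile PIDFILE COMMAND [ARGS...]   (hypothetical helper)
# Starts COMMAND in the background, records its pid in PIDFILE so that a
# checkpoint script can signal it, then waits and returns its exit status.
run_with_pidfile() {
    pidfile=$1
    shift
    "$@" &
    child=$!
    echo "$child" > "$pidfile"
    wait "$child"
    status=$?
    # The pid is stale once the command has finished.
    rm -f "$pidfile"
    return "$status"
}
```

In the job script this could look like `run_with_pidfile /home/omar/wd/SGE/pid ./example_ckpt`. Note that a command started in the background of a non-interactive shell does not read from the terminal, so this only suits applications that need no interactive input.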

NOTE: In our case restart.sh is the same as migrate.sh. Tailor clean.sh to what you actually want removed, because some of the files may be needed for debugging.

Step 2: Prepare the application for checkpointing

2a. Modify the line in the condor_compile script from:

CONDOR_LIBDIR=

to

CONDOR_LIBDIR=/condor/lib

"/condor/lib" is the install path to the entire contents of the Condor "lib" subdirectory.

2b. Execute the following command:

condor_compile -condor_standalone cc -o example_ckpt example.c 

Step 3: Submit the checkpointed application.

3a. Execute the following command:

qsub -cwd -v TERM -ckpt condor_ckpt -c x example.sh

I used the option "-c x" because I wanted the job to be checkpointed only when it is suspended. In this example, the job gets suspended through suspension of the queue on which it is running (see steps 4 and 5). The -c option accepts other values; consult the qsub(1) manual page for details.

Step 4: Suspend hpc0.q queue on which the checkpointing job is running.

4a. Execute the following command to suspend hpc0.q:

qmod -s hpc0.q 

The job will migrate to the other queue, hpc1.q.

Use "qstat -f" to check the new status of jobs/queues.

4b. Execute the following command to unsuspend hpc0.q:

qmod -us hpc0.q 

Step 5: Suspend hpc1.q queue on which the checkpointing job is running.

5a. Execute the following command to suspend hpc1.q:

qmod -s hpc1.q 

The job will migrate to the other queue, hpc0.q.

Use "qstat -f" to check the new status of jobs/queues.

5b. Execute the following command to unsuspend hpc1.q:

qmod -us hpc1.q 

NOTE: Again, all of the command line steps above can also be performed from the qmon(1) GUI tool.

CKPT 1.3

Available at www.cs.wisc.edu/~zandy/ckpt.

NOTE: Everything seems to work correctly on RedHat 7.3, but NOT on RH9. I have asked the author for help.