Probabilistic Checkpointing

Hyochang NAM  Jong KIM  Sung Je HONG  Sunggu LEE  

IEICE TRANSACTIONS on Information and Systems   Vol.E85-D   No.7   pp.1093-1104
Publication Date: 2002/07/01
Online ISSN: 
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Fault Tolerance
checkpointing,  encoding,  rollback recovery,  fault tolerance,  aliasing,  

Full Text: PDF>>
Buy this Article

For checkpointing to be practical, it has to introduce low overhead for the targeted application. As a means of reducing the overhead of checkpointing, this paper proposes a probabilistic checkpointing method, which uses block encoding to detect the modified memory area between two consecutive checkpoints. Since the proposed technique uses block encoding to detect the modified area, the possibility of aliasing exists in encoded words. However, this paper shows that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed technique is analyzed and measured by using experiments. An analytic model which predicts the checkpointing overhead is first constructed. By using this model, the block size that produces the best performance for a given target program is estimated. In most cases, medium block sizes, i.e., 128 or 256 bytes, show the best performance. The proposed technique has also been implemented on Unix based systems, and its performance has been measured in real environments. According to the experimental results, the proposed technique reduces the overhead by 11.7% in the best case and increases the overhead by 0.5% in the worst case in comparison with page-based incremental checkpointing.