Analytical Model on Hybrid State Saving with a Limited Number of Checkpoints and Bound Rollbacks

Mamoru OHARA, Ryo SUZUKI, Masayuki ARAI, Satoshi FUKUMOTO, Kazuhiko IWASAKI

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences, Vol.E89-A, No.9, pp.2386-2395
Publication Date: 2006/09/01
Online ISSN: 1745-1337
DOI: 10.1093/ietfec/e89-a.9.2386
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Reliability, Maintainability and Safety Analysis
Keywords: reliability, distributed systems, hybrid state saving Time Warp simulation, evaluation model


This paper discusses distributed checkpointing with logging for practical applications running with limited resources. We present a discrete-time model that evaluates the total expected overhead per event when each process can hold only a finite number of checkpoints. In many actual applications, the rollback distance is also bounded by some finite interval; therefore, the recovery overhead of the checkpointing scheme is described using a truncated geometric distribution as the rollback distance distribution. Although the optimal checkpoint interval, which minimizes the total expected overhead, is difficult to derive analytically under the truncated geometric distribution, substituting other, simpler probability distributions for it allows us to derive the optimal interval explicitly. Numerical examples obtained through simulations show that the new models and analyses achieve nearly the minimum total overhead.
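The abstract does not spell out the overhead model, so the sketch below is only a toy stand-in under stated assumptions, not the authors' model. It shows the truncated geometric rollback-distance distribution on {1, ..., L} with parameter p (including the closed-form mean E[D] = 1/p - L q^L / (1 - q^L), where q = 1 - p) and a brute-force search for the checkpoint interval k, the kind of numerical minimization that is needed when the optimum is hard to derive analytically. The cost structure and every name and parameter (`overhead_per_event`, `best_interval`, `r`, `M`, `c_save`, `c_restore`, `c_event`, `c_far`) are hypothetical.

```python
def truncated_geometric_pmf(p, L):
    """P(D = d) for d = 1..L: geometric with success probability p,
    conditioned on D <= L. Index 0 of the returned list is d = 1."""
    if not (0.0 < p <= 1.0) or L < 1:
        raise ValueError("need 0 < p <= 1 and L >= 1")
    q = 1.0 - p
    norm = 1.0 - q ** L
    return [p * q ** (d - 1) / norm for d in range(1, L + 1)]


def truncated_geometric_mean(p, L):
    """Closed form E[D] = 1/p - L q^L / (1 - q^L)."""
    q = 1.0 - p
    return 1.0 / p - L * q ** L / (1.0 - q ** L)


def overhead_per_event(k, pmf, M, r, c_save, c_restore, c_event, c_far):
    """Expected overhead per event for checkpoint interval k.

    HYPOTHETICAL toy model (an illustration, not the paper's model):
      - a checkpoint every k events costs c_save, amortized as c_save / k;
      - each event triggers a rollback with probability r;
      - the process is j events past its latest checkpoint, j uniform on 0..k-1;
      - only the M most recent checkpoints are kept, so a rollback of d events
        is recoverable iff d <= j + (M - 1) * k; recovery then costs c_restore
        plus c_event per re-executed (coast-forward) event;
      - an unrecoverable rollback costs a flat, hypothetical c_far.
    """
    expected_rollback = 0.0
    for j in range(k):
        for d, prob in enumerate(pmf, start=1):
            if d <= j + (M - 1) * k:
                # Events between the restored checkpoint and the rollback target.
                coast = (j - d) % k
                cost = c_restore + c_event * coast
            else:
                cost = c_far
            expected_rollback += prob * cost
    expected_rollback /= k  # average over the uniform phase j
    return c_save / k + r * expected_rollback


def best_interval(k_max, **model):
    """Brute-force search for the k in 1..k_max with the lowest toy overhead."""
    return min(range(1, k_max + 1), key=lambda k: overhead_per_event(k, **model))


if __name__ == "__main__":
    pmf = truncated_geometric_pmf(p=0.2, L=50)
    model = dict(pmf=pmf, M=3, r=0.1, c_save=10.0,
                 c_restore=2.0, c_event=1.0, c_far=500.0)
    k_opt = best_interval(100, **model)
    print("toy optimal interval:", k_opt,
          "overhead:", overhead_per_event(k_opt, **model))
```

When M is large enough that every rollback is recoverable, the uniform phase j makes the coast-forward distance uniform on 0..k-1, so in this toy model the overhead reduces to c_save/k + r(c_restore + c_event(k-1)/2) regardless of the distribution. Limiting M adds a penalty term driven by the tail of the rollback distance distribution, which is where the truncated geometric law enters the search.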