A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

IEICE TRANSACTIONS on Information and Systems   Vol.E97-D   No.1   pp.65-76
Publication Date: 2014/01/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E97.D.65
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Dependable Computing
Keywords: fault-tolerance, large-scale distributed system, concurrent snapshot, checkpoint, rollback

Checkpoint-rollback recovery, a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm, especially when the system is large-scale, because repeatedly taking global snapshots of the whole system incurs an unacceptable communication cost. One such algorithm is the partial snapshot algorithm, which, instead of taking a global snapshot of the whole system, takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator. In this paper, we modify the previous partial snapshot algorithm to obtain a new one that takes partial snapshots more efficiently, especially when multiple nodes initiate the algorithm concurrently. Experiments show that the proposed algorithm greatly reduces the amount of communication needed to take partial snapshots.
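
The notion of a subsystem made up of the nodes that are communication-related to the initiator can be illustrated with a small sketch. The abstract does not define this relation precisely, so the code below assumes it means transitive reachability over message exchanges recorded since the last checkpoint. The function name `partial_snapshot_group` and the `comm_log` format are hypothetical and chosen only for illustration; this is not the paper's algorithm, which is a distributed protocol, but a centralized picture of which nodes a partial snapshot might cover under that assumption.

```python
from collections import deque


def partial_snapshot_group(initiator, comm_log):
    """Return the nodes transitively communication-related to `initiator`.

    Assumption (not from the paper): two nodes are communication-related
    if they exchanged a message since the last checkpoint, and the
    relation is closed transitively. `comm_log` is a hypothetical list of
    (sender, receiver) pairs.
    """
    # Build an undirected adjacency map from the recorded message pairs.
    neighbors = {}
    for sender, receiver in comm_log:
        neighbors.setdefault(sender, set()).add(receiver)
        neighbors.setdefault(receiver, set()).add(sender)

    # Breadth-first search from the initiator collects the subsystem.
    group = {initiator}
    queue = deque([initiator])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in group:
                group.add(peer)
                queue.append(peer)
    return group


# Nodes d and e never communicated with a, b, or c, so a snapshot
# initiated by a would not need to involve them.
comm_log = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(partial_snapshot_group("a", comm_log)))  # ['a', 'b', 'c']
```

In this toy setting, a snapshot initiated by `a` covers three of the five nodes, which suggests why a partial snapshot can need less communication than a global one when communication is localized.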