WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

Xinhai XU  Xuejun YANG  Yufei LIN  

IEICE TRANSACTIONS on Information and Systems   Vol.E95-D   No.3   pp.786-796
Publication Date: 2012/03/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E95.D.786
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Computer System
Keywords: Application-Level Checkpointing, weak coordinated, MPI, fault tolerance, consistency


As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem becomes increasingly serious. MPI is currently the de facto standard for building high-performance applications, and fault tolerance for MPI programs remains an active research topic. However, owing to the characteristics of MPI programs, most current checkpointing methods either need to modify the MPI library (or even the operating system) or implement a complicated protocol that logs a large number of messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of their applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any messages. It implements coordination using only Barrier operations instead of a complicated protocol. Furthermore, to reduce the cost of fault tolerance, we narrow the synchronization range of the barrier and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationships between processes. We also propose a fault-tolerance framework built on top of WBC-ALC and discuss its implementation. Experimental results on the NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that, compared with BC-ALC, WBC-ALC reduces the average coordination time and the average backup time of a single checkpoint by 44.5% and 5.7%, respectively.
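The difference between global and group synchronization can be illustrated with a small simulation. The sketch below is not the authors' implementation: it uses Python threads in place of MPI processes and `threading.Barrier` in place of an MPI barrier. The function names (`communication_groups`, `run_checkpoint`), the grouping rule (connected components of a hypothetical communication graph), and the use of a second barrier after saving are all assumptions made for illustration. The abstract states only that coordination relies on Barrier operations and that groups follow the communication relationships between processes.

```python
import threading
from collections import defaultdict


def communication_groups(num_ranks, edges):
    """Group ranks that communicate, directly or transitively (union-find).

    Assumption: a group is a connected component of the communication graph.
    """
    parent = list(range(num_ranks))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for rank in range(num_ranks):
        groups[find(rank)].append(rank)
    return sorted(groups.values())


def run_checkpoint(num_ranks, edges):
    """Simulate one checkpoint in which each rank waits only for its own group."""
    groups = communication_groups(num_ranks, edges)

    # One barrier per group, rather than one barrier over all ranks.
    barrier_of = {}
    for members in groups:
        barrier = threading.Barrier(len(members))
        for rank in members:
            barrier_of[rank] = barrier

    saved = {}
    lock = threading.Lock()

    def worker(rank):
        state = rank * 10  # stand-in for the application state of this rank
        barrier_of[rank].wait()  # group reaches the checkpoint location together
        with lock:
            saved[rank] = state  # stand-in for writing the checkpoint file
        barrier_of[rank].wait()  # assumed: group resumes only after all saved

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return groups, saved


if __name__ == "__main__":
    groups, saved = run_checkpoint(6, [(0, 1), (1, 2), (3, 4)])
    print(groups)
    print(saved)
```

In this sketch, ranks 0 to 2 never wait for ranks 3 and 4, and rank 5, which communicates with nobody, never waits at all. A global barrier would instead make all six ranks wait for the slowest one, which is the coordination cost the weak blocking scheme aims to reduce.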