MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

Namyoon WOO  Hyungsoo JUNG  Heon Young YEOM  Taesoon PARK  Hyungwoo PARK  

IEICE TRANSACTIONS on Information and Systems   Vol.E87-D   No.7   pp.1820-1828
Publication Date: 2004/07/01
Online ISSN: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Hardware/Software Support for High Performance Scientific and Engineering Computing)
Category: Distributed, Grid and P2P Computing
fault-tolerance,  checkpoint,  consistent recovery,  MPI,  grid computing,  

Full Text: PDF>>
Buy this Article

Fault-tolerance is an essential feature of the distributed systems where the possibility of a failure increases with the growth of the system. In spite of extensive researches over two decades, fault-tolerance systems have not succeeded in practical use. It is due to the high overhead and the unhandiness of the previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and the practice of fault-tolerance systems, and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source codes, and it affects the MPICH communication characteristics as less as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.