that is, all replicas are active and compute concurrently a different piece of the application parallel code. In our model replicas cooperate not only to detect and mask failures but also to perform parallel computation. The replication mechanisms are supported by FTAG run time system and are fully application-transparent. Different novel mechanisms for checkpointing and recovery are developed. In our model during rollback recovery only that part of the computation that was detected faulty is discarded. The replication technique takes full advantage of parallel computing to reduce overall computation time." />


A Novel Replication Technique for Detecting and Masking Failures for Parallel Software: Active Parallel Replication

Adel CHERIF  Masato SUZUKI  Takuya KATAYAMA  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E80-D   No.9   pp.886-892
Publication Date: 1997/09/25
Online ISSN: 
DOI: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Issue on Architectures, Algorithms and Networks for Massively Parallel Computing)
Category: Fault Tolerance
Keyword: 
fault tolerant systems,  distributed systems,  parallel computing,  functional paradigm,  checkpointing and recovery,  

Full Text: PDF>>
Buy this Article




Summary: 
We present a novel replication technique for parallel applications where instances of the replicated application are active on different group of processors called replicas. The replication technique is based on the FTAG (Fault Tolerant Attribute Grammar) computation model. FTAG is a functional and attribute based model. The developed replication technique implements "active parallel replication," that is, all replicas are active and compute concurrently a different piece of the application parallel code. In our model replicas cooperate not only to detect and mask failures but also to perform parallel computation. The replication mechanisms are supported by FTAG run time system and are fully application-transparent. Different novel mechanisms for checkpointing and recovery are developed. In our model during rollback recovery only that part of the computation that was detected faulty is discarded. The replication technique takes full advantage of parallel computing to reduce overall computation time.