PREGMA: A New Fault Tolerant Cluster Using COTS Components for Internet Services

Takeshi MISHIMA  Takeshi AKAIKE  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E86-D   No.12   pp.2517-2526
Publication Date: 2003/12/01
Online ISSN: 
DOI: 
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Issue on Dependable Computing)
Category: Dependable Systems
Keyword: 
fault tolerance,  low cost,  throughput overhead,  data integrity,  recovery time,  

Full Text: PDF>>
Buy this Article




Summary: 
We propose a new dependable system called PREGMA (Platform for Reliable Environment based on a General-purpose Machine Architecture). PREGMA aims to meet two requirements -- fault tolerance and low cost -- for Internet services. It can provide fault tolerance, so we can avoid system failure and prevent data corruption, even if faults occur. That is, it masks the faults by running multiple replicated servers, each possessing its own data, in a loosely synchronized manner and delivering the majority vote as output to clients. Moreover, PREGMA is composed of COTS (Commercial Off-The-Shelf) components without modification, which makes it possible to offer the services at a low cost. We investigated two approaches for achieving redundancy of the Coordinator, which is the core of PREGMA: using the primary backup method and the active replication method. We evaluated the effectiveness of PREGMA in terms of throughput overhead, data integrity and recovery time. The results for a prototype show that PREGMA using the Coordinator with the primary backup method outperforms that with the active replication method and has throughput only 3% lower than a non-redundant system. The results also show that, in the event of failure, the recovery time is only less than one second and no data corruption occurs.