Fault Tolerance in Decentralized Systems

Brian RANDELL  

Publication
IEICE TRANSACTIONS on Communications   Vol.E83-B   No.5   pp.903-907
Publication Date: 2000/05/25
Print ISSN: 0916-8516
Type of Manuscript: INVITED PAPER (IEICE/IEEE Joint Special Issue on Autonomous Decentralized Systems)
Keyword: concurrency, error recovery, co-ordinated atomic (CA) actions, exception handling, dependability

Summary: 
In a decentralised system the problems of fault tolerance, and in particular error recovery, vary greatly depending on the design assumptions. For example, in a distributed database system, if one disregards the possibility of undetected invalid inputs or outputs, the errors that have to be recovered from will affect only the database, and backward error recovery will be feasible and should suffice. Such a system typically supports a set of activities that compete for access to a shared database but are otherwise essentially independent of each other. In such circumstances, conventional database transaction processing and distributed protocols enable backward recovery to be provided very effectively. In more general systems, however, the multiple activities will often not simply be competing against each other, but will at times be attempting to co-operate with each other in pursuit of some common goal. Moreover, the activities in decentralised systems typically involve not just computers, but also external entities that are not capable of backward error recovery. Such additional complications make the task of error recovery more challenging, and indeed more interesting. This paper provides a brief analysis of the consequences of various such complications, and outlines some recent work on advanced error recovery techniques that they have motivated.