Efficient Recovery from Communication Errors in Distributed Shared Memory Systems

Jenn-Wei LIN  Sy-Yen KUO  

IEICE TRANSACTIONS on Information and Systems   Vol.E81-D   No.11   pp.1213-1223
Publication Date: 1998/11/25
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Fault Tolerant Computing
Keyword: communication errors, distributed shared memory systems, damage, loss, retransmission latency

This paper investigates communication errors in distributed shared memory (DSM) systems. Communication errors introduce two critical problems: damage and loss. Damage corrupts the transmitted data, which then produces incorrect computational results. Loss means the transmitted data disappears in transit and is never received. Loss can be resolved easily with acknowledgements, so we focus on handling damage efficiently. In DSM systems, the amount of data transferred between nodes is larger than the amount actually shared between them; that is, when a processing node receives data, it does not use every data item in it. Based on this property, we present a new technique for resolving the data damage problem in DSM systems. When a processing node receives damaged data, the technique lets it continue computing instead of blocking until the correct data arrives, so the latency of handling the damage can be hidden. The technique relies on an optimistic assumption, however; if that assumption does not hold, the latency is not hidden. To show the advantage and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
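
The idea of continuing on damaged data can be illustrated with a minimal sketch. The abstract does not state the optimistic assumption or how it is checked, so everything here is an assumption made for illustration: the class `OptimisticPage`, its `read` and `validate` methods, and the reading of the assumption as "the node does not touch the damaged items before the retransmitted copy arrives". It is not the authors' protocol.

```python
from dataclasses import dataclass, field


@dataclass
class OptimisticPage:
    """Hypothetical model of a received page that is known to be damaged
    somewhere but is used anyway while the retransmission is in flight."""
    data: bytes
    reads: set = field(default_factory=set)  # offsets read so far

    def read(self, offset: int) -> int:
        # Record the access so it can be checked once the correct copy arrives.
        self.reads.add(offset)
        return self.data[offset]

    def validate(self, correct: bytes) -> bool:
        # Assumed check: the optimistic run stands only if every byte
        # that was read matches the retransmitted copy.
        return all(self.data[i] == correct[i] for i in self.reads)


correct = bytes(range(16))
damaged = bytearray(correct)
damaged[9] ^= 0xFF  # simulate one byte damaged in transit

page = OptimisticPage(bytes(damaged))
page.read(0)
page.read(3)
hidden = page.validate(correct)      # only undamaged items were used

page.read(9)                         # now the damaged item is used
failed = not page.validate(correct)  # the optimistic assumption no longer holds
```

In this toy model, `hidden` corresponds to the case where the retransmission latency is overlapped with useful computation, and `failed` to the case where the node would have to discard or redo work, so the latency is not hidden.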