A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

Hoang-Gia VU  Shinya TAKAMAEDA-YAMAZAKI  Takashi NAKADA  Yasuhiko NAKASHIMA  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E101-D   No.2   pp.288-302
Publication Date: 2018/02/01
Online ISSN: 1745-1361
Type of Manuscript: Special Section PAPER (Special Section on Reconfigurable Systems)
Category: Device and Architecture
Keyword: checkpointing, FPGA, dependability, tree-based

Summary: 
Modern FPGAs have been integrated into computing systems as accelerators for long-running applications. This integration puts more pressure on the fault tolerance of computing systems, making dependability an essential requirement. As in CPU-based systems, checkpoint/restart techniques are expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model with a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. Second, we propose a method to capture consistent snapshots of the FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture incurs less than 10% degradation in maximum clock frequency, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
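
To make the reported LUT overheads easier to interpret, the following minimal sketch converts an overhead percentage into a resource multiplier. It assumes that "overhead" means the relative increase over the same design without checkpointing infrastructure, which the summary does not state explicitly. The function names `overhead_percent` and `scale_factor` are hypothetical and are not part of the authors' tool.

```python
def overhead_percent(baseline: float, with_checkpointing: float) -> float:
    """Relative increase over the baseline, in percent.

    Assumption: overhead is measured against the design without
    checkpointing infrastructure (not stated in the summary).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_checkpointing - baseline) / baseline * 100.0


def scale_factor(overhead_pct: float) -> float:
    """Resource multiplier implied by a given overhead percentage."""
    return 1.0 + overhead_pct / 100.0


# The two LUT overhead figures quoted in the summary.
for name, pct in [("Dijkstra", 17.98), ("Matrix Multiplication", 160.67)]:
    print(f"{name}: {pct}% overhead -> {scale_factor(pct):.4f}x the baseline LUTs")
```

Under that assumption, an overhead above 100%, as reported for Matrix Multiplication, would mean that the checkpointing infrastructure uses more LUTs than the original design itself.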