Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

Yoshikazu INAGAKI  Shinya TAKAMAEDA-YAMAZAKI  Jun YAO  Yasuhiko NAKASHIMA  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E98-D   No.12   pp.2141-2149
Publication Date: 2015/12/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2015PAP0015
Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)
Category: Architecture
Keyword: CGRA, coarse grained reconfigurable architecture, accelerator, library, stencil, optimization

Summary: 
The Energy-aware Multi-mode Accelerator eXtension (EMAX) [24],[25] is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing, as well as low power consumption. However, before mapping algorithms onto the accelerator, application developers need sufficient knowledge of its hardware organization and specially designed instructions. When no well-designed compiler or library is available, they must also spend significant effort tuning the code to improve execution efficiency. To address this problem, we focus on library support for stencil (nearest-neighbor) computations, a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) the system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; and (3) a performance evaluation of the library for PDE solvers. Because the library can reuse local data across outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is reduced to as little as 1/7 of that read by hand-tuned code. In addition, the stencil library reduced the execution time by 23% compared with a general-purpose processor.
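The data-reuse idea in the summary can be illustrated with a minimal sketch in plain Python. It is not the EMAX library or its instruction mapping: the 7-point averaging kernel, the function names, and the "plane load" counter are all assumptions made for illustration. The counter stands in for reads from main memory. The first sweep fetches three z-planes for every output plane, while the second keeps the two planes it already holds and fetches only one new plane per outer loop iteration.

```python
def _update_plane(out_plane, below, mid, above):
    """Apply a 7-point nearest-neighbor average to the interior of one z-plane."""
    ny, nx = len(mid), len(mid[0])
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out_plane[y][x] = (
                mid[y][x]
                + mid[y][x - 1] + mid[y][x + 1]   # x neighbors
                + mid[y - 1][x] + mid[y + 1][x]   # y neighbors
                + below[y][x] + above[y][x]       # z neighbors
            ) / 7.0


def _copy_grid(grid):
    """Deep-copy a nested-list grid so boundary cells keep their input values."""
    return [[row[:] for row in plane] for plane in grid]


def jacobi7_naive(grid):
    """One sweep that re-fetches three planes for every output plane."""
    out = _copy_grid(grid)
    plane_loads = 0  # hypothetical stand-in for main-memory reads
    for z in range(1, len(grid) - 1):
        below, mid, above = grid[z - 1], grid[z], grid[z + 1]
        plane_loads += 3
        _update_plane(out[z], below, mid, above)
    return out, plane_loads


def jacobi7_reuse(grid):
    """One sweep that keeps two planes local and fetches one new plane per z step."""
    out = _copy_grid(grid)
    if len(grid) < 3:
        return out, 0
    below, mid = grid[0], grid[1]
    plane_loads = 2
    for z in range(1, len(grid) - 1):
        above = grid[z + 1]          # the only new fetch in this iteration
        plane_loads += 1
        _update_plane(out[z], below, mid, above)
        below, mid = mid, above      # slide the window: reuse data already held
    return out, plane_loads


if __name__ == "__main__":
    demo = [[[z * 100 + y * 10 + x for x in range(4)] for y in range(4)]
            for z in range(5)]
    _, naive_loads = jacobi7_naive(demo)
    _, reuse_loads = jacobi7_reuse(demo)
    print("plane loads without reuse:", naive_loads)
    print("plane loads with reuse:", reuse_loads)
```

Both sweeps produce the same result; only the number of plane fetches differs. In this toy model the saving approaches a factor of three for large grids, which is not the 1/7 figure reported above: that figure also depends on outer-loop unrolling and on the EMAX local-memory organization, neither of which is modeled here.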