For Full-Text PDF, please login, if you are a member of IEICE,|
or go to Pay Per View on menu list, if you are a nonmember of IEICE.
Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators
Yoshikazu INAGAKI Shinya TAKAMAEDA-YAMAZAKI Jun YAO Yasuhiko NAKASHIMA
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2015/12/01
Online ISSN: 1745-1361
Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)
CGRA, coarse grained reconfigurable architecture, accelerator, library, stencil, optimization,
Full Text: PDF>>
The Energy-aware Multi-mode Accelerator eXtension , (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.