A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers

Xizhen XU  Sotirios G. ZIAVRAS  

IEICE TRANSACTIONS on Information and Systems   Vol.E89-D   No.2   pp.639-646
Publication Date: 2006/02/01
Online ISSN: 1745-1361
DOI: 10.1093/ietisy/e89-d.2.639
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Parallel/Distributed Computing and Networking)
Category: Parallel/Distributed Algorithms
Keywords: configurable computing, FPGA, SIMD, parallel processing, memory switching, FFT, hardware-software codesign

FPGAs (Field-Programmable Gate Arrays) are widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, several challenges stand in the way of further performance gains: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; and mapping an application algorithm to FPGAs remains a daunting task in configurable system design. To address these problems, we propose the FPGA-based Hierarchical-SIMD (H-SIMD) machine, codesigned with the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions, implemented as FPGA functions for coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks, that facilitate program development, code portability across different H-SIMD implementations, and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. If needed, multiple FPGA chips can work in unison at a higher SIMD level under the control of the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We use the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance, and the H-SIMD machine even outperforms a Xeon processor for this problem.
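
To make the idea of coarse-grain partitioning concrete, the following is a minimal software sketch of a 2D FFT split into block-level tasks. It assumes the common row-column decomposition (1D FFTs over all rows, then over all columns), which the abstract does not confirm as the authors' method. The names `fft1d`, `row_block_task`, `split_blocks`, `fft2d`, and the `num_units` parameter are hypothetical and are not part of PISA or of the H-SIMD design; each block of rows merely stands in for one coarse-grain task that could be issued to a separate processing unit.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def row_block_task(block):
    """One coarse-grain task (hypothetical): 1D FFTs over a whole block of rows."""
    return [fft1d(row) for row in block]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def split_blocks(matrix, num_units):
    """Partition rows into contiguous blocks, one per assumed processing unit."""
    size = (len(matrix) + num_units - 1) // num_units
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def fft2d(matrix, num_units=2):
    """Row-column 2D FFT built from block-level tasks (run sequentially here)."""
    # Pass 1: row FFTs, one task per block of rows.
    rows = [r for blk in split_blocks(matrix, num_units)
            for r in row_block_task(blk)]
    # Pass 2: column FFTs, expressed as row FFTs on the transpose.
    cols = [r for blk in split_blocks(transpose(rows), num_units)
            for r in row_block_task(blk)]
    return transpose(cols)
```

In this sketch the tasks run one after another in a single process. On real hardware the blocks would be dispatched concurrently, and the transfer of the next block could overlap with computation on the current one, which is the kind of overhead hiding that the abstract attributes to memory switching.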