For Full-Text PDF, please login, if you are a member of IEICE,|
or go to Pay Per View on menu list, if you are a nonmember of IEICE.
A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access
Heungseop AHN Seungwon CHOI
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Publication Date: 2017/05/01
Online ISSN: 1745-1337
Type of Manuscript: PAPER
Category: Communication Theory and Signals
GPU, CUDA, turbo decoder, coalesced memory access, SDR,
Full Text: PDF(837.5KB)
>>Buy this Article
The sub-blocking algorithm has been known as a core component in implementing a turbo decoder using a Graphic Processing Unit (GPU) to use as many cores in the GPU as possible for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to be adopted for processing a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes a lot of unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly as low as 1/8 in the case of an Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of a GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's Geforce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. Throughput provided by the proposed method has been observed to be 51.4Mbps when the number of iterations and that of sub-blocks are set to 6 and 32, respectively, in our experimental tests, which far exceeds the performance of previous works implemented the Max-Log-MAP algorithm.