A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

Heungseop AHN  Seungwon CHOI  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E100-A   No.5   pp.1188-1196
Publication Date: 2017/05/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E100.A.1188
Type of Manuscript: PAPER
Category: Communication Theory and Signals
GPU,  CUDA,  turbo decoder,  coalesced memory access,  SDR,  

Full Text: PDF>>
Buy this Article

The sub-blocking algorithm has been known as a core component in implementing a turbo decoder using a Graphic Processing Unit (GPU) to use as many cores in the GPU as possible for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to be adopted for processing a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes a lot of unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly as low as 1/8 in the case of an Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of a GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's Geforce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. Throughput provided by the proposed method has been observed to be 51.4Mbps when the number of iterations and that of sub-blocks are set to 6 and 32, respectively, in our experimental tests, which far exceeds the performance of previous works implemented the Max-Log-MAP algorithm.