Parallelism Analysis of H.264 Decoder and Realization on a Coarse-Grained Reconfigurable SoC

Gugang GAO  Peng CAO  Jun YANG  Longxing SHI  

IEICE TRANSACTIONS on Information and Systems   Vol.E96-D   No.8   pp.1654-1666
Publication Date: 2013/08/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E96.D.1654
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Reconfigurable Systems)
Category: Application
coarse-grained reconfigurable array,  SoC,  REMUS-II,  H.264 decoder,  mapping strategy,  hybrid partitioning,  sub-MB parallelism,  

Full Text: PDF>>
Buy this Article

One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to efficiently map applications. The key issues for mapping are (1) how to reduce the memory bandwidth, (2) how to exploit parallelism in algorithms and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called ‘Hybrid partitioning’, for mapping a H.264 high definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining good features of data partitioning and task partitioning, our methodology mainly consists of three levels from top to bottom: (1) hybrid task pipeline based on slice and macroblock (MB) level; (2) MB row-level data parallelism; (3) sub-MB level parallelism method. Further, on the sub-MB level, we propose a few mapping strategies such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, 2D-wave for intra 44, parallel processing order for deblocking. With our mapping strategies, we improved the algorithm's performance on REMUS-II. For example, with a luma 1616 MB, the Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than fixed 44 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when the working frequency of REMUS-II is 200 MHz. Compared with typical hardware platforms, we can achieve better performance, area, and flexibility. For example, our performance achieves approximately 175% improvement than that of a commercial CGRA processor XPP-III while only using 70% of its area.