A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

Yibo FAN  Leilei HUANG  Kewei CHEN  Xiaoyang ZENG  

IEICE TRANSACTIONS on Electronics   Vol.E103-C   No.5   pp.263-273
Publication Date: 2020/05/01
Publicized: 2019/11/27
Online ISSN: 1745-1353
DOI: 10.1587/transele.2019ECP5008
Type of Manuscript: PAPER
Category: Integrated Electronics
Keyword: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), hardware implementation

Neural networks have been among the most useful techniques in speech recognition, language translation, and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while existing hardware implementations lack configurability. To fill this gap, this paper proposes a highly configurable 7.62 GOP/s hardware implementation of LSTM. To achieve this goal, the workflow is arranged to make the design compact and high-throughput; the structure is organized to make the design configurable; the data buffering and compression strategy is chosen to lower the bandwidth without increasing structural complexity; and the data type, logistic sigmoid (σ) function, and hyperbolic tangent (tanh) function are optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s at 238 MHz on an XCZU6EG FPGA while using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU at 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. Its resource consumption is also much lower than that of state-of-the-art works.
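
For orientation, the computation that such an accelerator targets can be sketched in software. The abstract does not give the cell equations, so the block below assumes the standard LSTM formulation without peephole connections and uses plain floating point rather than the paper's optimized data type and σ/tanh approximations. The names `lstm_step` and `ops_per_cycle` are hypothetical helpers introduced here for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, the sigma function named in the abstract."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell (assumed form, no peepholes).

    x      : input vector (list of floats)
    h_prev : previous hidden state, length n
    c_prev : previous cell state, length n
    W, U, b: dicts keyed by gate name "i", "f", "o", "g" holding the
             input weights (n x len(x)), recurrent weights (n x n),
             and biases (length n).
    Returns the new hidden state and cell state.
    """
    n = len(h_prev)

    def affine(gate):
        # W[gate] @ x + U[gate] @ h_prev + b[gate], one row at a time.
        return [
            sum(W[gate][r][k] * x[k] for k in range(len(x)))
            + sum(U[gate][r][k] * h_prev[k] for k in range(n))
            + b[gate][r]
            for r in range(n)
        ]

    i = [sigmoid(v) for v in affine("i")]    # input gate
    f = [sigmoid(v) for v in affine("f")]    # forget gate
    o = [sigmoid(v) for v in affine("o")]    # output gate
    g = [math.tanh(v) for v in affine("g")]  # candidate cell values

    c = [f[r] * c_prev[r] + i[r] * g[r] for r in range(n)]
    h = [o[r] * math.tanh(c[r]) for r in range(n)]
    return h, c


def ops_per_cycle(gops: float, clock_mhz: float) -> float:
    """Average operations per clock cycle implied by a GOP/s figure."""
    return gops * 1e9 / (clock_mhz * 1e6)
```

Applied to the figures quoted above, `ops_per_cycle(7.62, 238)` gives roughly 32 operations per clock cycle. How the paper counts an operation is not stated in the abstract, so this is only a rough reading of the reported throughput.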