In Search of the Performance- and Energy-Efficient CNN Accelerators*

Stanislav SEDUKHIN†, Nonmember, Yoichi TOMIOKA††, and Kohei YAMAMOTO†††, Members

SUMMARY In this paper, starting from the algorithm, a performance- and energy-efficient 3D structure or shape of the Tensor Processing Engine (TPE) for CNN acceleration is systematically searched and evaluated. An optimal accelerator’s shape maximizes the number of concurrent MAC operations per clock cycle while minimizing the number of redundant operations. The proposed 3D vector-parallel TPE architecture with an optimal shape can be very efficiently used for considerable CNN acceleration. Due to implemented support of inter-block image data independency, it is possible to use multiple of such TPEs for the additional CNN acceleration. Moreover, it is shown that the proposed TPE can also be uniformly used for acceleration of the different CNN models such as VGG, ResNet, YOLO, and SSD. We also demonstrate that our theoretical efficiency analysis is matched with the result of a real implementation for an SSD model to which a state-of-the-art channel pruning technique is applied.

key words: multi-channel convolution, tensor processing, vector-parallel computing, data reusing, computing efficiency.

1. Introduction

Deep neural networks (DNNs) have recently demonstrated their extraordinary ability to make predictions on large amounts of data with very high accuracy. A very popular class of DNNs, commonly used to analyze visual imagery, is multi-layer convolutional neural networks (CNNs) where basic processing represents the layer-to-layer changeable multi-channel 2D convolution. For some CNNs, this type of convolution represents more than 90% of the total workload [1–3]. Each such convolution requires a massive execution of the same scalar multiply-accumulate (MAC) operation on the multi-dimensional tensor data [4]. A native algorithm for multi-channel 2D convolution exhibits an enormous concurrency of MAC operations, and therefore, hardware accelerators with a big number of MAC units are a natural solution to such computational requirements imposed by CNNs.

As soon as it was recognized that a multi-channel 2D convolution used in CNN can be effectively converted to the general matrix-by-matrix multiply-add (GEMM) [5–7], the previously well-known systolic array processing (SAP) [8,9] began to be widely used in designing CNN accelerators. For example, Google, in a few generations of Tensor Processing Units (TPUs) [10–13], has used the sizeable systolic array architectures to greatly accelerate convolution operations. These accelerators can be efficiently used in various stages of the machine learning: whether in training or inference, on edge devices or on the cloud [14].

The main advantages of SAP, which are favorable to VLSI systems [15,16], can be listed as follows: (1) homogeneity and simplicity of processing elements (PEs) to implement MAC operation; (2) a structural regularity and planarity of processing; (3) a locality of mesh-like PEs interconnect; (4) a deep data reuse by data flows across the array processor which drastically reduces the number of expensive memory references.

However, despite these beneficial factors, SAP has substantial disadvantages which may limit a practical usage for some applications. Some of these disadvantages are: (1) relatively big latency due to pipelining of processing via data streaming through the array of simple PEs (the latency is proportional to the size of systolic array); (2) even if all 2D data are available and ready for processing, e.g., all pixels in an image, such matrix data should be pre-arranged and loaded vector-by-vector through peripheral PEs into a planar systolic array. To reduce a relatively big startup latency of systolic array processing, but keep the same throughput of computing, i.e., keep the same number of concurrent MAC operations per clock cycle, Google has changed the size and number of the systolic-based Matrix Multiply Units (MXUs) in the different generations of TPUs from a single huge 256×256 array of 8-bit MAC units [10] in the TPUv1 to the four smaller 128×128 arrays of MXUs per chip in the two latest TPU versions [13,17]. Note that pipelining of vector-by-vector distribution in a planar systolic array processor has been replaced with a vector-by-vector broadcast [18, 19] in the non-systolic Tesla’s Full-Self-Driving (FSD) chip which totally eliminate the startup latency and mesh-like interconnect of systolic processing [3, 20].

In this paper we are going to eliminate or reduce the SAP disadvantages while keeping the main advantages by, firstly, assuming that all data are available for computing and, secondly, data are reused inside array processor not by streaming across PEs but by circulation amongst PEs such that all resulting data remains inside an array processor and ready for the further computing and reusing. Moreover, we will demonstrate the preferences of CNN acceleration by using not GEMM-based planar approach, but by direct 3D implementation of a multi-channel 2D convolution which does not require data rearrangement. Another our target is

† The author is with the University of Aizu, Japan
†† The corresponding author. The author is with the University of Aizu, Japan
††† The author is with the Oki Electric Industry Co., Ltd., Japan
* This paper was presented at the 2021 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)
to find an optimal single structure or 3D shape of CNN accelerator which maximizes the number of concurrent MAC results per clock cycle while minimizing the number of redundant MAC operations when implementing all layer-by-layer multi-channel convolutions for any given CNN model.

2. Multi-channel 2D Convolution

2.1 General Equation and Operational Concurrency

A multi-channel 2D convolution can be defined for each output element by the following output-centric [21] equation:

\[ Y_{m,r,c} = \sum_{n=1}^{N} \sum_{i \in I} \sum_{j \in J} X_{n,r+i,c+j} \cdot W_{n,m,i,j} + b_m, \]  

(1)

where \( Y_{M \times R \times C} \) is a 3D tensor of output feature maps (OFMs), \( X_{N \times M \times K} \) is a 3D tensor of input feature maps (IFMs), \( W_{N \times M \times K} \) is a 4D tensor of parameter maps or learned filter coefficients (weights), \( b_m \) is a 1D vector of biases, \( N \) and \( M \) are the number of input and output channels, respectively, and the filter window size is \( K \times K \). Note that the order of summations in Eq. (1) is not explicitly defined, i.e., any order is permissible.

The data used in the multi-channel 2D convolution are shown schematically in Fig. 1(a) where the spatial sizes of input and output feature maps are usually making equal by proper zero padding of IFM. The size of padding is computed as \( p = (K - 1)/2 \). Note that for each CONV layer the input OFMs \( Y_{in} \) should be padded with a zero tensor or populated by a bias tensor for the first layer or an OFM tensor from the previous layer to implement a possible residual operation or short-cut connection [22, 23] (see Fig. 1(a)).

The \( \forall \)-quantifier "for each" in Eq. (1) clearly shows independent element-wise computing of the OFM \( Y_{out} \) where each output \( Y(m, r, c) \)-element requires exactly \( K^2 \times N \) scalar multiply-accumulate (MAC) operations in the form of

\[ y \leftarrow x \cdot w + y. \]

Obviously, the total number of these MAC operations to compute Eq. (1) is \( K^2 \times N \times MRC \) and if each MAC operation can be executed in a single time-step or clock cycle then Eq. (1) can be computed sequentially element-by-element in \( K^2 \times N \times MRC \) time-steps.

Each MAC operation requires three-read/one-write memory accesses to deliver three operands to the MAC unit and return resulting operand to the memory. This required data movement from/to four-ported memory is considered to be the most expensive part of MAC operation in terms of the latency and energy consumption [17, 24, 25]. A well-known technique to diminish this problem is a deep data reusing in each level of memory hierarchy (the main memory, caches, register files) and between MAC units (like in systolic array processors).

Note that the total number of involved data elements, i.e., the IFM, filter coefficients and initial OFM is \( RCN + K^2 \times N \times M + RC \), i.e., the MAC intensity or the average number of concurrent MACs per a single three-operand data access can be estimated as a ratio \([K^2 \times NMRC/(RCN + K^2 \times NM + RC)]\). The MAC intensity can also be interpreted as degree of single operand reusing. For example, the MAC intensity \( K^2 \times NMRC/RCN = K^2M \) shows that each IFM-element \( x(n,r,c) \) after reading from the memory can be reused in \( K^2M \) different MAC operations while the MAC intensity \( K^2 \times NMRC/K^2 \times NM = RC \) shows the number of MAC operations which can reuse the same filter coefficient \( w(n,m,i,j) \in W_{in} \).

The \( \forall \)-quantifier also demonstrates that potentially possible MAC concurrency is \( MRC \). Therefore, a computing of Eq. (1) can also be implemented in \( K^2N \) time-steps with the \( MRC \) intermediate MAC results per time-step. This potentially possible 3D MAC concurrency consists of the spatial \((R \times C)\) parallelism of pixels in the IFM and vector parallelism of \( M \) output channels. Computing of Eq. (1) can be expressed in the form of holistic program (see Fig. 1(b)) which keeps an \( M \times R \times C \) scalar MAC concurrency inherent in Eq. (1) on each time-sep and, therefore, shows how to compute (1) in \( K^2N \) time-steps.

3. Scalable Vector-parallel Tensor Computing

The multi-channel 2D convolution (1) can be viewed as a set of 2D convolutions for Multiple-Input and Multiple-Output (MIMO) channels where each set consists of a nested sub-set of 2D convolutions for Single-Input Multiple-Output (SIMO) channels which, in turn, includes the lowest level of a nested sub-sub-set of 2D convolutions for Single-Input Single-Output (SISO) channels.

3.1 SISO Channels 2D Convolution

The SISO convolution is a standard 2D convolution of a given image. For our case, it is specified as a 2D convolution of the IFM for \( n \)-th input channel, i.e., an \((R \times C)\)-matrix \( X_n \), with a 2D \((K \times K)\)-filter \( W_{n,m} \); resulting in computing of an \((R \times C)\)-matrix \( Y_n \) for any \( n \in \{1 : N\} \) and \( m \in \{1 : M\} \). It is again assumed that an input matrix \( X_n \) is properly padded.

The 2D convolution can be viewed as the sum of the point-wise \((1 \times 1)\) convolutions computed for each input pixel in any order. The summation order can be common for all the pixels in an image by massively-parallel moving or sliding of the pixels across each other in a some predefined sequence which ensures an involvement in the summation of all needed pixels from the window-based area.

In terms of operational concurrency, this observation allows to implement an independent updating of all output pixels \( y(n) \in Y_n \) by sequentially supplying the weighted neighbor pixels \( x(i) \in X_n \) from the \((K \times K)\)-window to the central to this window pixel \( x(0) \) which can be any input pixel \( x(i) \in X_n \) as shown in Fig. 2(a). It can be effectively implemented by cyclically shifting or systolically rolling the image along a Hamiltonian path which defines a neighbor-pixel-by-pixel delivery in some linear order [26, 27].

†The IFM edge pixels should not be lost for further reusing.
3.2 SIMO Channels 2D Convolution

It is clear from the Eq. (1) that an IFM for the same input channel \( n \in \{1 : N\} \) is reused for all \( M \) output channels, i.e., the same \((R \times C)\)-matrix \( X_n \) is used for updating an output \((R \times C \times M)\)-tensor \( Y_{1:M} \). This updating requires to use also a corresponding \((K \times K \times M)\)-tensor of filter coefficients \( W_{\text{a:1:M}} \) as shown in Fig. 2(b). This vector-parallel style of computing (1) is implemented on the \((R \times C)\)-array of \( M \)-element vector MAC (VMAC) units with a torus interconnect. Such organization allows to compute a SIMO 2D convolution in also \( K^2 \) time-steps while producing not planar \( RC \), but volume of \( RCM \) intermediate MAC results per time-step [28]. Each of the \( RC \) VMAC units executes on each step a scalar-by-vector multiply-add operation \( \tilde{y}(\cdot) \) on each of \( K^2 \) time-steps, i.e., processing pattern for the SIMO 2D convolution is
• **Broadcast** \{M-vector \( \hat{w} \);
• **Update** \((R \times C \times M)\)-tensor \( Y_{1:M} \);
• **Roll** \((R \times C)\)-matrix \( X_n \).

3.3 MIMO Channels 2D Convolution

The SIMO 2D convolution can be naturally used to compute the MIMO 2D convolution by iteratively updating an OFM tensor \( Y_{1:M} \) for all the input channels \( n \in \{1 : N\} \) [28]. For every new input channel it can be done by keeping a previously computed output tensor \( Y_{1:M} \) on the \((R \times C)\)-array of vector registers \( \hat{y}(i) \) while changing an IFM matrix \( X_n \) and use a corresponding tensor \( W_{n,1:M} \) to implement the next SIMO convolution. Note that an order of the input channel selection is not required to be fixed.

3.4 Block Convolution

The existing limitations of memory and computing resources forced to divide the \((R \times R)\) IFM into an array of \((r \times r)\) tiles, process each tile independently, and combine partial solutions into the final \((R \times R)\) OFM \(^\dagger\). However, implementation of a kernel-wise convolution leads to an issue in computing near the edges of the tiles due to inter-tile dependencies. To address this, each individual tile should be slightly overlapped or padded by providing supplementary data at the boundary. The tile padding size is also setting as \( p = (K - 1)/2 \). We call this padded tile of pixel data as a block. A 3D \((r+2p) \times (r+2p) \times N\) tensor of the IFM blocks for all \( N \) input channels, which is called a pillar, provides all needed data to compute a single \((r \times r)\)-tile of the OFM.

In terms of a tile padding, there are at least three possible methods: zero or constant padding, repeated or mirrored padding and overlapped padding. Zero padding pads the boundary pixels with zeros while repeated padding duplicating the boundary pixels outwards based on the coherency of the pixels in an image. For these two methods there is no need neither explicitly pad each tile nor rearrange memory pattern on hardware, because block padding can be merged into process of convolution either by initialization or by manipulating memory addressing, which is a memory-efficient approximation of original convolution [29].

The selected here overlapped padding is based on replication of the boundary pixels from the neighbor tiles such that it solves the inter-tile data dependency problem. The result of this block overlapped 2D convolution is exactly the same as non-blocked convolution of the original image, but requires additionally data rearrangement in the memory [30, 31].

### 4. Accelerator’s Architecture and Area-Time Analysis

#### 4.1 A generic architecture of the TPE

An architecture of the accelerator can be directly obtained from the above analysis of the block multi-channel 2D convolution. This analysis covers the algorithm’s complexity, operational concurrency, different ways of data reusing, blocking, etc. A generic architecture of the CONV accelerator which we call the Tensor Processing Engine (TPE) is shown in Fig. 3. The TPE is connected to the host computer which implements other than CONV operations with low data reusing like pooling, quantization, activation, data management, etc., in consideration of Amdahl’s Law to properly support fast compute acceleration of the TPE.

The host connects to the TPE through a common off-chip memory and, besides, the TPE has its own on-chip memory to store blocks of the input and output data which are needed to concurrently compute an \((m \times r \times r)\) intermediate elements of the \((M \times R \times R)\) OFM tensor data, where the vector length \( m \in \{1 : M\} \) and the array size \( r \in \{K : R\} \).

The on-chip distributed memory with a parallel read/write data access is directly connected to the \((r \times r)\)-array of Vector MAC (VMAC) units. Each VMAC unit has the scalar and \( m \)-way vector register file (RF) to implement scalar-by-vector MAC operation \( \hat{y}(i) \leftarrow x(i) \cdot \hat{w}(i) + \hat{y}(i) \). It is clear that this architecture is based on the three levels of memory hierarchy: off-chip memory, on-chip memory and array of the RFs.

The scalar registers of the MAC array, storing initially a block of \( x()\)-operands from the IFM \( X_m \), are interconnected by torus network (see Fig. 3 where wrap-around connections are not shown for clarity). The \( m\)-element vector of weights \( \hat{w}(i) \) is replicated by broadcast from the on-chip memory to all the \( m\)-vector \( \hat{w}(i)\)-operand registers of the MAC array. This global one-to-all network is shown in Fig. 3 in the red color.

A vector-parallel computing of \((m \times r \times r)\) intermediate elements of the \((M \times R \times R)\) output tensor data directly follows an approach described in Section 3 which uses a “broadcast-compute-roll” processing pattern on each time-step. A proposed \( m\)-vector \( \hat{y}(i)\)-stationary approach is based on an accumulation of \( m\)-partial sums (PSUM) inside each VMAC unit (shown in Fig. 3 in the green color).

#### 4.2 Computing of the OFM “Brick” and “Pillar”

It is clear that SIMO-computing of an \((m \times r \times r)\) output data

\(^\dagger\)For simplicity, we assume \( R = C \) and \( r \) is the size of a tile.
We call an initial tensor. This SIMO-computing of the \( n \) for a single input channel is finished in time-steps to sequential (serial) computing with a single MAC unit, i.e.,

\[ T_1 = R^2N^2K^2M \]  

(time-steps are needed for computing a given CONV layer under assumption that a single three-read-one-write MAC operation is executed in one time-step or one clock cycle \( \tau_1 \).

On the other hand, as it was shown above, the number of time-steps to compute a CONV layer on the TPE with the shape of \( (m \times r \times r) \) is

\[ T_{mr^2} = K^2N[\frac{M}{m}]\left[\frac{R}{r}\right]^2. \]  

Here, all the \( mr^2 \) MAC results are computed in one clock cycle \( \tau_{mr^2} \) with a deep data reusing. More specifically, a single \( m \)-element vector of weights \( \hat{w}(\cdot) \) is reused spatially by broadcast over an \( (r \times r) \) array of VMAC units while all elements \( x(\cdot) \) of the input \( [(r + K - 1) \times (r + K - 1)] \)-block are reused temporally during \( K^2 \) time-steps through data rolling or cyclical shift across an array of VMAC units.

The speedup of a vector-parallel computing of CONV layer on the TPE over a corresponding sequential execution can be expressed as

\[ S_{mr^2} = T_1 \frac{T_{mr^2}}{T_{mr^2}} = \frac{M}{[\frac{M}{m}]} \left( \frac{R}{[\frac{R}{r}]^2} \right) \leq mr^2, \]  

time-step or clock cycle for vector-parallel computing \( \tau_{mr^2} \) can be perceptibly less than that of sequential computing \( \tau_1 \). Moreover, for both of serial and parallel implementations, due to the different cost of data accesses in memory hierarchy [24] as well as different degrees of operation concurrency and data reusing, the clock periods of execution likely to be dissimilar for various MAC operations. These are the main reasons in discrepancy between the amount of MAC operations and real speed of processing which has been noticed in previous works [32, 33]. We, however, assume, for the simplicity of comparison that all MAC operations are similar and that \( \tau_1 = \tau_{mr^2} \).

Note also that although the number of MAC operations does not directly translate to the CONV layer runtime [34], it is indicative of the runtime and allows for comparing CNN cost independent of hardware and associated software implementation details [35].

4.3.2 Convolution via Systolic Matrix Multiplication

To use highly-optimized in software and hardware a GEneral Matrix Multiplication (GEMM), many deep learning
frameworks convert a multi-channel 2D convolution (1) to this basic operation [6, 7, 32]. In this case, reshaping of the output tensor $Y_{R \times R \times M}$ into a matrix $Y_{R^2 \times M}$ can be done as follows (see, e.g., [36])

$$Y(m, r, c) = Y(m, r + R(c - 1)).$$

Reshaping of the input tensor $X_{R \times R \times N}$ into a matrix $X_{R^2 \times K^2 \times N}$ is implemented as

$$X(n, r + i - 1, c + j - 1) = X(r + R(c - 1), i + K(j - 1) + K^2(n - 1))$$

and, finally, reshaping the kernel tensor $W_{K \times K \times N \times M}$ into a matrix $W_{K^2 \times N \times M}$ is done by

$$W(m, n, i, j) = W(m, i + K(j - 1) + K^2(n - 1)).$$

Using these matrices, the multi-channel convolution (1) can be computed as a matrix-by-matrix multiply-add

$$Y_{R^2 \times M} = X_{R^2 \times K^2 \times N} \times W_{K^2 \times N \times M} + Y_{R^2 \times M}, \tag{6}$$

where, initially, a matrix $Y_{R^2 \times M}$ is either a zero matrix or an output matrix from the previous layer.

The 3D index space of a computing (6) as well as surrounding input/output matrices are shown in Fig. 5(a). Because each index point in a 3D $K^2 \times N \times M$ computational index space is associated with a single MAC operation, the total number of such operations to compute (6) is the same as for the direct computing of (1). However, a matrix multiply-add approach requires $K^2$ more elements in an input matrix $X$, i.e., more memory accesses would be needed.

The 2D array processors as the different sets of MAC units as well as the initial data flows for matrices $X, W$ and $Y$ can be formally obtained by linear projections of a 3D index space into a 2D processor space [16, 37–39]. Figures 5(b), (c), (d) demonstrate three such projections of a given 3D index space along the RC-edge, M-edge, and $K^2N$-edge for the so-called weight-stationary, input-stationary, and output-stationary variants of the planar array processors, respectively.

For the systolic time-space scheduling of the MAC operations and data movement inside the 3D index space, the total number of time-steps or clock cycles to implement computing (6) is equal to the length of a critical path, i.e., $K^2N + R^2 + M - 2$, which is preserved for all the 3D→2D projections [40, 41]. However, the startup delay and computing latency, measured also in the time-steps, will be different for the different projections. Indeed, for the weight-stationary projection, a startup delay is $K^2N + M - 2$ while computing latency is $R^2$; for the input-stationary, a startup delay is $R^2 + K^2N - 2$ and computing delay is $M$; for the output-stationary, a startup delay is $R^2 + M - 2$ and computing latency is $K^2N$.

The all three systolic array processors have the same potential speedup of computing

$$S_{\text{systolic}} = K^2N R^2 M / (K^2 N + R^2 + M - 2),$$

but different efficiency or utilization levels: $R^2 / (K^2 N + R^2 + M - 2)$ for the weight-stationary variant, $M / (K^2 N + R^2 + M - 2)$ for the input-stationary, and $K^2N / (K^2 N + R^2 + M - 2)$ for the output-stationary variant. Note that our proposal also in the case of unlimited parallelism has the potential speedup

$$S_{\text{direct}} = R^2$$

and, therefore, it has the highest possible efficiency of 100%.

For the CNN VGG16 (see Table 1 below), Fig. 6 shows the speedup ($a$) and efficiency ($b$) of the different systolic and direct implementations of (1). As it can be seen, the speedup of a direct computing is always higher than any systolic implementation, Fig. 6(a), while an efficiency is sufficiently varied not only for the differently projected systolic arrays but also for the different CNN layers, Fig. 6(b). Indeed, an efficiency of the weight-stationary systolic array processor is changing from 99.8% for the first CONV layer to 3.7% on the last three CONV layers, while for the output-stationary array processor, an efficiency is varied from 0.05% at the beginning to 86.7% at the end of processing. The input-stationary variant never reaches a computing efficiency higher than 10%.

Obviously, to keep a high efficiency of the systolic VGG16 computing, it is possible to use a weight-stationary systolic array processing from the beginning until the conv3_3 layer and then switch to the output-stationary systolic processing. But even in this combined case, due to the inherent startup delay, a systolic array processing will never reach the highest possible efficiency of computing as in the direct implementation.

4.4 Selection of TPE’s shape for a single CONV layer

As it can be seen, the selection of a proper vector-parallel ($m \times r \times r$)-shape of the 3D TPE for a single CONV layer is crucial with respect to the achievable levels of performance (shown as Eq. (4)) and data reusing (shown, partially, as an efficiency). Fig. 7 exhibits a solution space for selection of the TPE’s shape for a single CONV layer. The three cases are shown: (a) a serial implementation on the TPE with one scalar MAC unit, i.e., when $m = r = 1$ (in the blue color); (b) an extremely vector-parallel implementation when the vector length $m = M$ and the array size $r = R$ (in the red color); (c) a tiled vector-parallel implementation for the vector length $m < M$ and the array size $K \leq r < R$ (in the green color).

An area $A_{r,m}$ of each rectangle with an ($m \times r \times r$) TPE’s shape is proportional to the number of time-steps required for this specific implementation of the CONV layer. Obviously, for the extreme cases (a) and (b) we have an equality $A_{1,1} = A_{R,M}$. However, for the case (c), due to a "ceiling" function $\lceil x \rceil = \min\{n \in \mathbb{Z} | n \geq x\} = n$ in Eq. (4), an area $A_{r,m}$ may be greater than $A_{1,1} = A_{R,M}$, i.e., the TPE with this shape may compute a number of unnecessary or redundant MAC operations. The number of these operations can be estimated as $A_{1,1} - A_{r,m}$. The lower computing efficiency means that more of such unnecessary MAC operations are executed.

It is important to note that if the "ceiling" functions $\lceil x \rceil$ in Eq. (4) do not have remainders, concurrently for both $x =$
masked and isolated from the main data processing path, affect the speedup and efficiency.

The redundant MAC operations can be traditionally handled to optimize the speedup of computing on the TPE. In this case, the speedup of computing on the TPE with such a shape is $S_{m/r^2} = m/r^2$ with the maximal possible efficiency of computing $E_{m/r^2} = 1$. Otherwise, the TPE executes the redundant MAC operations which will affect the speedup and efficiency.

Traditionally, the redundant MAC operations can be masked and isolated from the main data processing path, but it would require additional logic, control and energy consumption. A masking of the vector MAC operations is relatively simple and straightforward, but its implementation in the fixed VMAC array might not be trivial because this array executes not only computing, but also supports data reusing by cyclical data shift across locally interconnected VMAC units.

### 4.5 Selection of a single TPE’s shape for all CONV layers

The selection of a proper TPE’s shape becomes even more complex under a run-time variety of the CONV layer sizes in the same CNN model. In the existing state-of-the-art CNNs all the CONV layer parameters ($R, N, M, K$) can be changed from layer to layer and the number of such layers can be from a few to many tens and even hundreds. In our case, a dynamic vector length ($m$) and array size ($r$) adjustment by the run-time TPE reconfiguration requires the additional processing/memory logic, time and power which will affect a computing efficiency. In existing CNNs the spatial dimension ($R$) is gradually shrunk while the channel

---

**Fig. 5** The 3D space of a matrix-by-matrix multiply-add $(a)$ and its corresponding linear projections onto a 2D array processor space along $RC$-edge $(b)$, $M$-edge $(c)$, and $K^2N$-edge $(d)$.

**Fig. 6** Computing speedup $(a)$ and efficiency of the direct and different systolic implementations $(b)$.

**Fig. 7** An area-time solution space of the scalable TPEs.
The total number of MAC operations shows simultaneously for parallel, but not vector, execution of each layer required for processing a single CONV layer \((l)\) is computed as

\[
T_{1,l,r}^* = \sum_{\ell=1}^{L} T_{1,l,r}^{(\ell)},
\]

where \(L\) is the total number of CONV layers \((L = 13\) in the VGG16\) and \(T_{1,l,r}^{(\ell)} = R(1)N(l)K^2(l)M(1)\) is a sequential or serial time complexity of the layer \(l\).

Now we can calculate the number of time-steps

\[
T_{l,r}^{(\ell)} = K^2(l)N(l)M(l)[R(l)/r^*]^2
\]

required for parallel, but not vector, execution of each layer \(\ell \in \{1, 2, ..., L\}\) on the planar TPE with an \((1 \times r \times r)\)-shape, i.e., \(m = 1\), for the all possible array sizes \(K \leq r \leq R_{\text{max}}\), where \(R_{\text{max}} = 224\) for our case (see Table 1).

After that the total number of time-steps required for execution of all CONV layers on the same planar TPE can be computed as \(T_{1,r}^* = \sum_{\ell=1}^{L} T_{1,l,r}^{(\ell)}\). Then the speedup \(S_{l,r} = T_{1,r}^* / T_{1,l,r}^* \leq r^2\) and efficiency \(E_{l,r} = S_{l,r}/m(r)^2 \leq 1\) for TPE with an \((m \times r \times r)\)-shape are calculated. Then, for this final part, the optimal array size \(m^*\) is selected for both the efficiency of processing \(E_{m^*,r^*} = \max_{m} \{E_{m,r}\}\) and time of computing \(T_{m^*,r^*} = \min_{m} \{T_{m,r}\}\). For example, for the VGG16, as it can be seen from the Table 2, the value \(m^* = 64\), which is a GCD for all sizes of the output channels \(M\), drastically increases a MAC concurrency without any efficiency penalty.

Table 2 shows the parameters for a few popular CNNs including the number of time-steps needed for serial CONV layers execution \(T_{1}^*\), the optimal as well as suboptimal vector length \(m^*\) and array size \(r^*\), the number of time-steps \(T^*\) required for a vector-parallel processing on the TPE with an \((m^* \times r^* \times r^*)\)-shape, and the corresponding speedup \(S^*\) and efficiency \(E^*\). Note that to satisfy possible hardware limitations, such as the size of VMAC array \((r \times r)\) and the vector length \((m)\), it is feasible to scale a TPE’s shape not only up, but also down while keeping a reasonable computing efficiency of vector-parallel processing.

As it can be seen from the Table 2 there is the optimal or suboptimal vector-parallel TPE shape which can be very efficiently used for considerable acceleration of computing all CNN layers. Moreover, due to an image inter-block independency, it is possible to use a cluster of such highly-efficient
TPEs for the additional CNN acceleration. Furthermore, the single TPE or cluster of TPEs can also be utilized for efficient vector-parallel execution of the different CNN models which are collectively used in some important applications such as self-driving cars [3, 20].

5. Hardware Implementation

We implemented a CNN accelerator based on a single TPE on Xilinx Kintex KU5P FPGA board and evaluated the performance, power efficiency, and computational efficiency for PCAS/SSD [44]. The block diagram of our CNN accelerator is shown in Fig. 8. A TPE consists of \( r \times r \) VMAC units with mesh interconnect. Each VMAC is composed of \( m \) scalar MAC units. blue Each edge VMAC unit has additional registers to store halo elements. Moreover, it has additional registers to avoid long wrap-around torus interconnect. In this implementation, we set \( r \) and \( m \) as 19 and 11, respectively, which are selected based on the analysis of Table 2.

The bit widths of activations and weights are 4 bits and 2 bits, respectively, while the bit width of accumulation registers is 12 bits. We implemented four MAC units using a DSP block of KU5P. These bit widths are determined by the evaluation using all test images from the PASCAL VOC 2007 dataset. We investigated the accuracy degeneration due to quantization and the overflow of each convolutional layer computation of PCAS/SSD with 2 bits for weight, 4 bits for activation, and 12 bits for the accumulator. Table 5 shows the mean average precision (mAP) before and after the application of quantization and PCAS. The overflow occurred only in the conv6 layer of PCAS/SSD, and the number of times of overflow occurred in PCAS/SSD computing was only three. The accuracy degradation after quantization and channel reduction was 1.5%.

The Data Manager has Xin, Win, Yout, First-In-First-Out (FIFO) for data transfer from/to an external memory via the Direct Memory Access (DMA) module with a width of 512 bit. An Xin FIFO stores a block of the input feature map data, and the word size of Xin FIFO is \( (r+K-1) \times (r+K-1) \times 4 \), which is for \( K = 3 \) is 1764 bits. The Xin FIFO receives a pillar of an input feature map with DMA read instruction. Until executing FIFO Flush instruction, the pillar is reused inside the FIFO. In other words, we can read the first channel of the pillar after we read the last channel. Note that data reusing of FIFO is realized by changing the address pointer of BRAM to avoid redundant reads and writes. A Win FIFO stores parameter and the word size of Win FIFO is \( 2m = 22 \) bits. We place 24 Win FIFOs in parallel so that Win FIFOs can receive data from DMA every clock. The Win FIFOs receive all parameters of a convolutional layer with DMA read instruction, which can reuse data as in Xin FIFO. Yout FIFO is to store bricks of output feature map calculated by TPE.

The micro controller unit (MCU) can execute DMA data read/write, loop control, and TPE execution in parallel based on the 32-bit instruction code. The TPE can reduce the waiting time for data transfer and achieve high TPE utilization by data reusing in Xin and Win FIFOs and parallel execution of data transfer and computing. Our instruction set consists of type-C, type-S, type-L, and type-H instructions. The type-C is for computing a multi-channel tensor convolution. BRICKIS of type-C loads a channel of a pillar of Xin FIFO, computes the convolution while broadcasting \( m \)-vector of weights, and stores the brick to the Yout FIFO iteratively. The type-S is for reading from and writing to the DDR memory. SETST sets the address given in 20-bit operand to address pointer specified by 3-bit operand, and ADDST increments the specified address pointer by the value given in 20-bit operand. STINR reads a pillar of the address pointer from the DDR memory and pushes it to the specified FIFO. This pillar can be reused until the specified FIFO is cleared by FLUSH instruction. Similarly, STOUT writes a specified number of bricks of the specified FIFO to the DDR memory. In type-L, LSET sets the value to the specified counter and LOOP decrements the specified counter and goes \( n \) instructions back where \( n \) is given by 16-bit operand if the specified counter is larger than zero. In type-H, HALT stops the program counter until receiving an awake signal. In Fig. 9, we show a code example for calculating convolution for a pillar. In this code example, NBP is the number of bricks in an output pillar, NMP is the number of 4KB memory blocks in an input pillar, NC is the number of channels, KS is the kernel size, NMB is the number of 4KB memory blocks in a brick. This code executes 19-19-11·KS·KS·NC MAC operations in almost KS·KS·NC clocks. Note that STINR reads the next pillar from the DDR memory while BRICKIS computes convolution for the current pillar. We developed a compiler for the proposed hardware implementation. This compiler receives a neural network model that are compressed and quantized by PCAS and generates instruction codes. Those codes with a nested loop are loaded to the TPE’s instruction memory shown in Fig. 8.
Table 4  Implementation Data

<table>
<thead>
<tr>
<th>Device</th>
<th>CNN Model</th>
<th>Reference</th>
</tr>
</thead>
<tbody>
<tr>
<td>KUSP</td>
<td>PCAS/SSD</td>
<td></td>
</tr>
<tr>
<td>LUT</td>
<td>120,035</td>
<td></td>
</tr>
<tr>
<td>FF</td>
<td>84,566</td>
<td></td>
</tr>
<tr>
<td>DSP</td>
<td>1,088</td>
<td></td>
</tr>
<tr>
<td>Max frequency [MHz]</td>
<td>183</td>
<td></td>
</tr>
<tr>
<td>Frequency [MHz]</td>
<td>180</td>
<td></td>
</tr>
<tr>
<td>Peak performance [GOPS]</td>
<td>1,430</td>
<td></td>
</tr>
<tr>
<td>Throughput [GOPS]</td>
<td>1,265</td>
<td></td>
</tr>
<tr>
<td>Power [W]</td>
<td>2.27</td>
<td></td>
</tr>
<tr>
<td>Power efficiency [GOPS/W]</td>
<td>629.5</td>
<td></td>
</tr>
</tbody>
</table>

Table 5  MAC concurrency per clock cycle and efficiency of MAC array.

<table>
<thead>
<tr>
<th>Accelerator</th>
<th>b</th>
<th>Efficiency</th>
<th>CNN Model</th>
<th>Reference</th>
</tr>
</thead>
<tbody>
<tr>
<td>Eyeriss</td>
<td>168</td>
<td>36%</td>
<td>VGG-16</td>
<td>[45]</td>
</tr>
<tr>
<td>ConvAix</td>
<td>192</td>
<td>76%</td>
<td>VGG-16</td>
<td>[46]</td>
</tr>
<tr>
<td>FSD</td>
<td>9,216</td>
<td>80%</td>
<td>Inception-v4</td>
<td>[20]</td>
</tr>
<tr>
<td>TPUv1</td>
<td>65,536</td>
<td>11% / 22%</td>
<td>CNN0 / CNN1</td>
<td>[47]</td>
</tr>
<tr>
<td>TPUv2</td>
<td>32,768</td>
<td>87% / 44%</td>
<td>CNN0 / CNN1</td>
<td>[48]</td>
</tr>
<tr>
<td>TPUv3</td>
<td>65,536</td>
<td>65% / 25%</td>
<td>CNN0 / CNN1</td>
<td>[48]</td>
</tr>
</tbody>
</table>

The implementation result of our CNN accelerator is shown in Table 4. The CNN Accelerator achieved a throughput of 88.5% of the peak performance, which is close to the optimal efficiency of PCAS explained in Section 4.5.2. Moreover, we achieved a power efficiency of 629.5 GOPS/W. We also show the theoretical estimation and real efficiency for each layer of PCAS/SSD in Fig. 10. The degeneration of real efficiency compared with theoretical analysis is caused by data transfer from/to external memory. However, the efficiency of our CNN accelerator is comparable with theoretical analysis in most layers. Because the data reusing rate decreases in layers where the number of input channels is relatively small, the efficiency is decreased in such layers. Moreover, the block sizes of conv9_1 to conv11_2 are 10 × 10 to 3 × 3. As a result, our TPE of 19 × 19 × 11 executes many redundant MAC operations for these layers. Therefore, the efficiency of these layers is low. On the other hand, we achieve high efficiency of 88.5% across all CONV layers of PCAS/SSD.

We show the efficiency of other CNN accelerators in Table 5. The efficiencies for the all versions of TPs are estimated by using the referenced Roofline models. As it is mentioned in [47], CNN0 is AlphaZero, a reinforcement learning algorithm with extensive use of CNNs, which mastered the games chess, Go, and shogi while CNN1 is a Google-internal model for image recognition. A relatively low computational efficiency or utilization (33%) of the latest Google’s TPU4i across the eight most utilized at Google models is also mentioned in [17]. As it can be seen our implementation and theoretical analysis of TPE demonstrate a computational efficiency that is higher or comparable with state-of-the-arts practical accelerators.

For the VGG16 model, the proposed accelerator with a TPE of 19 × 19 × 11 achieves an efficiency of 71.0% while the theoretical efficiency is 72.5%. Even though the TPE is not optimized for the VGG16 model (see Table 2), it achieves higher efficiency than Eyeriss. Also, it is comparable with ConvAix, while the number of MACs/cc is much higher. Moreover, if we select r and m optimized for VGG16, we can achieve close to the maximum efficiency and performance.

6. Conclusions and Future Work

The scalable algorithm and proper single shape 3D vector-parallel architecture of the Tensor Processing Engine for efficient layer-to-layer computing of the multi-channel 2D convolutions are proposed and evaluated. An optimal accelerator’s shape maximizes the number of concurrent MAC operations per clock cycle while minimizes the number of redundant operations. The proposed 3D vector-parallel TPE architecture with an optimal shape can be very efficiently used for considerable acceleration of the CNN computing. Due to supported inter-block image data independency, it is possible to use multiple of such highly-efficient TPEs for the additional computing acceleration.

It is important to note that a possible limitation of our approach which is mainly based on the involvement in a computing all the available tensor data and, therefore, a very high memory bandwidth is required. However, we can see a few possible scenarios to soften or even eliminate this limitation.

Firstly, because of all major CNN models are computed iteratively, layer-by-layer, i.e., an output data of one layer is an input data for the next layer, it is possible to keep this data
inside an array processor without sending it to the memory by implementing an output-stationary variant of computing including the CNN layers fusion [49]. Moreover, this input and output data are originally the 3D tensors and proposed a 3D array processing keeps these tensors in a natural order without destroying an integrity of data.

Secondly, an initial input of the practically big multi-channel data from the image sensors, actuators, etc., affects only the first CONV layer in a CNN model. A processing of this layer might be tightly coupled with a tensor data acquisition, e.g., by fusion of the array processor with a smart or intelligent image sensor [50] which has a highly-parallel pixels read-out [51].

And last but not least, a blocking of the input multidimensional data tensor (see Subsection 3.4) divides a convolution problem into the set of totally independent subproblems with the smaller size reflecting the hardware limitations such as admissible operational concurrency, memory size and bandwidth, I/O capabilities, etc. Computing of each such subproblem can be efficiently overlapped with an input/output of relatively small data tensors.

Note also that a possible extension of the proposed vector-parallel computing to other less compute-intensive types of CNN operations such as activation, pooling, quantization and others, which are currently assumed to be executed on the host computer, can be effectively combined and implemented inside a proposed VMAC array, are considered as a future work.

7. Acknowledgement

This paper is based on results obtained from a project, JPNP16007, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). We thank Dr. Yukoh Matsumoto (TOPS Systems Corp.) for a great help in porting the TPE to FPGA. We also thank the anonymous reviewers for their many insightful comments and suggestions.

References


Stanislav Sedukhin received his Ph.D. and Dr.Sc. (habilitation) from the Russian Academy of Sciences in 1982 and 1993, respectively. From 1993 he joined the University of Aizu where he was a professor, Dean of Graduate School and Vice-President. Dr. Sedukhin research interests are in architectural synthesis of application-specific array processors and massively-parallel algorithms. Currently, he is Professor Emeritus of the University of Aizu.

Yoichi Tomioka received the B.E., M.E., and D.E. degrees from the Tokyo Institute of Technology, Tokyo, Japan, in 2005, 2006, and 2009, respectively. He was a Research Associate with the Tokyo Institute of Technology until 2009. He was an Assistant Professor with the Division of Advanced Electrical and Electronics Engineering, Tokyo University of Agriculture and Technology until 2015. He was an Associate Professor with the School of Computer Science and Engineering, University of Aizu until 2018. Since 2019, he has been an Senior Associate Professor with the university. His research interests include image processing, hardware acceleration, high-performance computing, electrical design automation, and combinatorial algorithms. He is a member of IPSJ and IEEE.
Kohei Yamamoto  Kohei Yamamoto received his B.E., and M.E. degrees from Department of Management Systems Engineering, Chuo University, Tokyo, Japan, in 2012 and 2014, respectively. He is currently a research engineer at Oki Electric Industry Co., Ltd. working on developing machine learning algorithms.