disclaimers-space.snu.ac.kr/bitstream/10371/166587/1/000000160550.pdf · 2020-05-18 · mac is...
Attribution-NonCommercial 2.0 Korea (CC BY-NC 2.0 KR)

You are free to use this work under the following conditions:

- You may reproduce, distribute, transmit, display, and publicly perform the work.
- You may produce derivative works.

You must follow these conditions:

- When you use or distribute the work, you must clearly indicate the license conditions applied to it.
- These conditions may be waived with permission from the copyright holder.
- The author's moral rights are not affected by the above.

This is a human-readable summary of the license (Legal Code).

Disclaimer

- Attribution: you must attribute the original author.
- NonCommercial: you may not use this work for commercial purposes.
Project Report of Master of Engineering
Design of a Low-Power and High
Performance MAC for CNNs
CNN을 위한 Low-Power 와 High Performance
MAC 설계
February 2020
Graduate School of Engineering Practice
Seoul National University
Department of Engineering Practice
Seungwan Baek
Design of a Low-Power and High
Performance MAC for CNNs
Prof. Hyuk-Jae Lee
Submitting a Master’s Project Report
February 2020
Graduate School of Engineering Practice
Seoul National University
Department of Engineering Practice
Seungwan Baek
Confirming the Master’s Project Report written
by
Seungwan Baek
February 2020
Chair Cheol Seong Hwang (Seal)
Examiner Hyuk-Jae Lee (Seal)
Examiner Woo-Young Kwak (Seal)
Abstract
State-of-the-art Convolutional Neural Networks (CNNs) have more
convolution layers than those in the past to increase the accuracy of
classification and super-resolution. Much research has focused on
reducing network size to save computational cost while keeping high
accuracy, and on optimizing the convolution layer itself. This paper
proposes approximate computing using novel 4-2 compressors and
applies them to Baugh-Wooley and Booth multipliers. Convolution
layers in CNNs consist of multiply-and-accumulate (MAC) operations.
We applied the approximate multipliers to a modified MAC for highly
efficient Field Programmable Gate Array (FPGA) resource utilization.
As a result, the proposed approximate compressors show 11.5% less
area-delay product (ADP) and 29.6% less area-power product (APP),
respectively, than the previous design. Finally, the modified MAC is
implemented in VDSR hardware to compare output images with a
reference, yielding 37.6 dB with PACD 2 on the Booth multiplier.

Keywords: CNN; FPGA; MAC; Multiplier; Approximate Compressor
Student Number: 2018-28454
Table of Contents

Chapter 1. Introduction
  1.1 Study Background
Chapter 2. Backgrounds
  2.1 CNN Architecture
  2.2 CNN Hardware Architecture
Chapter 3. Proposed MAC for CNN
  3.1 Exact 4-2 Compressor
  3.2 Proposed 4-2 Compressor Design 1
  3.3 Proposed 4-2 Compressor Design 2
  3.4 Unsigned Dadda Tree Multiplier
  3.5 Signed Modified Baugh Wooley Multiplier
  3.6 Signed Radix-4 Booth Multiplier
  3.7 A Modified MAC
  3.8 VDSR Hardware Structure
Chapter 4. Evaluation Results
  4.1 Approximate Compressors
  4.2 Approximate Compressors in Multipliers
  4.3 Error Analysis
  4.4 Multiplier Comparison in CNN Application
Chapter 5. Conclusion
Chapter 6. Discussion
Chapter 1. Introduction
1.1 Study Background
The Convolutional Neural Network (CNN) was invented by Y. LeCun
et al. [1, 2] and has recently become more popular because it
outperforms other deep learning algorithms at image recognition,
classification, speech recognition, and similar tasks. Its operation
can be divided into two parts. The first is training: for example, to
classify images, the network is trained on the features of existing
images in a database. The second is classification: the network
analyzes the features of a new input image, matches them against the
training data, and classifies the image. CNNs are explained in more
detail in the next section. It is important to classify images
accurately, and to increase recognition accuracy a CNN needs more
convolution layers. For example, AlexNet in Fig. 1, which has five
convolution layers, won first prize at the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) in 2012, achieving a top-5
error of 15.3% [3]. ResNet, which won ILSVRC 2015, expanded the
number of layers to 152 and achieved a top-5 error of 5.71% [4].
Table I shows the top-5 error rates at the ILSVRC between 2012 and
2015; the error rate dropped dramatically as the number of layers
grew. This suggests that increasing the number of convolution layers
is essential for high accuracy in CNNs. However, when CNNs are
implemented on hardware such as CPUs, GPUs, and Field Programmable
Gate Arrays (FPGAs), the hardware cost increases in proportion to
the number of layers, because convolution layers are
computation-intensive whereas fully connected (FC) layers are
memory-intensive [5]. Various approaches have been studied to reduce
the computational cost of CNNs. Y. Gal et al. [30] proposed a
Bayesian CNN for small data by placing a probability distribution
over the CNN's kernels. X. Lin et al. [15] studied binary CNNs, in
which the weight and bias values are constrained to {-1, +1} at
run-time, dramatically reducing memory size and computational cost.
Convolution layers in hardware consist of multipliers and adders,
which perform the well-known multiply-and-accumulate (MAC)
operations. A large number of layers requires long computation time
as well as many hardware resources. Therefore, the best trade-off
between accuracy and time/cost must be found.
Fig. 1. The AlexNet Architecture
Table I ILSVRC Top-5 Error Rate, 2012 to 2015

Network                 Error Rate (%)  Depth (Number of Layers)
AlexNet (ILSVRC '12)    15.3            8
VGG (ILSVRC '14)        7.3             19
GoogLeNet (ILSVRC '14)  6.7             22
ResNet (ILSVRC '15)     5.71            152
CPUs and GPUs are widely used hardware systems for deep
learning. Both are performance-oriented architectures [6] and thus
consume a lot of power; in other words, their energy-to-performance
efficiency is poor. The hardware accelerator is another typical
implementation for deep learning alongside the CPU and GPU. The
off-chip hardware accelerator architecture is illustrated in Fig. 2.
The CPU sends operation commands to the hardware accelerator, which
communicates with memory directly to perform the tasks the CPU
orders. While the hardware accelerator is working on these tasks,
the CPU can perform other tasks.
The hardware accelerator can take various forms.
Application-specific integrated circuits (ASICs) and FPGAs are the
most popular hardware accelerators. An ASIC accelerator is more
efficient than an FPGA, but the FPGA is more flexible, since many
different types of applications can be implemented on it. For these
reasons, FPGA environments are becoming popular due to their
superior energy-to-performance efficiency [7, 8]. Moreover, recent
research shows that their performance is also comparable to CPUs and
GPUs [9]. However, very limited resources are the main weak point of
FPGA systems. For MAC operations, Digital Signal Processing (DSP)
and Look-Up Table (LUT) blocks can be used in an FPGA. Usually, only
DSPs are used for an a × b multiplication, and since MACs run in
parallel, the number of simultaneous MAC operations equals the
number of DSPs used. D. Nguyen et al. [29] proposed a double MAC,
which can cut the number of DSPs in half. However, it is only useful
when the bit widths are small, such as a 4 × 4 multiplication; if
the multiplication exceeds 27 × 18 (the DSP48E2 in the Xilinx ZCU102
FPGA board), two DSPs, or one DSP and extra LUTs, are required.
Z. Zhang et al. [30] proposed increasing the number of simultaneous
MAC operations. When DSP resources are not enough to cover all the
multiplications required by the network, LUTs are used for the
remaining multiplications, but they are not used efficiently.
Fig. 2. Off-Chip Hardware Accelerator Architecture
A multiplier consists of three parts: partial product generation,
partial product reduction, and final addition using carry-propagate
adders. Approximation can be applied at any of these steps. For
example, in [16], an approximation technique is applied in the
partial product generation stage. Partial product perforation, which
skips the generation of some partial products, is considered in
[17]; the authors implemented the technique in various multipliers
and compared power consumption and error rate. Meanwhile, in the
partial product reduction step, different types of adders, such as
half/full adders and 4-2 compressors, are used, and a large portion
of previous work on approximate multipliers has focused on this
step. Esposito et al. [18] approximated the half adder by
simplifying the 2-1 compressor (half adder) to an OR gate; with only
1/16 error probability, the full adder can be simplified to just one
AND gate and two OR gates. 4-2 compressors are also commonly used
because they reduce the partial products faster than half/full
adders. Another approach to approximate multiplication is introduced
in [19]; this work divides a large multiplication into small blocks,
such as 2 × 2 multiplier blocks, and builds adjustable multipliers.
Truncation is also widely studied: C. Chang et al. [20] used a
multiplexer-based truncated array multiplier to reduce power and
area.

There is a trade-off between accuracy and computational cost, such
as power consumption and area, in multiplier implementation.
Therefore, it is imperative to find the optimal point between them.
In this paper, low-power and area-efficient multipliers are
presented by proposing and applying new approximate 4-2 compressors.
Moreover, the conventional MAC is modified to perform two operations
at the same time. VDSR is used as a CNN application to confirm the
effects of the MACs.
Chapter 2. Backgrounds
2.1 CNN Architecture
A CNN consists of input and output layers, with hidden layers
between them, as shown in Fig. 3. The hidden layers are the key to a
CNN because they determine the accuracy of the outputs. The hidden
layers are a series of convolution (CONV) layers, activation
functions such as the rectified linear unit (ReLU), pooling layers,
and fully connected (FC) layers.
Fig. 3. A Regular n-layer Neural Network
CONV layers compute on the inputs, weights, and biases. Weights are
also known as kernels. As shown in Fig. 4, as the inputs pass
through a kernel, they are multiplied by the weights and then
accumulated with the biases. These are matrix multiplications, so
they can be computed in parallel. The number of outputs equals the
number of kernels.
Fig. 4. Convolution Calculation
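As a concrete illustration, the convolution just described can be sketched as plain multiply-accumulate loops. The function name, shapes, and values below are illustrative, not taken from the thesis:

```python
# A CONV layer as nested multiply-accumulate (MAC) loops: valid padding,
# stride 1, one output feature map per kernel. Values are illustrative.

def conv2d(inp, kernels, biases):
    H, W = len(inp), len(inp[0])
    out_maps = []
    for kernel, bias in zip(kernels, biases):
        kh, kw = len(kernel), len(kernel[0])
        out = [[bias] * (W - kw + 1) for _ in range(H - kh + 1)]
        for r in range(H - kh + 1):
            for c in range(W - kw + 1):
                for i in range(kh):
                    for j in range(kw):
                        # one MAC: multiply an input by a weight, accumulate
                        out[r][c] += inp[r + i][c + j] * kernel[i][j]
        out_maps.append(out)
    return out_maps

inp = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],      # two 2 x 2 kernels
           [[1, 1], [1, 1]]]      # -> two output feature maps
maps = conv2d(inp, kernels, biases=[0, 0])
```

Every window position is an independent dot product, which is why the MACs can run in parallel, and the number of output maps equals the number of kernels.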
The outputs of the kernels pass through activation functions. ReLU
is the most widely used activation function because it reduces the
likelihood of the gradient vanishing and induces sparsity. It sets
negative values to zero and keeps positive values, as expressed in
(1).

h = max(0, y), where y = wx + b (1)
Another key concept of CNNs is pooling. It is a non-linear function
that reduces the number of features. Max pooling is the most
commonly used: it takes the maximum value among the values in the
filter window. Fig. 5 shows how max pooling operates.
Fig. 5. Max Pooling with a 2 × 2 filter and stride = 2
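The ReLU of (1) and the max pooling of Fig. 5 can be sketched together; the function names and input values here are illustrative:

```python
# ReLU (eq. (1)) followed by 2 x 2 max pooling with stride 2, as in
# Fig. 5: each non-overlapping window keeps only its maximum value.

def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]

fmap = [[1, -3, 2, 4],
        [5, 6, -7, 8],
        [-9, 2, 3, 1],
        [4, 0, -1, 2]]
pooled = max_pool_2x2(relu(fmap))   # a 4x4 map shrinks to 2x2
```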
FC layers are the output layers. After several CONV layers and max
pooling layers, an FC layer connects every neuron in one layer to
every neuron in the next. The flattened matrix goes through the FC
layer for classification.
2.2 CNN Hardware Architecture
Fig. 6 shows an overview of a CNN hardware architecture. Weights are
stored in external memory because the amount of data is huge. The
first CONV layer receives input feature map data and weights from
external memory and then computes MAC operations. The results are
stored in a buffer, which reduces data communication between the CNN
hardware and external memory and, as a result, also reduces power
consumption.
Fig. 6. An Overview of CNN Hardware Architecture
CONV layers are well known to be computation-intensive while FC
layers are memory-intensive [14]. This is why we focus on CONV
layers to reduce computational cost. FC layers, on the other hand,
require large amounts of data from external memory, so for them it
is important to reduce the number of weights.
Fig. 7. A Conventional MAC Structure
A conventional MAC structure is illustrated in Fig. 7. This is the
computation engine for a CONV layer. The two n-bit inputs are
multiplied, producing a 2n-bit result. Here, input a can be regarded
as input feature map data from the image and input b as a weight
value from the kernel. The multiplied result is added to the other
results from the same kernel and accumulated by the adder. Once
every MAC operation is finished, the results pass through the ReLU
and go to the pooling layer for downsizing, as explained in the
previous section.
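A behavioral sketch of this engine follows; the function name and bit widths are illustrative assumptions (n = 8, unsigned), not the exact datapath of Fig. 7:

```python
# One conventional MAC step: an n-bit by n-bit multiply followed by
# accumulation, repeated over a kernel, then ReLU on the final sum.

def mac(acc, a, b, n=8):
    """acc += a * b, with a and b treated as n-bit unsigned values."""
    assert 0 <= a < 2 ** n and 0 <= b < 2 ** n
    product = a * b          # 2n-bit product
    return acc + product     # running accumulation

activations = [12, 0, 255, 7]   # illustrative feature-map values
weights = [3, 200, 1, 9]        # illustrative kernel values
acc = 0
for a, b in zip(activations, weights):
    acc = mac(acc, a, b)
result = max(0, acc)            # ReLU applied after accumulation
```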
Chapter 3. Proposed MAC for CNN
3.1 Exact 4-2 Compressor
Compressors are widely used in computer arithmetic. The most popular
is the 2-1 compressor, also known as the full adder: it receives two
inputs and one Cin from the previous module and produces two
outputs, Sum and Carry. The 4-2 compressor is used even more often.
A basic 8 × 8 Dadda tree multiplier requires 18 4-2 compressors,
three half adders, and three full adders to produce the result, as
explained in detail in the multiplier sections.
Fig. 8. (a) 4-2 Compressor with Two Serially Connected Full Adders
(b) 4-2 Compressor Using XOR and MUX [21]
The conventional exact 4-2 compressor uses two serially connected
full adders, as illustrated in Fig. 8 (a). It counts the number of
'1's among its inputs. The function of the 4-2 compressor is:

Sum + 2 × (Cout + Carry) = X0 + X1 + X2 + X3 + Cin (2)

Here, X0, X1, X2, X3, and Cin are inputs, whereas Sum, Carry, and
Cout are outputs. Cin receives its value from the stage one bit
lower in significance, whereas Cout and Carry are passed to the
stage one bit higher in significance. This means that Cout and Carry
have one bit more weight than Sum. Many previous studies have
therefore focused on how to handle these carries, since they are
related to the cell delay. Fig. 8 (b) illustrates the optimized
exact 4-2 compressor [21], which has four XOR gates and a critical
path delay of 3Δ, assuming an XOR gate has the same delay as an AND
gate. The truth table of an exact 4-2 compressor with its 32 states
is shown in Table II.
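The function (2) can be checked with a small behavioral model. The XOR-MUX wiring below is one standard arrangement consistent with eq. (2) and Table II; it is a sketch of mine, not necessarily the exact netlist of [21]:

```python
from itertools import product

# Exact 4-2 compressor in XOR-MUX form, verified against eq. (2),
# Sum + 2*(Cout + Carry) = X0 + X1 + X2 + X3 + Cin, over all 32 states.

def exact_42(x0, x1, x2, x3, cin):
    s1 = x0 ^ x1
    s2 = x2 ^ x3
    cout = x2 if s1 else x0        # MUX selected by X0 xor X1
    inner = s1 ^ s2                # X0 xor X1 xor X2 xor X3
    carry = cin if inner else x3   # MUX selected by the inner XOR
    s = inner ^ cin                # Sum
    return cout, carry, s

for x0, x1, x2, x3, cin in product((0, 1), repeat=5):
    cout, carry, s = exact_42(x0, x1, x2, x3, cin)
    assert s + 2 * (cout + carry) == x0 + x1 + x2 + x3 + cin
```

The critical path runs through two XOR gates and the final XOR/MUX stage, matching the 3Δ delay stated above.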
Table II The Truth Table of Exact 4-2 Compressor
Cin X0 X1 X2 X3 Cout Carry Sum
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1
0 0 0 1 0 0 0 1
0 0 0 1 1 0 1 0
0 0 1 0 0 0 0 1
0 0 1 0 1 0 1 0
0 0 1 1 0 1 0 0
0 0 1 1 1 1 0 1
0 1 0 0 0 0 0 1
0 1 0 0 1 0 1 0
0 1 0 1 0 1 0 0
0 1 0 1 1 1 0 1
0 1 1 0 0 1 0 0
0 1 1 0 1 1 0 1
0 1 1 1 0 1 0 1
0 1 1 1 1 1 1 0
1 0 0 0 0 0 0 1
1 0 0 0 1 0 1 0
1 0 0 1 0 0 1 0
1 0 0 1 1 0 1 1
1 0 1 0 0 0 1 0
1 0 1 0 1 0 1 1
1 0 1 1 0 1 0 1
1 0 1 1 1 1 1 0
1 1 0 0 0 0 1 0
1 1 0 0 1 0 1 1
1 1 0 1 0 1 0 1
1 1 0 1 1 1 1 0
1 1 1 0 0 1 0 1
1 1 1 0 1 1 1 0
1 1 1 1 0 1 1 0
1 1 1 1 1 1 1 1
Fig. 9. Gate Level Approximate Compressor Design 2 in [24]
Recently, various designs using approximation techniques have been
proposed to reduce the power consumption, delay, and error rate of
4-2 compressors [24]-[26]. In [24], the compressor is simplified by
eliminating Cout; even so, the error rate is only 25%. The idea can
also be extended to higher numbers of inputs and outputs, such as
5-3, 6-2, 7-2, and 15-4 compressors [21]-[24], [27]. Many previous
studies reduce power consumption as well as delay. However, reducing
area is also important, and the synthesis results of the proposed
approximate compressors are presented in Section 4.
3.2 Proposed 4-2 Compressor Design 1
Cin of the current compressor module comes from either Cout or Carry
of the previous module. In this design (PACD 1), Cin equals Carry of
the current module, and Carry becomes Cin of the next module;
therefore, they always have the same value, which is '0'. In Fig. 9,
X1 is the only input used once, while the other inputs are used
twice. Thus, by fixing X1 to '0' for all states, the Cout logic can
be shortened. Note that the probability of a partial-product input
being '1' is 1/4, so the error rate introduced by converting all '1'
values of X1 to '0' is 25%. Moreover, if X1 is set to '0' for all
states, the intermediate result X0 ⊕ X1 becomes X0, which is the
select signal of the MUX in Fig. 8 (b). Therefore, Cout can be
simplified to an AND operation between X0 and X2.

Cout = X0 ∙ X2 if X1 = '0' (3)
Fig. 10. Proposed 4-2 Compressor Design 1
TABLE III TRUTH TABLE OF PACD 1
X0 X1 X2 X3 Cout’ Sum’ Diff. Diff.[24] Prob.
0 0 0 0 0 0 0 1 81/256
0 0 0 1 0 1 0 0 27/256
0 0 1 0 0 1 0 0 27/256
0 0 1 1 0 0 -2 -1 9/256
0 1 0 0 0 0 -1 0 27/256
0 1 0 1 0 1 -1 0 9/256
0 1 1 0 0 1 -1 0 9/256
0 1 1 1 0 0 -3 0 3/256
1 0 0 0 0 1 0 0 27/256
1 0 0 1 0 0 -2 0 9/256
1 0 1 0 1 0 0 0 9/256
1 0 1 1 1 1 0 0 3/256
1 1 0 0 0 1 -1 -1 9/256
1 1 0 1 0 0 -3 0 3/256
1 1 1 0 1 0 -1 0 3/256
1 1 1 1 1 1 -1 -1 1/256
The outputs of the proposed design 1 are listed in Table III. Diff.
in column 7 is the difference, in decimal, between the exact output
value and the approximated output value. Counted naively over the 16
states, the error rate would be 10/16. However, all inputs come from
partial products, so the probability of each row in Table III has to
be re-weighted; for example, if all inputs are '0', the probability
is 81/256. This is reflected in the Prob. column of Table III. Thus,
the actual error rate of the proposed design 1 is 32%. On the
positive side, it yields fewer inaccurate results than other
proposed designs while keeping the critical path delay at 2Δ.
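The 32% figure can be reproduced with a behavioral model of PACD 1 taken from Table III (Cout' = X0·X2, Sum' = X0 ⊕ X2 ⊕ X3, with X1 forced to '0' and Carry dropped), weighting each state by the 1/4 probability that a partial-product bit is '1'. The sketch is mine:

```python
from itertools import product

# PACD 1 behavioral model, per Table III: X1 treated as '0', Carry/Cin
# tied to '0', Cout reduced to one AND gate, Sum to a two-XOR chain.

def pacd1(x0, x1, x2, x3):
    cout = x0 & x2            # eq. (3); X1 assumed '0'
    s = x0 ^ x2 ^ x3          # exact Sum with X1 = 0 and Cin = 0
    return cout, s

P_ONE = 1 / 4                 # probability a partial-product bit is '1'
err = 0.0
for bits in product((0, 1), repeat=4):
    x0, x1, x2, x3 = bits
    cout, s = pacd1(*bits)
    exact = x0 + x1 + x2 + x3
    approx = 2 * cout + s     # Carry is dropped entirely
    prob = 1.0
    for b in bits:
        prob *= P_ONE if b else 1 - P_ONE
    if approx != exact:
        err += prob           # accumulate input-weighted error rate
```

Summing the Prob. column over the erroneous rows of Table III gives the same 82/256, about 32%.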
3.3 Proposed 4-2 Compressor Design 2
Fig. 11 shows the proposed approximate 4-2 compressor design 2. The
Carry of the current stage has the same value as Cin from the
previous stage in 24 out of the 32 states in Table II. Therefore, if
Carry is fixed to the value of Cin, denoted Carry' in Fig. 11, the
error rate of Carry' is just 25% (= 8/32). The difference between
proposed designs 1 and 2 is that the Carry' of design 2 propagates
to the last compressor. In other words, if Cin of the first
compressor is '0', there are no carries; but if Cin of the first
compressor module is '1', it propagates all the way up to the very
last module, whether compressor or adder.
Cout remains exact, since Cout and Carry have one bit more weight
than Sum; if Cout were approximated, the difference between the
erroneous and exact output values would be very large. Sum is
approximated to Sum', as shown in Fig. 11, to reduce the delay and
power of the design on the critical path. The approximated Carry'
and Sum' are expressed in (4). The proposed approximate 4-2
compressor design 2 reduces the number of XOR gates to one by
approximating Sum and Carry. Therefore, this design has a critical
path of 2Δ, which is 1Δ less than the optimized exact design in
Fig. 8 (b).

Carry' = Cin
Sum' = (X0 ⊕ X1) ∙ X3 (4)
Fig. 11. The Proposed Approximate 4-2 Compressor
Table IV shows the truth table of the proposed approximate 4-2
compressor design 2. Diff. in the last column is the difference, in
decimal, between the exact and approximate output values, where the
outputs are those on the left-hand side of (4). If Diff. is
negative, the inaccurate output is smaller than the exact output.
The largest difference between the exact and approximate outputs is
'-2', which occurs in states 4 and 16 of Table IV. Considering input
values drawn from partial products, their input probabilities are
only 9/512 and 1/512, respectively. Thus, the impact of errors in an
approximate multiplier built with the proposed 4-2 compressors
should be acceptable. The error analysis is given in Section 4.
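A behavioral model of PACD 2, with Carry', Sum', and the exact Cout taken from Table IV, confirms these error bounds; the sketch is mine:

```python
from itertools import product

# PACD 2 behavioral model, per Fig. 11 / Table IV: Carry' = Cin, Sum'
# approximated to one XOR plus an AND, and Cout kept exact.

def pacd2(x0, x1, x2, x3, cin):
    cout = x2 if (x0 ^ x1) else x0   # exact Cout, unchanged
    carry = cin                      # Carry' fixed to Cin
    s = (x0 ^ x1) & x3               # approximated Sum'
    return cout, carry, s

diffs = []
for x0, x1, x2, x3, cin in product((0, 1), repeat=5):
    cout, carry, s = pacd2(x0, x1, x2, x3, cin)
    exact = x0 + x1 + x2 + x3 + cin
    approx = s + 2 * (cout + carry)
    diffs.append(approx - exact)     # the Diff. column of Table IV
```

The worst negative difference is -2 (the two low-probability states noted above) and the largest positive difference is +1, matching the Diff. column of Table IV.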
Table IV Truth Table of PACD 2
Cin X0 X1 X2 X3 Cout' Carry' Sum' Diff. Diff.[24] Prob.
0 0 0 0 0 0 0 0 0 1 81/512
0 0 0 0 1 0 0 0 -1 0 27/512
0 0 0 1 0 0 0 0 -1 0 27/512
0 0 0 1 1 0 0 0 -2 -1 9/512
0 0 1 0 0 0 0 0 -1 0 27/512
0 0 1 0 1 0 0 1 -1 0 9/512
0 0 1 1 0 1 0 0 0 0 9/512
0 0 1 1 1 1 0 1 0 0 3/512
0 1 0 0 0 0 0 0 -1 0 27/512
0 1 0 0 1 0 0 1 -1 0 9/512
0 1 0 1 0 1 0 0 0 0 9/512
0 1 0 1 1 1 0 1 0 0 3/512
0 1 1 0 0 1 0 0 0 -1 9/512
0 1 1 0 1 1 0 0 -1 0 3/512
0 1 1 1 0 1 0 0 -1 0 3/512
0 1 1 1 1 1 0 0 -2 -1 1/512
1 0 0 0 0 0 1 0 1 0 81/512
1 0 0 0 1 0 1 0 0 -1 27/512
1 0 0 1 0 0 1 0 0 -1 27/512
1 0 0 1 1 0 1 0 -1 -2 9/512
1 0 1 0 0 0 1 0 0 -1 27/512
1 0 1 0 1 0 1 1 0 -1 9/512
1 0 1 1 0 1 1 0 1 -1 9/512
1 0 1 1 1 1 1 1 1 -1 3/512
1 1 0 0 0 0 1 0 0 -1 27/512
1 1 0 0 1 0 1 1 0 -1 9/512
1 1 0 1 0 1 1 0 1 -1 9/512
1 1 0 1 1 1 1 1 1 -1 3/512
1 1 1 0 0 1 1 0 1 -2 9/512
1 1 1 0 1 1 1 0 0 -1 3/512
1 1 1 1 0 1 1 0 0 -1 3/512
1 1 1 1 1 1 1 0 -1 -2 1/512
3.4 Unsigned Dadda Tree Multiplier
The proposed approximate 4-2 compressor is applied to the typical
unsigned 8 × 8 Dadda tree multiplier in Fig. 12. The partial
products, shown as black circles in Fig. 12, come from two-input AND
gates; an n × n multiplier generates n rows of partial products.
Eight 4-2 compressors are needed to reduce the partial products in
step 1, and another ten 4-2 compressors are required in step 2 to
reach the final addition. Moreover, three half adders and three full
adders, not illustrated in Fig. 12, are needed. The approximate
compressors are applied to the least significant n bits, and the
optimized exact compressors are used for the remaining n bits. For
the proposed approximate compressor design 1, Carry is ignored and
eliminated, so there is no carry propagation in step 1; the carries
generated in the approximate compressors, i.e., Cout', move to step
2. In the same way, Carry from the approximate compressors in step 2
is ignored, so there is no carry propagation there either.
Approximate and exact compressors are shown as dashed and solid
boxes, respectively, in Fig. 12. This mixed structure reduces the
error distance (ED) between exact and erroneous outputs, a
methodology also used in [24], [25].
Fig. 12. 8 × 8 Dadda Tree Multiplier using PACD 1
3.5 Signed Modified Baugh Wooley Multiplier
The proposed approximate 4-2 compressors are applied to the typical
Baugh-Wooley multiplier for signed multiplications. A typical signed
8 × 8 Baugh-Wooley multiplication is shown in Fig. 13. A black solid
dot is a partial product from the AND of two operand bits; a white
dot is the complement (bar) of a partial product: for example, if
the original partial product is '1', its bar is '0'. The partial
products are then reduced using half and full adders to get the
final result. To reduce the computational cost of the multiplier,
exact 4-2 compressors are implemented in place of adders, in the
same way as in Fig. 13. The difference from Fig. 12 is the carry
propagation of the proposed compressor design 2. As illustrated, Cin
equals Carry', so there is no carry propagation among instances of
the proposed compressor design 2. However, if the next module is an
exact compressor or an adder, the carry has to be propagated to it:
if Cin of the very first module of the proposed design 2 is '0', the
behavior is the same as in Fig. 12; otherwise, the carry of '1' has
to be propagated to the exact compressor.
Fig. 13. 8 × 8 Signed Baugh Wooley Dadda Tree Multiplier using
PACD 2
3.6 Signed Radix-4 Booth Multiplier
The Booth multiplier is a well-known algorithm for calculating
multiplications efficiently [10]-[12]. The radix-4 Booth algorithm
is used in the proposed multiplier. The second multiplicand is
encoded according to Table V, so the number of initial partial
product groups is reduced by half; for an 8 × 8 multiplication,
there are four initial partial product groups.
Table V Radix-4 Booth Encoding

Group   Partial product
000      0
001     +1 × multiplicand
010     +1 × multiplicand
011     +2 × multiplicand
100     −2 × multiplicand
101     −1 × multiplicand
110     −1 × multiplicand
111      0
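Table V can be exercised with a short behavioral sketch. The code below is illustrative, not from this paper: the lookup table and the bit-slicing reflect the usual radix-4 grouping of bits (b[2i+1], b[2i], b[2i-1]) with b[-1] = 0.

```python
# Radix-4 Booth digits from Table V, keyed by the 3-bit group.
BOOTH_DIGIT = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_multiply(a, b, bits=8):
    """Multiply two signed `bits`-bit integers via radix-4 Booth
    encoding, producing bits/2 partial product groups."""
    ub = b & ((1 << bits) - 1)            # two's-complement pattern of b
    product = 0
    for i in range(bits // 2):
        # Select bits (2i+1, 2i, 2i-1) of b; the << 1 supplies b[-1] = 0.
        group = ((ub << 1) >> (2 * i)) & 0b111
        product += BOOTH_DIGIT[group] * a * (4 ** i)
    return product
```

For an 8 × 8 multiplication the loop runs four times, matching the four initial partial product groups noted above.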
Fig. 14. 8 × 8 Signed Radix-4 Booth Multiplier using PACD 2
The application of PACD 2 to the radix-4 Booth multiplier is illustrated in Fig. 14. In step 1, only full and half adders are used to reduce the circuitry; carries are not propagated within the same step but are instead moved to the next step, which decreases the delay. In step 2, 4-2 compressors are used to reduce the circuitry further. In the last step, a carry-propagate adder produces the final result.
3.7 A Modified MAC
The typical MAC illustrated in Fig. 7 multiplies one input datum by one weight. To reduce DSP usage in an FPGA, [29] proposed a double MAC. However, the double MAC is suitable only for small bit widths: if a multiplication exceeds the 27 × 18 bits supported by a DSP48E2 slice, two DSPs, or one DSP plus LUTs, are required. This paper therefore suggests a modified double MAC for efficient utilization of FPGA resources.
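The packing idea behind the double MAC of [29] can be sketched for the unsigned case. The code below is a simplified illustration with hypothetical bit widths; the signed variant used in practice requires correction terms that are omitted here.

```python
def double_mac_unsigned(x, w1, w2, wbits=4, xbits=8):
    """Compute w1*x and w2*x with ONE wide multiplication by packing
    both small weights into disjoint bit fields of a single operand."""
    assert x < (1 << xbits) and w1 < (1 << wbits) and w2 < (1 << wbits)
    shift = wbits + xbits                 # w2*x always fits below this bit
    packed = (w1 << shift) | w2           # one wide operand holding both weights
    p = x * packed                        # the single multiplication
    return p >> shift, p & ((1 << shift) - 1)   # (w1*x, w2*x)
```

As long as the packed operand and the input stay within the 27 × 18 operand widths of a DSP48E2, both products come out of one DSP; beyond that, as noted above, two DSPs or a DSP-plus-LUT mapping becomes necessary.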
Fig. 15. (a) Four 2 × 2 Kernel Convolution (b) A Modified MAC
Structure.
Fig. 15 shows the modified MAC structure. Fig. 15 (a) is an example of four 2 × 2 kernel convolutions with a stride of 1. A conventional MAC uses only DSPs, with one kernel computed per DSP.
Therefore, four DSPs in total are required to compute the four 2 × 2 kernel CONV layer. For deeper convolutions, which are used for high-accuracy classification or recognition, the number of DSPs in an FPGA may not be sufficient to cover all of the MAC operations in the CONV layers, so the FPGA tool automatically converts the remaining MAC operations to LUTs when the design is synthesized. Letting the tool convert DSPs to LUTs automatically is not very efficient. Multipliers such as Booth and Baugh Wooley, however, are well optimized for area, power consumption and delay, and they can be optimized further with approximate compressors.

In this paper, the approximate compressor designs are applied to the Booth and Baugh Wooley multipliers, which are implemented in the modified MAC for high performance and efficient utilization of FPGA resources.
3.8 VDSR Hardware Structure
CNNs serve many applications. In this paper, the modified MAC is implemented in VDSR as shown in Fig. 16. An input image from the host PC is 256 × 256 pixels and passes through 12 CONV layers on the FPGA board. The inputs are 14 bits wide, so the multipliers described above are extended to 14 bits. Moreover, as in most CNNs, the outputs of the MAC operations are truncated to 14 bits. The input image and the weight data are stored in DRAM on the FPGA board. The ARM CPU on the board receives commands from the host, such as transferring data and running the CNN. When the host issues a ‘Run’ command, the FPGA starts running the CNN for super-resolution. The intermediate data from each CONV layer are stored in on-chip SRAM, which acts as the CNN's buffer. The output is a 512 × 512 super-resolution image.
Fig. 16. VDSR Hardware Structure
Chapter 4. Evaluation Results
4.1 Approximate Compressors
The two proposed compressors of this paper and the exact 4-2 compressor [21] are implemented in Verilog and synthesized with Synopsys Design Compiler using the TSMC 65 nm standard cell library. The results for delay, power consumption and power-delay product (PDP) are summarized in Table VI. The proposed approximate compressor design 2 (PACD 2) shows 52% and 72% improvements in delay and power consumption, respectively, because its critical path is shorter by 1Δ and its gate count is lower; both effects reduce the delay and power consumption of the designs.
Table VI Compressors Synthesized Results

Design             Area (μm²)  Delay (ns)  Power (μW)  PDP    ADP    APP
Exact Design [21]    24.4        0.23        5.65      1.30   5.61   137.85
Design 2 [24]        14.8        0.09        1.63      0.15   1.33    24.10
PACD 1               14.4        0.10        1.93      0.19   1.44    27.85
PACD 2               10.8        0.11        1.57      0.17   1.19    16.97
4.2 Approximate Compressors in Multipliers
Next, the performance of multipliers built with the proposed compressors is evaluated. The implementation and synthesis results are compared with those obtained with the previous compressor designs in [21], [24]. The proposed compressors are applied to the unsigned 8 × 8 Dadda tree multiplier for a fair comparison with the exact design and prior work. In addition, the proposed compressor design 2 is applied to the Booth and Baugh Wooley multipliers to check for any dependency on the multiplier architecture.
Table VII Synthesized Results of Different Multiplier Designs

Design             Multiplier   Area (μm²)  Delay (ns)  Power (μW)  PDP (10^2)  ADP (10^3)  APP (10^5)
Exact Design [21]  Dadda Tree     772.8       1.31        205.73       2.70        1.01        1.59
Design 2 [24]      Dadda Tree     686.4       1.04        155.34       1.62        0.71        1.07
PACD 1             Dadda Tree     682.8       1.07        151.73       1.62        0.73        1.04
PACD 2             Dadda Tree     574.0       1.04        143.98       1.50        0.60        0.83
Exact Design [21]  Booth          986.0       2.21        254.57       5.63        2.18        2.51
Design 2 [24]      Booth          947.6       1.76        227.58       4.01        1.67        2.16
PACD 1             Booth          946.0       1.76        231.19       4.07        1.66        2.19
PACD 2             Booth          931.0       1.76        219.68       3.87        1.64        2.05
Exact Design [21]  BW             772.4       1.32        213.36       2.82        1.02        1.65
Design 2 [24]      BW             695.6       1.19        166.87       1.99        0.83        1.16
PACD 1             BW             692.4       1.19        164.86       1.96        0.82        1.14
PACD 2             BW             663.6       1.19        153.27       1.82        0.79        1.02
PDP, area-delay product (ADP) and area-power product (APP) are shown in Table VII. Compared with the exact design, the multipliers with the proposed approximate compressors are improved in area, delay and power consumption. Area and power consumption are reduced by 25.7% and 45%, respectively, when the proposed design 2 compressor replaces the exact compressor in the Dadda tree multiplier. This is important because, as addressed in Section 1, many multipliers operate in parallel in most applications, so area and power consumption are proportional to the number of multipliers. Compared with the previous work in [24], the proposed compressors achieve power consumption and delay similar to those of Design 1 and Design 2.
However, the area of the proposed compressor design 2 is much smaller than that of Designs 1 and 2. Compared to Design 2, the proposed compressor design 2 improves ADP and APP by 12.7% and 12.3%, respectively. The synthesis results also show differences among the multipliers. The Booth and Baugh Wooley multipliers are compared because both perform signed multiplication and can therefore be implemented in a CNN. Booth shows the most significant improvement in delay, although its area is not reduced: Booth is 36% faster than Baugh Wooley because it has only four partial product groups while Baugh Wooley has eight. However, Baugh Wooley shows better area and power efficiency.
4.3 Error Analysis
The error metrics used in this paper are from [23]. The error distance (ED) is the difference between an exact output and an erroneous output. The normalized error distance (NED) is the ED normalized by the maximum value the erroneous output can take, whereas the mean relative error distance (MRED) is the average of the relative error distance (RED), the ratio of the ED to the accurate output. The error rate is the fraction of inaccurate outputs among all 65,536 input states. The mean square error (MSE) is also used to analyze the error of approximate multipliers. MSE is the key metric for image quality, and the peak signal-to-noise ratio (PSNR) used to compare an original image with a processed one is likewise computed from the MSE.
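These metrics can be computed exhaustively over all 65,536 operand pairs of an 8 × 8 unsigned multiplier. The sketch below follows the definitions of [23]; treating RED as 0 when the exact output is 0 (where RED is undefined) is an assumption of this sketch.

```python
def error_metrics(approx_mul, bits=8):
    """Exhaustive ED/NED/MRED/MSE/error-rate analysis of an
    approximate unsigned multiplier against exact multiplication."""
    n = 1 << bits
    max_out = (n - 1) ** 2            # largest value an output can take
    ed_sum = red_sum = sq_sum = wrong = 0
    for a in range(n):
        for b in range(n):
            exact, approx = a * b, approx_mul(a, b)
            ed = abs(exact - approx)  # error distance
            ed_sum += ed
            sq_sum += ed * ed
            wrong += (ed != 0)
            if exact:                 # RED = ED / exact output
                red_sum += ed / exact
    total = n * n                     # 65,536 states for 8 x 8
    return {"MSE": sq_sum / total,
            "MRED": red_sum / total,
            "NED": (ed_sum / total) / max_out,
            "error_rate": wrong / total}
```

Passing an exact multiplier yields zeros for every metric, which is a convenient sanity check before evaluating an approximate design.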
As presented in Table VIII, the proposed PACD 2 has the lowest MSE among the approximate compressors for all multiplier types. Moreover, the PACDs show lower MRED than Design 2 in [24] because the PACDs are 100% accurate when the exact output is ‘0’, whereas Design 2 is not. This is critical for some CNN applications, such as super-resolution. The network in [28] uses twelve convolution layers with a rectified linear unit (ReLU) activation function immediately after each convolution layer. If the outputs of a layer are ‘0’ or negative, the inputs to the next layer will be ‘0’ after ReLU. Therefore, if the approximate outputs are nonzero where the exact values are ‘0’, the ED grows as data passes through successive convolution layers.
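This effect can be seen with a toy example (illustrative values only): an approximation that stays exact at zero keeps ReLU outputs silent, while one that perturbs zeros activates them spuriously and feeds the error forward.

```python
def relu(vec):
    # ReLU passes positive values and clamps the rest to zero.
    return [max(0, v) for v in vec]

exact_outputs = [0, 0, -3, 5]
approx_safe   = [0, 0, -3, 6]   # errs only where exact is nonzero
approx_unsafe = [1, 1, -3, 6]   # errs where the exact output is 0

print(relu(exact_outputs))  # [0, 0, 0, 5]
print(relu(approx_safe))    # [0, 0, 0, 6]  error stays local
print(relu(approx_unsafe))  # [1, 1, 0, 6]  zeros become active inputs
```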
Table VIII Results of Error Analysis

Compressor  Multiplier   MSE     MRED    NED     Error rate
Design 2    Dadda Tree   1.1825  0.0522  0.0023  52.3%
PACD 1      Dadda Tree   1.2192  0.0489  0.0036  45.8%
PACD 2      Dadda Tree   1.0918  0.0418  0.0031  63.3%
Design 2    Booth        0.4118  0.0651  0.0065  40.5%
PACD 1      Booth        0.4154  0.0662  0.0063  33.8%
PACD 2      Booth        0.2865  0.0031  0.0045  28.6%
Design 2    BW           0.8506  0.1227  0.0112  63.2%
PACD 1      BW           0.9734  0.1090  0.0116  67.8%
PACD 2      BW           0.7922  0.1232  0.0109  63.0%
4.4 Multiplier Comparison in CNN application
The modified MAC is implemented in a CNN application, VDSR in this paper. Fig. 17 shows the VDSR experimental environment. The host PC runs Visual Studio on Windows; it quantizes the input image and the weight data and transfers them to the FPGA, communicating through the Xilinx Vivado SDK 2017.3 tool. The FPGA stores the image and weights in DRAM. Once all of the data are available, the host sends a command to run the CNN on the FPGA. VDSR uses 14 × 14-bit multiplication, where each 14-bit input consists of 1 sign bit, 3 integer bits and 10 fraction bits. At first, the approximate compressors are applied to half of the partial products, that is, 14 bits.
The remaining experiments vary the multiplier, the compressor and the number of bits to which approximation is applied. An approximate compressor is applied to 10 of the 28 output bits, which corresponds to half of the fraction bits.
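The 1-sign/3-integer/10-fraction input format can be sketched as follows. This is illustrative only: round-to-nearest with saturation is an assumption, since the exact rounding and clamping behavior of the hardware is not specified here.

```python
FRAC_BITS = 10                        # Q3.10: 1 sign + 3 int + 10 frac
TOTAL_BITS = 14
Q_MIN = -(1 << (TOTAL_BITS - 1))      # -8192, i.e. -8.0
Q_MAX = (1 << (TOTAL_BITS - 1)) - 1   #  8191, i.e. just under +8.0

def quantize(x):
    """Real value -> 14-bit signed fixed-point integer (saturating)."""
    q = round(x * (1 << FRAC_BITS))   # scale by 2^10 and round
    return max(Q_MIN, min(Q_MAX, q))

def dequantize(q):
    """14-bit fixed-point integer -> real value."""
    return q / (1 << FRAC_BITS)
```

With 10 fraction bits the quantization step is 1/1024, and values outside roughly [-8, 8) saturate at the format limits.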
Fig. 17. VDSR Experimental Environment
Table IX compares the PSNR of various approximate MACs. As expected from the error analysis, PACD 2 on the Booth multiplier shows the highest image quality, 37.6 dB on the Lena image. Typically, if the PSNR is above 30 dB, the image quality is considered good. PACD 1 achieves a higher PSNR than Design 2 even though their MSEs are similar, because Design 2 produces errors when the expected output value is ‘0’. The VDSR output images are illustrated in Fig. 18.
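PSNR follows directly from the MSE discussed earlier; the sketch below assumes the usual 8-bit peak pixel value of 255.

```python
import math

def psnr_db(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB computed from a mean square
    error, relative to the peak pixel value."""
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Halving the MSE raises the PSNR by about 3 dB, which is why the MSE differences in Table VIII translate into the visible PSNR gaps in Table IX.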
Table IX Synthesized in FPGA Results

Multiplier  Compressor  Power (W)  Delay (ns)  PSNR baby (dB)  PSNR Lena (dB)
Normal      -             4.23       16.81          -               -
Booth       Exact         4.23       16.71          -               -
BW          Exact         4.24       16.62          -               -
Booth       PACD 1        4.23       16.05         33.48           33.88
BW          PACD 2        4.23       15.64         31.50           31.91
Booth       PACD 2        4.23       15.84         36.68           37.60
Booth       Design 2      4.24       16.33         27.88           28.18
(a) Exact Design
(b) PACD 1 on Booth (33.48dB) (c) PACD 2 on BW (31.5dB)
(d) PACD 2 on Booth (36.68dB) (e) Design 2 [24] on Booth
(27.88dB)
Fig. 18. VDSR Image Results
Chapter 5. Conclusion
The two novel approximate 4-2 compressors are applied to the Booth multiplier and show the best performance in super-resolution image quality as measured by PSNR. The overall experimental results are summarized in Table X. BW multipliers give better area, delay and power consumption than the Booth multiplier, but the super-resolution images produced with BW multipliers do not meet the quality target: BW multipliers have more partial product groups and therefore require more compressors to reduce the circuitry to the final result.
Table X Summary of Experiment Results

Multiplier  Compressor  Area (μm²)  Delay (ns)  Power (μW)  MSE     PSNR (dB)
Booth       Design 2      947.6       1.76        227.58    0.4118    28.18
Booth       PACD 1        946.0       1.76        231.19    0.4154    33.88
Booth       PACD 2        931.0       1.76        219.68    0.2865    37.60
The large difference between the PSNRs of Design 2 [24] and PACD 1, even though they have similar MSEs, arises because Design 2 has errors when the output value is expected to be ‘0’. This is important because most convolution layers use ReLU as the activation function: when the results of the MAC operations pass through ReLU, negative values become ‘0’ while positive values remain as they are.

PACD 2 on Booth also shows the smallest area and power consumption among the approximate compressors because PACD 2 has the lowest logic gate count.
Chapter 6. Discussion
State-of-the-art convolutional neural networks (CNNs) have more convolution layers than in the past in order to increase classification and recognition accuracy. Much research has focused on shrinking the network to save computational cost while keeping accuracy high; it is equally important to optimize the convolution layer itself. This paper proposes approximate computing with novel 4-2 compressors and applies them to the Baugh Wooley and Booth multipliers. Moreover, a multiply-and-accumulate (MAC) unit is modified for highly efficient utilization of field-programmable gate array (FPGA) resources. In addition, the modified MACs are implemented in a very deep convolutional network for image super-resolution (VDSR), which has not been attempted in previous work. As a result, the proposed approximate compressors show 50% less delay and 68% less power consumption than the exact design, and the modified MAC shows 10% and 11% improvements in APP and ADP, respectively, according to Synopsys Design Compiler synthesis results. Overall, delay and power consumption improve while the PSNR relative to the original image remains meaningful. In this paper, two images are compared against the accurate MAC; both are among the most widely used images for image-quality comparison.
Bibliography
[1] Y. LeCun et al., "Backpropagation Applied to Handwritten Zip
Code Recognition," in Neural Computation, vol. 1, no. 4, pp.
541-551, Dec. 1989.
[2] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1998, pp. 255-258.
[3] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 2, pp. 1097-1105, 2012.
[4] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[5] J. Qiu et al., “Going deeper with embedded fpga platform for
convolutional neural network,” in FPGA. ACM, 2016, pp. 26–
35.
[6] S. Kestur, J. Davis and O. Williams, “BLAS comparison on FPGA, CPU and GPU,” in Proc. IEEE Annual Symposium on VLSI (ISVLSI), 2010, pp. 288-293.
[7] E. Nurvitadhi, J. Sim, D. Sheffield, et. al., “ Accelerating
recurrent neural networks in analytics servers: Comparison of
FPGA, CPU, GPU, and ASIC,” Field Programmable Logic and
Applications (FPL), 2016.
[8] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach,
"Accelerating Compute-Intensive Applications with GPUs and
FPGAs," 2008 Symposium on Application Specific Processors,
Anaheim, CA, 2008, pp. 101-107.
[9] E. Nurvitadhi, G. Venkatesh, J. Sim, et. al., “Can FPGAs Beat
GPUs in Accelerating Next-Generation Deep Neural
Networks?” International Symposium on Field-Programmable
Gate Arrays (ISFPGA), 2017.
[10] K.-J. Cho, K.-C. Lee, J.-G. Chung and K. K. Parhi, "Design of low-error fixed-width modified booth multiplier," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 5, pp. 522-531, May 2004.
[11] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, “Low-error fixed-width modified Booth multiplier,” in Proc. IEEE Workshop on Signal Processing Systems, San Diego, CA, Oct. 2002, pp. 45-50.
[12] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han and F. Lombardi,
"Design of Approximate Radix-4 Booth Multipliers for Error-
Tolerant Computing," in IEEE Transactions on Computers, vol.
66, no. 8, pp. 1435-1441, 1 Aug. 2017.
[13] S. R. Chowdhury, A. Banerjee, A. Roy and H. Saha, "Design,
Simulation and Testing of a High Speed Low Power 15-4
Compressor for High Speed Multiplication Applications," 2008
First International Conference on Emerging Trends in
Engineering and Technology, Nagpur, Maharashtra, 2008, pp.
434-438.
[14] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach,
“Accelerating Compute-Intensive Applications with GPUs and
FPGAs,” 2008 Symposium on Application Specific Processors,
Anaheim, CA, 2008, pp. 101-107.
[15] X. Lin, C. Zhao and W. Pan, “Towards accurate binary convolutional neural network,” in Proc. NIPS, 2017, pp. 344-352.
[16] S. Venkatachalam, E. Adams, H. J. Lee and S. Ko, “Design
and Analysis of Area and Power Efficient Approximate Booth
Multipliers,” in IEEE Transactions on Computers, vol. 68, no. 11,
pp. 1697-1703, 1 Nov. 2019.
[17] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris and K.
Pekmestzi, “Design-Efficient Approximate Multiplication
Circuits Through Partial Product Perforation,” in IEEE
Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 24, no. 10, pp. 3105-3117, Oct. 2016.
[18] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro and N.
Petra, “Approximate Multipliers Based on New Approximate
Compressors,” in IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 65, no. 12, pp. 4169-4182, Dec. 2018.
[19] P. Kulkarni, P. Gupta and M. Ercegovac, “Trading Accuracy for Power with an Underdesigned Multiplier Architecture,” 2011 24th International Conference on VLSI Design, Chennai, 2011, pp. 346-351.
[20] C. Chang and R. K. Satzoda, “A Low Error and High
Performance Multiplexer-Based Truncated Multiplier,” in IEEE
Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 18, no. 12, pp. 1767-1771, Dec. 2010.
[21] C.-H. Chang, J. Gu and M. Zhang, “Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits,” in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, pp. 1985-1997, Oct. 2004.
[22] X. Yi, H. Pei, Z. Zhang, H. Zhou and Y. He, “Design of an
Energy-Efficient Approximate Compressor for Error-Resilient
Multiplications,” 2019 IEEE International Symposium on Circuits
and Systems (ISCAS), Sapporo, Japan, 2019, pp. 1-5.
[23] J. Liang, J. Han and F. Lombardi, “New Metrics for the
Reliability of Approximate and Probabilistic Adders,” in IEEE
Transactions on Computers, vol. 62, no. 9, pp. 1760-1771, Sept.
2013.
[24] A. Momeni, J. Han, P. Montuschi and F. Lombardi, “Design
and Analysis of Approximate Compressors for Multiplication,” in
IEEE Transactions on Computers, vol. 64, no. 4, pp. 984-994,
April 2015.
[25] Z. Yang, J. Han and F. Lombardi, “Approximate
compressors for error-resilient multiplier design,” 2015 IEEE
International Symposium on Defect and Fault Tolerance in VLSI
and Nanotechnology Systems (DFTS), Amherst, MA, 2015, pp.
183-186.
[26] W. Ma and S. Li, “A new high compression compressor for large multiplier,” 2008 9th International Conference on Solid-State and Integrated-Circuit Technology, Beijing, 2008, pp. 1877-1880.
[27] R. Marimuthu, Y. E. Rezinold and P. S. Mallick, “Design and
Analysis of Multiplier Using Approximate 15-4 Compressor,” in
IEEE Access, vol. 5, pp. 1027-1036, 2017.
[28] D. Lee, S. Lee, H. S. Lee, H. Lee and K. Lee, “Context-
Preserving Filter Reorganization for VDSR-Based Super-
resolution,” 2019 IEEE International Conference on Artificial
Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan,
2019, pp. 107-111.
[29] D. Nguyen, D. Kim and J. Lee, "Double MAC: Doubling the
performance of convolutional neural networks on modern
FPGAs," Design, Automation & Test in Europe Conference &
Exhibition (DATE), 2017, Lausanne, 2017, pp. 890-893.
[30] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networks with Bernoulli approximate variational inference,” arXiv preprint arXiv:1506.02158, 2015.
Abstract
Recent trends in convolutional neural networks (CNNs) show more convolution layers than in the past in order to raise recognition and classification accuracy. To reduce the amount of computation while keeping accuracy high, some work has tried to shrink the size of the network itself, while other work has applied approximate computing within a single network. This paper devises a new 4-2 compressor and proposes an approximate computing method that applies it to the well-known Baugh Wooley and Booth multipliers. A convolution layer consists of multiplications whose results are accumulated by addition, an operation known as a MAC. The multiplier with the proposed approximate compressor is applied to the MAC, and the MAC is modified for efficient allocation of FPGA resources. As a result, compared with the conventional exact compressor, delay and power improve by 50% and 68%, respectively. Furthermore, when compared at the MAC level, APP and ADP decrease by 10% and 11%, respectively. Finally, the MAC is applied to the VDSR hardware and the resulting super-resolution images are verified.