Attribution-NonCommercial 2.0 South Korea: You are free to copy, distribute, transmit, display, perform, and broadcast this work, and to create derivative works, provided you follow these conditions: when reusing or distributing the work, you must clearly indicate the license terms applied to it. These conditions do not apply if you obtain separate permission from the copyright holder. Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code. Disclaimer. Attribution: you must credit the original author. NonCommercial: you may not use this work for commercial purposes.



Project Report of Master of Engineering

Design of a Low-Power and High-Performance MAC for CNNs

February 2020

Graduate School of Engineering Practice

Seoul National University

Department of Engineering Practice

Seungwan Baek


Design of a Low-Power and High-Performance MAC for CNNs

Prof. Hyuk-Jae Lee

Submitted as a Master's Project Report

February 2020

Graduate School of Engineering Practice

Seoul National University

Department of Engineering Practice

Seungwan Baek

Confirming the Master's Project Report written by Seungwan Baek

February 2020

Chair Cheol Seong Hwang (Seal)

Examiner Hyuk-Jae Lee (Seal)

Examiner Woo-Young Kwak (Seal)


Abstract

State-of-the-art Convolutional Neural Networks (CNNs) have more convolution layers than those in the past, to increase the accuracy of classification and super-resolution. Much research has focused on reducing network size to save computational cost while keeping accuracy high, and on optimizing the convolution layer itself to reduce its computational cost. This paper proposes approximate computing using novel 4-2 compressors and applies them to Baugh Wooley and Booth multipliers. Convolution layers in CNNs consist of multiply-and-accumulate (MAC) operations, and we applied the approximate multipliers in a modified MAC for highly efficient Field Programmable Gate Array (FPGA) resource utilization. As a result, the proposed approximate compressors show 11.5% less area-delay product (ADP) and 29.6% less area-power product (APP) than the previous design. Finally, the modified MAC is implemented in VDSR hardware to compare output images with a reference, achieving 37.6 dB with PACD 2 on the Booth multiplier.

Keywords: CNN; FPGA; MAC; Multiplier; Approximate Compressor

Student Number: 2018-28454


Table of Contents

Chapter 1. Introduction
  1.1 Study Background
Chapter 2. Backgrounds
  2.1 CNN Architecture
  2.2 CNN Hardware Architecture
Chapter 3. Proposed MAC for CNN
  3.1 Exact 4-2 Compressor
  3.2 Proposed 4-2 Compressor Design 1
  3.3 Proposed 4-2 Compressor Design 2
  3.4 Unsigned Dadda Tree Multiplier
  3.5 Signed Modified Baugh Wooley Multiplier
  3.6 Signed Radix-4 Booth Multiplier
  3.7 A Modified MAC
  3.8 VDSR Hardware Structure
Chapter 4. Evaluation Results
  4.1 Approximate Compressors
  4.2 Approximate Compressors in Multipliers
  4.3 Error Analysis
  4.4 Multiplier Comparison in CNN Application
Chapter 5. Conclusion
Chapter 6. Discussion


Chapter 1. Introduction

1.1 Study Background

Convolutional Neural Networks (CNNs) were invented by Y. LeCun et al. [1, 2] and have recently become more popular because they are stronger at image recognition, classification, speech recognition, and similar tasks than other deep learning algorithms. Their operation can be divided into two parts. The first part is training: for example, in order to classify images, the network is trained on the features of the existing images in a database. The second part is classification: the network analyzes the features of a new input image, tries to match them with the training data, and then classifies the image. We explain CNNs in more detail in the next section.

It is important to classify images accurately, and in order to increase recognition accuracy, a CNN needs more convolution layers. For example, AlexNet in Fig. 1, which has five convolution layers, won first prize at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, achieving a top-5 error of 15.3% [3], whereas ResNet, which won ILSVRC 2015, expanded the number of layers to 152 and achieved a top-5 error of 5.71% [4]. Table I shows the top-5 error rates at the ILSVRC between 2012 and 2015; the error decreased dramatically as the number of layers grew. This suggests that increasing the number of convolution layers is essential for high accuracy in a CNN. However, when CNNs are implemented on hardware such as CPUs, GPUs, and Field Programmable Gate Arrays (FPGAs), the hardware cost increases in proportion to the number of layers, because convolution layers are computation intensive whereas fully connected (FC) layers are memory intensive [5]. In order to reduce the computational cost of CNNs, various approaches have been studied. Y. Gal et al. [30] proposed a Bayesian CNN for small data sets, obtained by placing a probability distribution over the CNN's kernels. X. Lin et al. [15] studied binary CNNs, in which the weight and bias values are constrained to {-1, +1} at run-time, which reduces memory size and computational cost dramatically.

Convolution layers in hardware consist of multipliers and adders, which together perform multiply-and-accumulate (MAC) operations. A large number of layers requires long computation time as well as many hardware resources; therefore, the best trade-off between accuracy and time/cost must be found.

Fig. 1. The AlexNet Architecture

Table I. ILSVRC Top-5 Error Rates from 2012 to 2015

Network                 Error Rate (%)  Depth (Number of Layers)
AlexNet (ILSVRC '12)    15.3            8
VGG (ILSVRC '14)        7.3             19
GoogLeNet (ILSVRC '14)  6.7             22
ResNet (ILSVRC '15)     5.71            152

CPUs and GPUs are widely used hardware systems for deep learning. Both are performance-oriented hardware architectures [6] and therefore consume a lot of power; in other words, their energy-to-performance efficiency is not good. The hardware accelerator is one of the typical implementation types for deep learning, alongside the CPU and GPU. The off-chip hardware accelerator architecture is illustrated in Fig. 2: the CPU sends operation commands to the hardware accelerator, which communicates with the memory directly to perform the tasks the CPU orders. While the hardware accelerator is working on those tasks, the CPU can perform other tasks.


The hardware accelerator can be realized as various types of hardware; application-specific integrated circuits (ASICs) and FPGAs are the most popular. Using an ASIC as the hardware accelerator is more efficient than using an FPGA, but the FPGA offers more flexibility than the ASIC, since many different types of applications can be implemented on the same FPGA. For these reasons, FPGA environments are becoming popular due to their superior energy-to-performance efficiency [7, 8]; moreover, recent research shows that their raw performance is also comparable with CPUs and GPUs [9]. However, very limited resources are the main weak point of FPGA systems. For MAC operations, Digital Signal Processing (DSP) and Look-Up Table (LUT) blocks can be used in an FPGA; usually, only DSPs are used for an a × b multiplication. The number of MAC operations is equivalent to the number of DSPs used in the FPGA, since the operations run in parallel. D. Nguyen et al. [29] proposed the double MAC, which can reduce the number of DSPs by half; however, it is only useful when the number of bits is small, such as a 4 × 4 multiplication. If a multiplication exceeds 27 × 18 (the width of the DSP48E2 in the Xilinx ZCU102 FPGA board), two DSPs, or one DSP and extra LUTs, are required for the calculation. Z. Zhang et al. [30] proposed increasing the number of MAC operations performed at the same time. When the DSP resources are not enough to cover all the multiplications required in the network, LUTs are used for the rest of the multiplications, but they are not used efficiently.

Fig. 2. Off-Chip Hardware Accelerator Architecture


A multiplier consists of three parts: partial product generation, partial product reduction, and final addition using carry-propagate adders. Approximation can be applied in any of these steps. For example, in [16] an approximation technique is implemented in the partial product generation stage. Partial product perforation, which skips the generation of some partial products, is considered in [17]; the authors implemented the technique in various multipliers and compared power consumption and error rates. Meanwhile, in the partial product reduction step, different types of adders are used, such as half/full adders and 4-2 compressors, and a large portion of the previous work on approximate multipliers has focused on this step. Esposito et al. [18] proposed an approximate half adder that simplifies the 2-1 compressor (half adder) to an OR gate; with only 1/16 error probability, the full adder can likewise be simplified to just one AND gate and two OR gates. 4-2 compressors are also commonly used, since they reduce the partial products faster than half/full adders. Another approach to approximate multipliers is introduced in [19], which divides a large multiplication into small blocks, such as 2 × 2 multiplier blocks, and builds adjustable multipliers from them. Truncation is also widely studied: C. Chang et al. [20] used a multiplexer-based truncated array multiplier to reduce power and area.

There is a trade-off between accuracy and computational cost, such as power consumption and area, in multiplier implementation, and it is imperative to find the optimal point between them. In this paper, low-power and area-efficient multipliers are presented by proposing and applying new approximate 4-2 compressors. Moreover, the conventional MAC is modified to calculate two operations at the same time. VDSR is used as a CNN application to confirm the effects of the MACs.


Chapter 2. Backgrounds

2.1 CNN Architecture

A CNN consists of input and output layers, with hidden layers in between, as shown in Fig. 3. The hidden layers are the key part of a CNN because they determine the accuracy of the outputs. The hidden layers are a series of convolution layers (CONV layers), activation functions such as the rectified linear unit (ReLU), pooling layers, and fully connected (FC) layers.

Fig. 3. A Regular n-layer Neural Network

CONV layers compute on the inputs, weights, and biases; the weights are also known as kernels. As shown in Fig. 4, when the inputs pass through a kernel, they are multiplied by the weights and then accumulated with the biases. These are matrix multiplications, so they can be computed in parallel. The number of outputs equals the number of kernels.

Fig. 4. Convolution Calculation
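To make the computation of Fig. 4 concrete, one output pixel of the m-th feature map can be written as the following sum of multiply-accumulate terms (a standard formulation given here for illustration; the symbols O, I, W, b, C, and K are not from the original figure):

O_m(x, y) = b_m + Σc Σi Σj W_m,c(i, j) × I_c(x + i, y + j),

where c runs over the C input channels and i, j run over the K × K kernel window. Every term of the triple sum is one multiplication followed by an accumulation, which is exactly the MAC operation discussed in Section 2.2.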

The outputs of the kernels pass through activation functions. The ReLU is the typical activation function and is used widely because it reduces the likelihood of the gradient vanishing and it induces sparsity. It sets negative values to zero and keeps positive values, as expressed in (1).

h = max(0, y), where y = wx + b (1)

Another key concept of CNNs is pooling, a non-linear function that reduces the number of features. Max pooling is the most commonly used: it takes the maximum value among the values inside the filter window. Fig. 5 shows how max pooling operates.

Fig. 5. Max Pooling with a 2 × 2 filter and stride = 2

FC layers are the output layers. After several CONV layers and max pooling layers, an FC layer connects every neuron in one layer to every neuron in the next layer, and the flattened matrix goes through the FC layers for classification.

2.2 CNN Hardware Architecture

Fig. 6 shows an overview of the CNN hardware architecture. Weights are stored in external memory because the amount of data is huge. The first CONV layer receives input feature map data and weights from external memory and then computes MAC operations. The results are stored in a buffer, which reduces the data communication between the CNN hardware and the external memory and, as a result, also reduces power consumption.


Fig. 6. An Overview of CNN Hardware Architecture

CONV layers are well known as computation-intensive layers, while FC layers are known as memory-intensive layers [14]. This is why we focus on CONV layers to reduce computational cost. FC layers, on the other hand, require huge amounts of data from external memory, so for them it is important to reduce the number of weights.

Fig. 7. A Conventional MAC Structure

A conventional MAC structure is illustrated in Fig. 7. This is the computation engine for a CONV layer. The two n-bit inputs are multiplied, producing a 2n-bit result. Here, input a can be taken as input feature map data of an image and input b as a weight value from the kernel. The product is added to the other results from the same kernel and accumulated using the adder. Once every operation is finished by the MAC, the results pass through the ReLU and go to the pooling layer for downsizing, as explained in the previous section.
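As a minimal sketch of this structure, the conventional MAC of Fig. 7 can be expressed in Verilog as follows (the port names and the default 8-bit width are assumptions for illustration, not taken from the thesis):

// Conventional MAC: on every cycle, acc <= acc + a*b.
module mac #(parameter N = 8) (
    input  wire                  clk,
    input  wire                  rst,  // synchronous clear of the accumulator
    input  wire signed [N-1:0]   a,    // input feature map value
    input  wire signed [N-1:0]   b,    // kernel weight
    output reg  signed [2*N+7:0] acc   // 2n-bit product plus guard bits
);
    wire signed [2*N-1:0] product = a * b;  // n x n -> 2n-bit multiply
    always @(posedge clk) begin
        if (rst) acc <= 0;
        else     acc <= acc + product;      // accumulate over the kernel
    end
endmodule

The multiplier inside this loop is exactly the block that the approximate compressors of Chapter 3 target.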


Chapter 3. Proposed MAC for CNN

3.1 Exact 4-2 Compressor

A compressor is widely used in computer arithmetic. The most

popular compressor is 2-1 compressor as known as full adder. It

receives two inputs and one Cin from previous module, then gives

two outputs, which are Sum and Carry. The 4-2 compressor is also

used widely and more often. In a basic 8 × 8 Dadda Tree multiplier

structure, there are 18 4-2 compressors, three half adders and

three full adders to get the result. This will be explained in detail in

the multiplier section.

Fig. 8. (a) 4-2 Compressor Built from Two Serially Connected Full Adders; (b) 4-2 Compressor Using XOR and MUX Gates [21]

The conventional exact 4-2 compressor uses two serially connected full adders, as illustrated in Fig. 8 (a). It counts the number of '1's among its inputs. The function of the 4-2 compressor is as follows:

Sum + 2 × (Cout + Carry) = X0 + X1 + X2 + X3 + Cin (2)

Here, X0, X1, X2, X3, and Cin are inputs, whereas Sum, Carry, and Cout are outputs. Cin receives its value from the stage one bit lower in significance, whereas Cout and Carry are passed to the stage one bit higher in significance. This indicates that Cout and Carry have one bit higher weight than Sum. Thus, many previous studies have focused on how to deal with these carries, since they are related to the cell delay. Fig. 8 (b) illustrates the optimized exact 4-2 compressor [21], which has 4 XOR gates and a critical path delay of 3Δ, assuming an XOR gate has the same delay as an AND gate. The truth table of the exact 4-2 compressor, with its 32 states, is shown in Table II.

Table II. The Truth Table of the Exact 4-2 Compressor

Cin X0 X1 X2 X3 Cout Carry Sum

0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 1

0 0 0 1 0 0 0 1

0 0 0 1 1 0 1 0

0 0 1 0 0 0 0 1

0 0 1 0 1 0 1 0

0 0 1 1 0 1 0 0

0 0 1 1 1 1 0 1

0 1 0 0 0 0 0 1

0 1 0 0 1 0 1 0

0 1 0 1 0 1 0 0

0 1 0 1 1 1 0 1

0 1 1 0 0 1 0 0

0 1 1 0 1 1 0 1

0 1 1 1 0 1 0 1

0 1 1 1 1 1 1 0

1 0 0 0 0 0 0 1

1 0 0 0 1 0 1 0

1 0 0 1 0 0 1 0

1 0 0 1 1 0 1 1

1 0 1 0 0 0 1 0

1 0 1 0 1 0 1 1

1 0 1 1 0 1 0 1

1 0 1 1 1 1 1 0

1 1 0 0 0 0 1 0

1 1 0 0 1 0 1 1

1 1 0 1 0 1 0 1

1 1 0 1 1 1 1 0

1 1 1 0 0 1 0 1

1 1 1 0 1 1 1 0

1 1 1 1 0 1 1 0

1 1 1 1 1 1 1 1
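For reference, the XOR-MUX compressor of Fig. 8 (b) can be written as the following Verilog sketch, which reproduces Table II exactly (the signal names are chosen here for illustration):

// Exact 4-2 compressor in XOR-MUX form:
// sum + 2*(cout + carry) == x0 + x1 + x2 + x3 + cin
module compressor_4_2_exact (
    input  wire x0, x1, x2, x3, cin,
    output wire sum, carry, cout
);
    wire s1 = x0 ^ x1;
    wire s2 = x2 ^ x3;
    assign sum   = (s1 ^ s2) ^ cin;       // three XOR levels: the 3-delta critical path
    assign cout  = s1 ? x2 : x0;          // MUX selected by x0 ^ x1
    assign carry = (s1 ^ s2) ? cin : x3;  // MUX selected by the sum parity
endmodule

The 3Δ path runs through the XOR chain to Sum, which is why the approximate designs below concentrate on shortening the Sum logic.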


Fig. 9. Gate Level Approximate Compressor Design 2 in [24]

Recently, various designs using approximation techniques have been proposed to reduce the power consumption, delay, and error rate of 4-2 compressors [24]-[26]. In [24], the compressor is simplified by eliminating Cout; even so, the error rate is only 25%. The approach can also be extended to higher numbers of inputs and outputs, such as 5-3, 6-2, 7-2, and 15-4 compressors [21]-[24], [27]. Many previous studies reduce power consumption as well as delay; however, reducing area is also important, and the synthesized results of the proposed approximate compressors are presented in Section 4.

3.2 Proposed 4-2 Compressor Design 1

The Cin of the current compressor module comes from either the Cout or the Carry of the previous module. In this design, Cin equals the Carry of the current module, and Carry becomes the Cin of the next module; therefore they always have the same value, namely '0'. In Fig. 9, X1 is the only input used once, while the other inputs are used twice. Thus, by fixing X1 to '0' for all states, the Cout logic can be shortened. Note that the probability of an input being '1' is 1/4 for partial-product bits, so the error rate introduced by converting all '1' values of X1 to '0' is 25%. Moreover, if X1 is set to '0' for all states, the intermediate result X0 ⊕ X1 becomes X0, which is the select signal of the MUX in Fig. 8 (b). Therefore, Cout simplifies to an AND operation between X0 and X2 (consistent with Table III):

Cout = X0 ∙ X2, if X1 = '0' (3)


Fig. 10. Proposed 4-2 Compressor Design 1

Table III. Truth Table of the Proposed Approximate Compressor Design 1 (PACD 1)

X0 X1 X2 X3 Cout’ Sum’ Diff. Diff.[24] Prob.

0 0 0 0 0 0 0 1 81/256

0 0 0 1 0 1 0 0 27/256

0 0 1 0 0 1 0 0 27/256

0 0 1 1 0 0 -2 -1 9/256

0 1 0 0 0 0 -1 0 27/256

0 1 0 1 0 1 -1 0 9/256

0 1 1 0 0 1 -1 0 9/256

0 1 1 1 0 0 -3 0 3/256

1 0 0 0 0 1 0 0 27/256

1 0 0 1 0 0 -2 0 9/256

1 0 1 0 1 0 0 0 9/256

1 0 1 1 1 1 0 0 3/256

1 1 0 0 0 1 -1 -1 9/256

1 1 0 1 0 0 -3 0 3/256

1 1 1 0 1 0 -1 0 3/256

1 1 1 1 1 1 -1 -1 1/256

The outputs of the proposed design 1 are listed in Table III. Diff. in column 7 gives the difference, in decimal, between the exact output value and the approximated output value. Naively counted, the error rate would be 10/16. However, all inputs come from partial products, so the probability of each row in Table III has to be re-weighted; for example, if all inputs are '0', the probability is (3/4)^4 = 81/256. This is reflected in the Prob. column of Table III. Thus, the actual error rate of the proposed design 1 is 32%. On the positive side, it produces fewer inaccurate results than the other proposed design while keeping the critical path delay at 2Δ.
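From Table III, design 1 collapses to two XOR gates for Sum' and one AND gate for Cout'. A behavioral Verilog sketch consistent with Table III follows (module and port names are illustrative):

// Proposed approximate 4-2 compressor, design 1 (PACD 1).
// X1 is treated as '0' and Carry is eliminated, leaving
// sum_a + 2*cout_a as an approximation of x0 + x2 + x3.
module pacd1 (
    input  wire x0, x1, x2, x3,  // x1 is intentionally ignored
    output wire sum_a, cout_a
);
    assign sum_a  = x0 ^ x2 ^ x3;  // two XORs: the 2-delta critical path
    assign cout_a = x0 & x2;       // simplified MUX of Fig. 8 (b), per (3)
endmodule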

3.3 Proposed 4-2 Compressor Design 2

Fig. 11 shows the proposed approximate 4-2 compressor design 2. The Carry of the current stage has the same value as the Cin from the previous stage in 24 out of 32 states, as shown in Table II. Therefore, if Carry is fixed to the same value as Cin, denoted Carry' in Fig. 11, the error rate of Carry' is just 25% (= 8/32). The difference between proposed compressor designs 1 and 2 is that the Carry' of design 2 propagates to the last compressor. In other words, if the Cin of the first compressor is '0', there are no carries; but if the Cin of the first compressor module is '1', it propagates all the way up to the very last compressor or adder module.

Cout remains as it is, since Cout and Carry have one bit higher weight than Sum; if Cout were approximated, the difference between the erroneous and exact output values would be very large. Sum is approximated to Sum', as shown in Fig. 11, to reduce the delay and power of the design on the critical path. The approximated Carry' and Sum' are expressed in (4). The proposed approximate 4-2 compressor design 2 reduces the number of XOR gates to one by approximating Sum and Carry. Therefore, this design has a critical path of 2Δ, which is 1Δ less than the optimized exact design in Fig. 8 (b).

Carry' = Cin
Sum' = (X0 ⊕ X1) ∙ X3 (4)

Fig. 11. The Proposed Approximate 4-2 Compressor

Table IV shows the truth table of the proposed approximate 4-2 compressor design 2. The Diff. column represents the difference between the exact and approximate output values in decimal, where the output value is Sum' + 2 × (Cout' + Carry'), i.e., the left-hand side of (2) formed from the approximate outputs. If Diff. is negative, the inaccurate output is smaller than the exact output. The biggest difference between exact and approximate outputs is '-2', which occurs in states 4 and 16 of Table IV. Their input probabilities are only 9/512 and 1/512, respectively, considering that the input values come from partial products. Thus, the impact of errors in an approximate multiplier built from the proposed 4-2 compressors is acceptable. The error analysis is given in Section 4.

Table IV. Truth Table of PACD 2

Cin X0 X1 X2 X3 Cout' Carry' Sum' Diff. Diff.[24] Prob.

0 0 0 0 0 0 0 0 0 1 81/512

0 0 0 0 1 0 0 0 -1 0 27/512

0 0 0 1 0 0 0 0 -1 0 27/512

0 0 0 1 1 0 0 0 -2 -1 9/512

0 0 1 0 0 0 0 0 -1 0 27/512

0 0 1 0 1 0 0 1 -1 0 9/512

0 0 1 1 0 1 0 0 0 0 9/512

0 0 1 1 1 1 0 1 0 0 3/512

0 1 0 0 0 0 0 0 -1 0 27/512

0 1 0 0 1 0 0 1 -1 0 9/512

0 1 0 1 0 1 0 0 0 0 9/512

0 1 0 1 1 1 0 1 0 0 3/512

0 1 1 0 0 1 0 0 0 -1 9/512

0 1 1 0 1 1 0 0 -1 0 3/512

0 1 1 1 0 1 0 0 -1 0 3/512

0 1 1 1 1 1 0 0 -2 -1 1/512

1 0 0 0 0 0 1 0 1 0 81/512

1 0 0 0 1 0 1 0 0 -1 27/512

1 0 0 1 0 0 1 0 0 -1 27/512

1 0 0 1 1 0 1 0 -1 -2 9/512

1 0 1 0 0 0 1 0 0 -1 27/512

1 0 1 0 1 0 1 1 0 -1 9/512

1 0 1 1 0 1 1 0 1 -1 9/512

1 0 1 1 1 1 1 1 1 -1 3/512

1 1 0 0 0 0 1 0 0 -1 27/512

1 1 0 0 1 0 1 1 0 -1 9/512

1 1 0 1 0 1 1 0 1 -1 9/512

1 1 0 1 1 1 1 1 1 -1 3/512

1 1 1 0 0 1 1 0 1 -2 9/512

1 1 1 0 1 1 1 0 0 -1 3/512

1 1 1 1 0 1 1 0 0 -1 3/512

1 1 1 1 1 1 1 0 -1 -2 1/512
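A behavioral Verilog sketch of design 2, consistent with Table IV (Cout keeps the exact majority function of Fig. 8 (b); the naming is illustrative):

// Proposed approximate 4-2 compressor, design 2 (PACD 2).
// Cout stays exact; Carry is replaced by Cin; Sum is reduced
// to a single XOR, giving a 2-delta critical path.
module pacd2 (
    input  wire x0, x1, x2, x3, cin,
    output wire sum_a, carry_a, cout
);
    assign cout    = (x0 & x1) | (x1 & x2) | (x0 & x2);  // exact Cout (majority of x0, x1, x2)
    assign carry_a = cin;                                // Carry' = Cin, wrong in 8 of 32 states
    assign sum_a   = (x0 ^ x1) & x3;                     // Sum' per (4)
endmodule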


3.4 Unsigned Dadda Tree Multiplier

The proposed approximate 4-2 compressors are applied to the typical unsigned 8 × 8 Dadda tree multiplier in Fig. 12. The partial products, shown as black circles in Fig. 12, come from two-input AND gates, and n partial-product rows are generated by an n × n multiplier. Eight 4-2 compressors are needed to reduce the partial products in step 1, and another ten 4-2 compressors are required in step 2 to reach the final addition; moreover, three half adders and three full adders, not illustrated in Fig. 12, are needed. The approximate compressors are applied to the least significant n bits, and the optimized exact compressors are used for the remaining n bits. For the proposed approximate compressor design 1, Carry is ignored and eliminated, so there is no carry propagation in step 1; the carries generated in the approximate compressors, i.e., Cout', are moved to step 2. In the same way, the Carry from the approximate compressors in step 2 is ignored, so there is no carry propagation there either. Approximate and exact compressors are shown as dashed and solid boxes, respectively, in Fig. 12. This mixed structure reduces the error distance (ED) between exact and erroneous outputs; the same methodology is used in [24], [25].


Fig. 12. 8 × 8 Dadda Tree Multiplier using PACD 1
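To give a flavor of how this mixed reduction is wired, the sketch below instantiates an approximate compressor on one low-order column and an exact compressor on one high-order column of the 8 × 8 array (the column and bit indices are illustrative and do not reproduce Fig. 12 exactly):

// Mixed exact/approximate column reduction for an 8x8 unsigned multiply.
// pp[i][j] carries weight 2^(i+j); column k collects all pp[i][j] with i+j == k.
module dadda_column_example (
    input  wire [7:0] a, b,
    input  wire       cin11,  // carry arriving at column 11
    output wire       s3, c3, s11, cr11, co11
);
    wire [7:0] pp [0:7];
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : gen_pp
            assign pp[i] = a & {8{b[i]}};  // partial products from AND gates
        end
    endgenerate

    // Column 3 (low half): approximate reduction, Carry eliminated.
    pacd1 u_lo (.x0(pp[0][3]), .x1(pp[1][2]), .x2(pp[2][1]), .x3(pp[3][0]),
                .sum_a(s3), .cout_a(c3));

    // Column 11 (high half): exact reduction with the full carry chain.
    compressor_4_2_exact u_hi (.x0(pp[4][7]), .x1(pp[5][6]), .x2(pp[6][5]),
                               .x3(pp[7][4]), .cin(cin11),
                               .sum(s11), .carry(cr11), .cout(co11));
endmodule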

3.5 Signed Modified Baugh Wooley Multiplier

The proposed approximate 4-2 compressors are applied to the typical Baugh Wooley multiplier for signed multiplication. The typical signed 8 × 8 Baugh Wooley multiplication is shown in Fig. 13. A black solid dot denotes a partial product from the AND of two operand bits, and a white dot denotes the complement of a partial product: for example, if the original partial product is '1', its complement is '0'. The partial products are then reduced using half and full adders to obtain the final result. In order to reduce the computational cost of the multiplier, 4-2 compressors are implemented instead of adders, as shown in Fig. 13. The difference from Fig. 12 lies in the carry propagation of the proposed compressor design 2. As illustrated in Fig. 11, Cin equals Carry'; in other words, there is no carry propagation among instances of the proposed compressor design 2. However, if the next module is an exact compressor or an adder, the carry has to be propagated to it. If the Cin of the very first module of the proposed compressor design 2 is '0', the behavior is the same as in Fig. 12; otherwise, when the carry is '1', it has to be propagated to the exact compressor.

Fig. 13. 8 × 8 Signed Baugh Wooley Dadda Tree Multiplier using PACD 2

3.6 Signed Radix-4 Booth Multiplier

The Booth algorithm is well known for calculating multiplications efficiently [10]-[12]; the radix-4 Booth multiplier is used in the proposed design. The second operand is encoded according to Table V, so the number of initial partial product groups is cut in half. In the case of an 8 × 8 multiplication, there are four initial partial product groups.

Table V. Radix-4 Booth Encoding

Group   Partial Product

000 0

001 1*multiplicand

010 1*multiplicand

011 2*multiplicand

100 -2*multiplicand

101 -1*multiplicand

110 -1*multiplicand

111 0
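A minimal Verilog sketch of one radix-4 Booth partial-product row implementing Table V (the 3-bit group is {b[2i+1], b[2i], b[2i-1]}; the names and widths are illustrative, and row alignment and sign extension in the array are omitted):

// One radix-4 Booth row: encodes three multiplier bits into 0, +-A or +-2A.
module booth_row (
    input  wire signed [7:0] a,    // multiplicand A
    input  wire [2:0]        grp,  // {b[2i+1], b[2i], b[2i-1]}
    output reg  signed [9:0] pp    // selected partial product
);
    always @* begin
        case (grp)
            3'b000, 3'b111: pp = 10'sd0;      //  0
            3'b001, 3'b010: pp = a;           // +1 * multiplicand
            3'b011:         pp = a <<< 1;     // +2 * multiplicand
            3'b100:         pp = -(a <<< 1);  // -2 * multiplicand
            3'b101, 3'b110: pp = -a;          // -1 * multiplicand
        endcase
    end
endmodule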

Fig. 14. 8 × 8 Signed Radix-4 Booth Multiplier using PACD 2

The application of PACD 2 to the radix-4 Booth multiplier is illustrated in Fig. 14. In step 1, only full and half adders are used to reduce the circuitry; the carries are not propagated to the next module but are moved to the next step, so the delay can be decreased. In step 2, 4-2 compressors are used to reduce the circuitry. In the last step, a carry-propagate adder is used to obtain the final result.

3.7 A Modified MAC

The typical MAC illustrated in Fig. 7 computes one input datum with one weight. In order to reduce the DSP usage in FPGAs, [29] proposed the double MAC. However, the double MAC is only suitable for small bit widths: if the multiplication exceeds 27 × 18, which is the DSP48E2 specification, two DSPs or one DSP with LUTs are required. Thus, this paper suggests a modified double MAC for efficient FPGA resource utilization.

Fig. 15. (a) Four 2 × 2 Kernel Convolution; (b) A Modified MAC Structure

Fig. 15 shows the modified MAC structure. Fig. 15 (a) is an example of four 2 × 2 kernel convolutions with a stride of 1. The conventional MAC uses only DSPs for one kernel computation, so a total of 4 DSPs is required to compute the four 2 × 2 kernel CONV layer. However, for the deeper convolutions used for high-accuracy classification or recognition, the number of DSPs in the FPGA may not be sufficient to cover all of the MAC operations in the CONV layers, so the FPGA tool automatically converts the rest of the MAC operations to LUTs when the design is synthesized. Letting the tool convert DSP operations to LUTs automatically is not very efficient. Multipliers such as Booth and Baugh Wooley, however, are well optimized for area, power consumption, and delay, and they can be optimized further by using approximate compressors.

In this paper, the approximate compressor designs are applied to the Booth and Baugh Wooley multipliers, which are then implemented in the modified MAC for high performance and efficient FPGA resource utilization.
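For reference, the packing idea behind a double MAC [29] can be sketched as follows: two activations that share the same weight are multiplied in one wide multiplication by placing one of them in the upper bits of a packed operand. This is a simplified, unsigned illustration of the concept, not the circuit of [29] or the modified MAC of Fig. 15 (b); the guard bits must be wide enough that the lower product cannot overflow into the upper one.

// Double-MAC packing: one 24-bit x 8-bit multiply yields both a0*b and a1*b.
module double_mul (
    input  wire [7:0]  a0, a1,  // two activations sharing the same weight
    input  wire [7:0]  b,       // weight
    output wire [15:0] p0, p1   // p0 = a0*b, p1 = a1*b
);
    localparam K = 16;                      // offset of a1 in the packed operand
    wire [K+7:0]  packed_a = {a1, 16'b0} | {16'b0, a0};
    wire [K+15:0] wide     = packed_a * b;  // single wide multiplication
    assign p0 = wide[15:0];                 // lower product: a0*b < 2^16
    assign p1 = wide[K+15:K];               // upper product: a1*b
endmodule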

3.8 VDSR Hardware Structure

CNNs are used in many applications; in this paper, the modified MAC is implemented in the VDSR network shown in Fig. 16. An input image from the host PC is 256 × 256 pixels and goes through 12 CONV layers on the FPGA board. The inputs are 14 bits wide, so the multipliers described above are extended to 14 bits; moreover, as in most CNNs, the outputs of the MAC operations are truncated back to 14 bits. The input image and weight data are stored in the DRAM on the FPGA board. The ARM CPU on the FPGA board receives commands from the host, such as transferring data and running the CNN on the FPGA; when the host gives the 'Run' command, the FPGA starts to run the CNN for super-resolution. The intermediate data from each CONV layer are stored in on-chip SRAM, which works as the CNN's buffer. The output is a 512 × 512 super-resolution image.


Fig. 16. VDSR Hardware Structure


Chapter 4. Evaluation Results

4.1 Approximate Compressors

The two proposed compressors and the exact 4-2 compressor [21] are implemented in Verilog and synthesized using Synopsys Design Compiler with a TSMC 65 nm standard cell library. The results for area, delay, power consumption, and power-delay product (PDP) are summarized in Table VI. The proposed approximate compressor design 2 (PACD 2) shows 52% and 72% improvements in delay and power consumption, respectively, because its critical path is shorter by 1Δ and its gate count is smaller.

Table VI. Synthesized Results of the Compressors

Design             Area (μm2)  Delay (ns)  Power (μW)  PDP   ADP   APP
Exact Design [21]  24.4        0.23        5.65        1.30  5.61  137.85
Design 2 [24]      14.8        0.09        1.63        0.15  1.33  24.10
PACD 1             14.4        0.10        1.93        0.19  1.44  27.85
PACD 2             10.8        0.11        1.57        0.17  1.19  16.97

4.2 Approximate Compressors in Multipliers

The performance of the multipliers built with the proposed approximate compressors is evaluated, and the implementation and synthesis results are compared with those using the previous compressor designs of [21] and [24]. The proposed compressors are applied to the unsigned 8 × 8 Dadda tree multiplier for a fair comparison with the exact design and the previous work. Moreover, the proposed compressor design 2 is also applied to the Booth and Baugh Wooley multipliers to check whether there is any dependency on the multiplier type.

Table VII. Synthesized Results of Different Multiplier Designs

Design             Multiplier  Area (μm2)  Delay (ns)  Power (μW)  PDP (10^2)  ADP (10^3)  APP (10^5)
Exact Design [21]  Dadda Tree  772.8       1.31        205.73      2.70        1.01        1.59
Design 2 [24]      Dadda Tree  686.4       1.04        155.34      1.62        0.71        1.07
PACD 1             Dadda Tree  682.8       1.07        151.73      1.62        0.73        1.04
PACD 2             Dadda Tree  574         1.04        143.98      1.50        0.60        0.83
Exact Design [21]  Booth       986         2.21        254.57      5.63        2.18        2.51
Design 2 [24]      Booth       947.6       1.76        227.58      4.01        1.67        2.16
PACD 1             Booth       946         1.76        231.19      4.07        1.66        2.19
PACD 2             Booth       931         1.76        219.68      3.87        1.64        2.05
Exact Design [21]  BW          772.4       1.32        213.36      2.82        1.02        1.65
Design 2 [24]      BW          695.6       1.19        166.87      1.99        0.83        1.16
PACD 1             BW          692.4       1.19        164.86      1.96        0.82        1.14
PACD 2             BW          663.6       1.19        153.27      1.82        0.79        1.02

The PDP, area-delay product (ADP), and area-power product (APP) are shown in Table VII. Compared with the exact design, the multipliers with the proposed approximate compressors improve in area, delay, and power consumption: area and power consumption are reduced by 25.7% and 45%, respectively, when comparing the exact compressor with the proposed design 2 compressor on the Dadda tree multiplier. This matters because, as addressed in Section 1, many multipliers operate in parallel in most applications, so area and power consumption are proportional to the number of multipliers. Compared with the previous work in [24], the proposed compressors achieve power consumption and delay similar to its Designs 1 and 2; however, the area of the proposed compressor design 2 is much smaller. Compared to Design 2, the ADP and APP of the proposed compressor design 2 are improved by 12.7% and 12.3%, respectively. The synthesized results also show differences among the multipliers. The Booth and Baugh Wooley multipliers are compared because they perform signed multiplication and can therefore be implemented in a CNN. Booth shows the most significant improvement in delay, while its area is not reduced: Booth is 36% faster than Baugh Wooley because Booth has only 4 partial product groups while Baugh Wooley has 8. However, Baugh Wooley shows better area and power efficiency.

4.3 Error Analysis

The error metrics used in this paper are from [23]. The error distance (ED) is the difference between an exact output and the erroneous output. The normalized error distance (NED) normalizes the ED by the maximum value that the erroneous output can have, whereas the mean relative error distance (MRED) is the average of the relative error distance (RED), the ratio of the ED to the accurate output. The error rate is the ratio of inaccurate output states to the total of 65,536 states. The mean square error (MSE) is also used to analyze the error of the approximate multipliers. MSE is the most important metric for image quality, and the peak signal-to-noise ratio (PSNR), which compares an original image with a processed image, is also computed from the MSE.
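Writing M for the exact product, M' for the approximate product, and D for the maximum possible output, these metrics can be stated as follows (standard formulations in the spirit of [23]; the symbols are introduced here for illustration, with the averages taken over all 65,536 input pairs):

ED = |M - M'|
NED = mean(ED / D)
RED = ED / M, MRED = mean(RED)
MSE = mean(ED^2)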

As presented in Table VIII, the proposed PACD 2 has the lowest error in terms of MSE among the approximate compressors for all multiplier types. Moreover, the PACDs show lower MRED than Design 2 of [24] because the PACDs are 100% accurate when the exact output is '0', whereas Design 2 is not. This is critical for some CNN applications, such as super-resolution. The network of [28] uses twelve convolution layers, each followed by a rectified linear unit (ReLU) activation function. If the outputs of the previous layer are '0' or negative, the inputs of the next layer will be '0' after the ReLU; therefore, if the approximate outputs are not '0' when the exact values are '0', the ED grows as the data pass through the convolution layers.

Table VIII. Results of Error Analysis

Compressor     Multiplier  MSE     MRED    NED     Error Rate
Design 2 [24]  Dadda Tree  1.1825  0.0522  0.0023  52.3%
PACD 1         Dadda Tree  1.2192  0.0489  0.0036  45.8%
PACD 2         Dadda Tree  1.0918  0.0418  0.0031  63.3%
Design 2 [24]  Booth       0.4118  0.0651  0.0065  40.5%
PACD 1         Booth       0.4154  0.0662  0.0063  33.8%
PACD 2         Booth       0.2865  0.0031  0.0045  28.6%
Design 2 [24]  BW          0.8506  0.1227  0.0112  63.2%
PACD 1         BW          0.9734  0.1090  0.0116  67.8%
PACD 2         BW          0.7922  0.1232  0.0109  63.0%

4.4 Multiplier Comparison in CNN Application

The modified MAC is implemented in a CNN application, which in this paper is VDSR. Fig. 17 shows the VDSR experimental environment. The host PC runs Visual Studio on Windows; it quantizes the input image and the weight data and transfers them to the FPGA, communicating through the Xilinx Vivado SDK 2017.03 tool. The host PC sends the image and weight data to the FPGA, which stores them in DRAM; once all of the data is available, the host sends the command to run the CNN on the FPGA. VDSR uses 14 × 14 multiplications for the CNN; each 14-bit input consists of 1 sign bit, 3 integer bits, and 10 fraction bits. In the first experiment, the approximate compressors are applied to half of the partial product width, i.e., 14 bits.

The other experiments vary the multiplier, the compressor, and the number of bits to which approximation is applied. In one configuration, an approximate compressor is applied to 10 bits out of the 28 result bits, which corresponds to half of the fraction bits.

Fig. 17. VDSR Experimental Environment

Table IX shows the PSNR comparison among the various approximate MACs. As expected from the error analysis, PACD 2 on the Booth multiplier shows the highest image quality, at 37.6 dB on the Lena image. Typically, if the PSNR is above 30 dB, the quality of the image is considered good enough. When PACD 1 is compared with Design 2, its PSNR is higher even though their MSEs are similar, because Design 2 has errors when the expected output value is '0'. The VDSR output images are illustrated in Fig. 18.
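For reference, the PSNR values in Table IX follow the standard definition (assuming 8-bit pixels, i.e., MAX = 255, and the per-pixel MSE between the reference and output images):

PSNR = 10 × log10(MAX^2 / MSE) dB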

Table IX. FPGA Synthesis Results

Multiplier  Compressor     Power (W)  Delay (ns)  PSNR baby (dB)  PSNR Lena (dB)
Normal      -              4.23       16.81       -               -
Booth       Exact          4.23       16.71       -               -
BW          Exact          4.24       16.62       -               -
Booth       PACD 1         4.23       16.05       33.48           33.88
BW          PACD 2         4.23       15.64       31.5            31.91
Booth       PACD 2         4.23       15.84       36.68           37.6
Booth       Design 2 [24]  4.24       16.33       27.88           28.18

Fig. 18. VDSR Image Results: (a) Exact Design; (b) PACD 1 on Booth (33.48 dB); (c) PACD 2 on BW (31.5 dB); (d) PACD 2 on Booth (36.68 dB); (e) Design 2 [24] on Booth (27.88 dB)


Chapter 5. Conclusion

The two novel approximate 4-2 compressors were applied to the Booth multiplier and show the best performance in terms of super-resolution image quality, measured as PSNR. The final summary of the experimental results is shown in Table X. BW multipliers give better area, delay, and power consumption than the Booth multiplier; however, the super-resolution images based on BW multipliers do not reach acceptable quality. BW multipliers have more partial product groups, and so require more compressors to reduce the circuitry to the final result.

Table X. Summary of Experiment Results

Multiplier  Compressor     Area (μm2)  Delay (ns)  Power (μW)  MSE     PSNR (dB)
Booth       Design 2 [24]  947.6       1.76        227.58      0.4118  28.18
Booth       PACD 1         946         1.76        231.19      0.4154  33.88
Booth       PACD 2         931         1.76        219.68      0.2865  37.6

The reason for the large difference between the PSNRs of Design 2 [24] and PACD 1, even though they have similar MSEs, is that Design 2 has errors when the expected output value is '0'. This is important because most convolution layers use the ReLU as the activation function: when the results of the MAC operations pass through the ReLU, negative values become '0' while positive values remain as they are.

PACD 2 on Booth also shows the smallest area and power consumption among the approximate compressors, because PACD 2 has the smallest logic gate count.


Chapter 6. Discussion

State-of-the-art Convolutional Neural Networks (CNNs) have more convolution layers than those in the past, to increase the accuracy of classification and recognition. Much research has focused on reducing network size to save computational cost while keeping accuracy high; it is also important to optimize the convolution layer itself. This paper proposed approximate computing using novel 4-2 compressors and applied them to Baugh Wooley and Booth multipliers. Moreover, a multiply-and-accumulate (MAC) unit was modified for highly efficient Field Programmable Gate Array (FPGA) resource utilization. In addition, the modified MACs were implemented on the very deep convolutional network for image super-resolution (VDSR), which had not been tried in previous works. As a result, the proposed approximate compressors show 50% and 68% less delay and power consumption, respectively, than the exact design, and the modified MAC shows 10% and 11% improvements in APP and ADP, respectively, based on Synopsys Design Compiler synthesis results. Overall, the results show improvements in delay and power consumption with meaningful PSNR values compared with the original images. In this paper, two images were compared against the accurate MAC; they are among the images most commonly used for image quality comparison.


Bibliography

[1] Y. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989.
[2] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1998, pp. 255-258.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 2, pp. 1097-1105, 2012.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[5] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA. ACM, 2016, pp. 26-35.
[6] S. Kestur, J. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," in Proceedings of the IEEE Annual Symposium on VLSI (ISVLSI), 2010, pp. 288-293.
[7] E. Nurvitadhi, J. Sim, D. Sheffield, et al., "Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC," Field Programmable Logic and Applications (FPL), 2016.
[8] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," 2008 Symposium on Application Specific Processors, Anaheim, CA, 2008, pp. 101-107.
[9] E. Nurvitadhi, G. Venkatesh, J. Sim, et al., "Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?" International Symposium on Field-Programmable Gate Arrays (ISFPGA), 2017.
[10] K.-J. Cho, K.-C. Lee, J.-G. Chung, and K. K. Parhi, "Design of low-error fixed-width modified Booth multiplier," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 5, pp. 522-531, May 2004.
[11] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, "Low error fixed-width modified Booth multiplier," in Proc. IEEE Workshop on Signal Processing Systems, San Diego, CA, Oct. 2002, pp. 45-50.
[12] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, "Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing," IEEE Transactions on Computers, vol. 66, no. 8, pp. 1435-1441, Aug. 2017.
[13] S. R. Chowdhury, A. Banerjee, A. Roy, and H. Saha, "Design, Simulation and Testing of a High Speed Low Power 15-4 Compressor for High Speed Multiplication Applications," 2008 First International Conference on Emerging Trends in Engineering and Technology, Nagpur, Maharashtra, 2008, pp. 434-438.
[14] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, "Accelerating Compute-Intensive Applications with GPUs and FPGAs," 2008 Symposium on Application Specific Processors, Anaheim, CA, 2008, pp. 101-107.
[15] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in NIPS, 2017, pp. 344-352.
[16] S. Venkatachalam, E. Adams, H. J. Lee, and S. Ko, "Design and Analysis of Area and Power Efficient Approximate Booth Multipliers," IEEE Transactions on Computers, vol. 68, no. 11, pp. 1697-1703, Nov. 2019.
[17] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-Efficient Approximate Multiplication Circuits Through Partial Product Perforation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 10, pp. 3105-3117, Oct. 2016.
[18] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate Multipliers Based on New Approximate Compressors," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 12, pp. 4169-4182, Dec. 2018.
[19] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading Accuracy for Power with an Underdesigned Multiplier Architecture," 2011 24th International Conference on VLSI Design, Chennai, 2011, pp. 346-351.
[20] C. Chang and R. K. Satzoda, "A Low Error and High Performance Multiplexer-Based Truncated Multiplier," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 12, pp. 1767-1771, Dec. 2010.
[21] C.-H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, pp. 1985-1997, Oct. 2004.
[22] X. Yi, H. Pei, Z. Zhang, H. Zhou, and Y. He, "Design of an Energy-Efficient Approximate Compressor for Error-Resilient Multiplications," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 2019, pp. 1-5.
[23] J. Liang, J. Han, and F. Lombardi, "New Metrics for the Reliability of Approximate and Probabilistic Adders," IEEE Transactions on Computers, vol. 62, no. 9, pp. 1760-1771, Sept. 2013.
[24] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and Analysis of Approximate Compressors for Multiplication," IEEE Transactions on Computers, vol. 64, no. 4, pp. 984-994, April 2015.
[25] Z. Yang, J. Han, and F. Lombardi, "Approximate compressors for error-resilient multiplier design," 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), Amherst, MA, 2015, pp. 183-186.
[26] W. Ma and S. Li, "A new high compression compressor for large multiplier," 2008 9th International Conference on Solid-State and Integrated-Circuit Technology, Beijing, 2008, pp. 1877-1880.
[27] R. Marimuthu, Y. E. Rezinold, and P. S. Mallick, "Design and Analysis of Multiplier Using Approximate 15-4 Compressor," IEEE Access, vol. 5, pp. 1027-1036, 2017.
[28] D. Lee, S. Lee, H. S. Lee, H. Lee, and K. Lee, "Context-Preserving Filter Reorganization for VDSR-Based Super-Resolution," 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 2019, pp. 107-111.
[29] D. Nguyen, D. Kim, and J. Lee, "Double MAC: Doubling the performance of convolutional neural networks on modern FPGAs," Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, 2017, pp. 890-893.
[30] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," arXiv preprint arXiv:1506.02158, 2015.


Abstract (in Korean)

A trend of state-of-the-art Convolutional Neural Networks (CNNs) is that, compared with the past, they contain more convolution layers in order to increase the accuracy of recognition and classification. To reduce the amount of computation while maintaining high accuracy, there have been attempts to shrink the network itself, and attempts to reduce computation within a single network using approximate computing. This paper devises new 4-2 compressors and proposes an approximate computing method that applies them to the well-known Baugh Wooley and Booth multipliers. A convolution layer consists of multiplications whose results are accumulated by addition, which is called a MAC. The multipliers using the proposed approximate compressors were applied to the MAC, and the MAC was modified for efficient allocation of FPGA resources. As a result, compared with the exact compressor, improvements of 50% in delay and 68% in power were achieved. Furthermore, when applied to the MAC, the APP and ADP were reduced by 10% and 11%, respectively. Finally, the MAC was applied to VDSR hardware, and the super-resolution images were verified.