ELearn: Edge Learning Processor with Bidirectional Speculation and Sparsity & Mixed-Precision Aware Dataflow Parallelism Reconfiguration
Fengbin Tu1, Weiwei Wu1, Yang Wang1, Hongjiang Chen1, Feng Xiong1, Man Shi1, Ning Li1,
Jinyi Deng1, Tianbao Chen2, Leibo Liu1, Shaojun Wei1, Shouyi Yin1
1Institute of Microelectronics, Tsinghua University, Beijing, China
2TsingMicro Tech, Beijing, China
Abstract
To achieve higher accuracy on local devices, retraining DNN models on the edge is increasingly necessary. This paper proposes ELearn, a sparsity- and mixed-precision-aware edge learning processor. With bidirectional speculation for runtime data sparsity in training's feedforward (FF) and backpropagation (BP) passes, ELearn reduces computation by 63.3% and memory access by 85.5%. A parallelism-reconfigurable engine and a runtime parallelism optimizer are designed to match the time-varying workload parallelism caused by sparsity and mixed-precision. Our techniques achieve 3.60x higher energy efficiency and 1.44x higher PE utilization. ELearn is fabricated in 28nm CMOS and achieves an energy efficiency of 3.8-to-172.8 TOPS/W.
Edge Learning
• A mixed-precision DNN model is usually pretrained on the cloud and then deployed on edge devices, which causes accuracy degradation in complex local environments. However, sending local data to the cloud for retraining incurs high transmission cost and privacy issues.
• Retraining mixed-precision models directly on the edge is therefore essential to adapt DNNs to local environments.
Time-Varying Workload Parallelism
• We exploit training's runtime data sparsity (RTDS) and mixed-precision computation for efficient edge learning.
• RTDS and mixed-precision cause time-varying workload parallelism.
• Dynamic hardware reconfigurability is required to match these runtime changes, as illustrated by the sketch following the figure below.
(IP: Input Parallelism, OP: Output Parallelism, PP: Precision Parallelism)
[Figure: Layer-wise mixed-precision and feature/error RTDS (varying non-zero count) lead to tile-wise dataflow parallelism reconfiguration, shown as two PE array configurations with different IP/OP/PP partitions.]
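To make the effect concrete, here is a minimal Python sketch of how a tile's RTDS mask and precision determine the parallelism it can exploit. The tile shapes, masks, and bitwidths below are illustrative values, not measurements from the chip.

    # Illustrative sketch: RTDS and mixed-precision make the parallelism a
    # tile demands vary at runtime. Values below are made up for clarity.

    def tile_parallelism(rtds_mask, input_channels, bits_in, bits_w):
        """Return the (IP, OP, PP) demand of one tile."""
        op = sum(rtds_mask)          # only non-zero output channels need compute
        ip = input_channels          # inputs reduced per output channel
        pp = bits_in * bits_w        # PP: product of input and weight precision
        return ip, op, pp

    # Two tiles of the same network can demand very different configurations:
    print(tile_parallelism([1, 0, 1, 0, 0, 1], input_channels=64, bits_in=8, bits_w=4))
    print(tile_parallelism([1, 1, 1, 1, 1, 1], input_channels=64, bits_in=2, bits_w=2))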
ELearn’s Overall Architecture
[Figure: Block diagram. A controller (CTRL) and the Runtime Parallelism Optimizer (RPO) manage a 152KB input buffer and a 256KB weight buffer. The Bidirectional Speculation Unit (BSU) holds the NZ kernel extractor, comparator, NZ position encoder, and NZ psum buffer; it exchanges the RTDS mask, NZ-related psums, and NZ-related kernel addresses with the rest of the chip. The Parallelism-Reconfigurable Engine (PRE) contains four PE arrays (1024b x4 datapaths) with an input allocator for input sharing across PE0-PE7 in each row, adder trees for psum vertical accumulation, and an output aggregator, all controlled by the configuration word.]
Three Design Features
1. A Bidirectional Speculation Unit (BSU) that predicts RTDS and skips
zero-output computation in both FF and BP passes of training.
2. A Parallelism-Reconfigurable Engine (PRE) that enables dataflow with
flexible input, output and precision parallelism.
3. A Runtime Parallelism Optimizer (RPO) that dynamically optimizes
PRE's dataflow parallelism to match the time-varying workload
parallelism caused by RTDS and mixed-precision.
Bidirectional Speculation Mechanism
[Figure: Bidirectional speculation dataflows. The input X0 is split into a high-bit part (X0,HB) and a low-bit part (X0,LB). FF (high-bit) speculation first computes the high-bit psum X0,HB x W1, shifted left by the width of the low bits. If X0,HB x W1 <= 0, the output X1 is predicted to be zeroed by ReLU and the low-bit computation is skipped; if X0,HB x W1 > 0, the low-bit psum X0,LB x W1,m is computed with the masked weights and accumulated. The per-output predictions over a tile of the feature map form the RTDS mask (e.g., 101001). BP speculation reuses the mask over the error map: where X1 = 0 (so dY1 = 0), the dX1 computation is skipped; elsewhere dX1 = dY2 x W2,m^T is computed with the masked transposed weights. A NumPy sketch of the FF flow follows.]
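The FF half of the mechanism can be sketched in a few lines of NumPy. This is a behavioral illustration under assumed shapes and an assumed 4b high/low split; the real unit is a fixed-point hardware datapath, and the sign of the dominant high-bit psum is only a prediction of the full psum's sign.

    import numpy as np

    def ff_speculate(X0, W1, low_bits=4):
        """FF speculation: predict which ReLU outputs are zero from the
        high-bit psum, then run the low-bit pass only where needed."""
        X0_HB = X0 >> low_bits                    # high-bit part of X0 (X0,HB)
        X0_LB = X0 & ((1 << low_bits) - 1)        # low-bit part of X0 (X0,LB)
        psum_hb = (X0_HB << low_bits) @ W1        # high-bit psum, pre-shifted
        mask = psum_hb > 0                        # RTDS mask (e.g., 101001)
        X1 = np.zeros_like(psum_hb)
        # Low-bit computation X0,LB x W1,m only for predicted non-zero outputs
        X1[mask] = np.maximum(psum_hb[mask] + X0_LB @ W1[:, mask], 0)
        return X1, mask

    X0 = np.random.randint(0, 256, size=64)       # unsigned INT8 features
    W1 = np.random.randint(-8, 8, size=(64, 32))  # INT4 weights
    X1, mask = ff_speculate(X0, W1)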
Same Zero Distribution in X and dY
[Figure: Two-layer example. FF: Y1 = CONV(X0, W1), X1 = ReLU(Y1); Y2 = CONV(X1, W2), X2 = ReLU(Y2). BP: errors dX2, dX1 propagate back through CONV with W2^T, W1^T against the ground truth, producing weight gradients dW2, dW1. The RTDS mask generated in FF is reused in BP.]

In BP, dY1 = dX1 x ReLU'(Y1), where

ReLU'(Y1) = 1 if Y1 > 0 (i.e., X1 > 0),
ReLU'(Y1) = 0 if Y1 <= 0 (i.e., X1 = 0).

Therefore, if X1 = 0, then dY1 = 0 and the dX1 computation at that position can be skipped: the features X and the errors dY share the same zero distribution, so the FF RTDS mask is directly reusable in BP (see the sketch below).
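The BP half then becomes a masked matrix product. A minimal NumPy sketch of reusing the FF RTDS mask to skip dX1; the names (dY2, W2, mask) follow the slide, while the shapes are assumed for illustration.

    import numpy as np

    def bp_speculate(dY2, W2, mask):
        """BP speculation: dX1 = dY2 x W2^T, computed only where the FF
        RTDS mask says X1 > 0; elsewhere dY1 = 0 anyway, so dX1 is skipped."""
        dX1 = np.zeros(W2.shape[0], dtype=dY2.dtype)
        dX1[mask] = dY2 @ W2[mask, :].T    # masked weights W2,m: zero rows skipped
        # dY1 = dX1 * ReLU'(Y1); ReLU'(Y1) equals the mask itself, so the
        # zeros of dY1 line up with the zeros of X1 (same zero distribution).
        dY1 = dX1
        return dY1

    W2 = np.random.randint(-8, 8, size=(32, 16))   # layer-2 weights
    dY2 = np.random.randint(-4, 4, size=16)        # layer-2 error
    mask = np.random.rand(32) > 0.5                # RTDS mask from FF
    dY1 = bp_speculate(dY2, W2, mask)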
Bidirectional Speculation Unit (BSU)
[Figure: BSU microarchitecture. The NZ position encoder compares each of the 32 speculative psums against 0 and concatenates the 0/1 results into the RTDS mask (e.g., 101001). Dynamic zero compression then stores only the non-zero data together with their 5b positions in the NZ psum buffer (Index[0:31]). The NZ kernel extractor translates each non-zero position (NZP) into an NZ-related kernel address using shifts and adds:

Bank Addr = NZP mod 32, i.e., NZP - ((NZP >> 5) << 5)
Row Addr = floor(NZP / 32) x Offset

A behavioral sketch follows.]
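Behaviorally, the compression and address generation reduce to the following sketch. The example vector and the Offset value are illustrative; the shift-based mod/divide matches the >>5/<<5 logic in the figure.

    def dynamic_zero_compress(psums):
        """Keep only non-zero psums plus their 5b positions; emit the RTDS mask."""
        mask = [1 if p != 0 else 0 for p in psums]
        data = [p for p in psums if p != 0]
        positions = [i for i, p in enumerate(psums) if p != 0]
        return mask, data, positions

    def nz_kernel_addr(nzp, offset):
        """Translate a non-zero position (NZP) into an NZ-related kernel address."""
        bank_addr = nzp - ((nzp >> 5) << 5)   # NZP mod 32
        row_addr = (nzp >> 5) * offset        # floor(NZP / 32) x Offset
        return bank_addr, row_addr

    mask, data, pos = dynamic_zero_compress([-1, 4, 3, 0, -2, 0])
    # mask = [1, 1, 1, 0, 1, 0]; data = [-1, 4, 3, -2]; pos = [0, 1, 2, 4]
    addrs = [nz_kernel_addr(p, offset=8) for p in pos]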
Parallelism-Reconfigurable Engine (PRE)
• IP: The input allocator (IA) multicasts input segments to PEs.
• OP: The output aggregator (OA) merges partial sums.
• PP: The PE supports INT2/4/8 input-weight multiplication.
Input Allocator (IA)
[Figure: The IA reads 1024b words from the 152KB input buffer. Its bitwidth-adaptive serializer splits each word into 1024/(IP x PPi) segments of IP x PPi bits; each segment is divided into IProw subwords of IPpe x PPi bits and multicast to the PE rows (configurations shown for IProw = 1, 16, and 32).]

Output Aggregator (OA)
[Figure: The OA merges PE array outputs through vertical accumulation, configured by a chain of 256-to-128, 128-to-64, and 64-to-32 selectors followed by 32-to-16 and 16-to-8 adders, supporting OP = 128, OP = 64, and OP = 32/16/8.]

Precision-Fusible PE
[Figure: Each PE contains 16 multipliers whose fusion is set by the configuration word from the RPO, according to PPi x PPw: independent MULs for 2x2; fuse every 2 MULs for 2x4 and 4x2; fuse every 4 MULs for 2x8, 4x4, and 8x2; fuse every 8 MULs for 4x8 and 8x4; fuse all 16 MULs for 8x8. A sketch of this arithmetic follows.]
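The IA/OA/PE arithmetic above can be summarized in a short sketch, assuming the 1024b input word and the 16 multipliers per PE shown in the figures; the fusion formula is inferred from the listed precision groupings.

    def ia_segmentation(ip, pp_i, ip_pe, word_bits=1024):
        """Bitwidth-adaptive serializer: segments and subwords per input word."""
        segment_bits = ip * pp_i                  # segment length = IP x PPi bits
        num_segments = word_bits // segment_bits  # 1024 / (IP x PPi) segments
        subword_bits = ip_pe * pp_i               # subword length = IPpe x PPi bits
        ip_row = segment_bits // subword_bits     # IProw subwords per segment
        return num_segments, ip_row

    def pe_fusion(pp_i, pp_w):
        """Multipliers fused per product in the precision-fusible PE."""
        return (pp_i // 2) * (pp_w // 2)   # 2x2 -> independent, ..., 8x8 -> all 16

    print(ia_segmentation(ip=32, pp_i=8, ip_pe=2))             # (4, 16): IProw = 16
    print(pe_fusion(2, 4), pe_fusion(4, 4), pe_fusion(8, 8))   # 2, 4, 16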
RTDS & Mixed-Precision Aware Reconfiguration
Tile-wise configuration word:
• IP: Number of input data simultaneously fed to the PEs.
• OP: Number of output data concurrently produced by the PEs.
• PP: Product of input and weight precision.
[Figure: The tile-wise configuration word packs the fields PPi, PPw, IPpe, IProw, and OP. It is set per tile according to the varying RTDS along the input channels (Ci), the NZ output channel count (Co), and the layer-wise mixed-precision. A minimal field record follows.]
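As a reading aid, the configuration word's fields can be modeled as a plain record; field widths and encodings are not given on the slide and are omitted here.

    from dataclasses import dataclass

    @dataclass
    class TileConfigWord:
        """Tile-wise configuration word fields, as listed on the slide."""
        pp_i: int    # PPi: input precision (INT2/4/8)
        pp_w: int    # PPw: weight precision (INT2/4/8)
        ip_pe: int   # IPpe: input parallelism inside one PE
        ip_row: int  # IProw: input parallelism across a PE row
        op: int      # OP: output parallelism

    cfg = TileConfigWord(pp_i=8, pp_w=4, ip_pe=2, ip_row=16, op=64)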
Runtime Parallelism Optimizer (RPO)
The RPO finds the optimal IP-OP pair that maximizes PE utilization, based on the input channel count Ci, the NZ output channel count Co, and the current precision PP.
[Figure: RPO microarchitecture. A LUT maps the precision PPi x PPw to the per-PE input parallelism IPpe (2x2 -> 16, 2x4 -> 8, 8x2 -> 4, 8x8 -> 1). The OP exploration logic starts from the default OP = 256 and repeatedly halves OP (OP = OP >> 1), evaluating each candidate with shift/add/compare logic over Ci, the NZ output channel count Co derived from the RTDS mask, and IProw, until utilization stops improving or OP = 8. The optimal OP and the corresponding IP are written into the PRE configuration word.]

PE Utilization = (Ci x Co) / (ceil(Ci/IP) x ceil(Co/OP) x IP x OP)

A software sketch of this search follows.
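This is a minimal software sketch of the search the flowchart implies; the total IP x OP budget of 2048 is a placeholder assumption, and the hardware evaluates the same utilization expression with shift/compare logic rather than a loop.

    from math import ceil

    def utilization(ci, co, ip, op):
        """PE Utilization = (Ci x Co) / (ceil(Ci/IP) x ceil(Co/OP) x IP x OP)."""
        return (ci * co) / (ceil(ci / ip) * ceil(co / op) * ip * op)

    def rpo_search(ci, co, total=2048):
        """Explore OP from the default 256 down to 8, halving each step,
        and return the IP-OP pair with the highest PE utilization."""
        best = None
        op = 256
        while op >= 8:
            ip = total // op                 # remaining parallelism goes to IP
            u = utilization(ci, co, ip, op)
            if best is None or u > best[2]:
                best = (ip, op, u)
            op >>= 1                         # OP = OP >> 1
        return best

    # e.g., Ci = 96 input channels, RTDS leaves Co = 41 NZ output channels
    ip, op, u = rpo_search(ci=96, co=41)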
ELearn’s Reconfiguration Time Chart
Reconfigure once for each tile.
• FF: RPO generates a configuration word for PRE's low-bit computation.
• BP: The FF configurations are directly reused for each tile.
[Figure: Reconfiguration time chart over Tile0/Tile1 in FF and BP. In FF, the BSU performs high-bit speculation (32 points) per tile and emits the tile-wise RTDS mask; the PRE then runs FF high-bit and low-bit computation tile by tile, while the RPO issues the FF tile-wise reconfiguration. In BP, the PRE runs non-zero computation per tile, and the BP tile-wise reconfiguration reuses the FF configurations and the RTDS mask (RTDS reuse).]
ELearn’s Chip Micrograph and Summary
[Figure: chip micrograph and implementation summary table.]
Retraining Results on Private CIFAR100
ELearn improves the Top-1 accuracy of VGG8 and AlexNet by 21.90% and 11.45%, respectively, after retraining for 500 iterations on the private CIFAR100 dataset.

Accuracy   Cloud (Original)   Edge (Before Retrain)   Edge (After Retrain)
VGG8       68.93%             46.71%                  68.61% (+21.90%)
AlexNet    68.34%             56.46%                  67.91% (+11.45%)
Experiment Setup
• Private dataset: 5000+1000
• Batch size: 100
• 500 iterations = 10 epochs
• Weight precision: VGG8, INT2-4-8-8-4-8-4-8; AlexNet, INT8-4-8-4-8-8-8-8
• Feature/error precision: INT8
• Gradient precision: INT16
Energy Efficiency and PE Utilization Analysis
[Figure: measured energy efficiency and PE utilization; the proposed techniques achieve 3.60x higher energy efficiency and 1.44x higher PE utilization.]
Comparison with State-of-the-art DNN Training Processors

                    VLSI'19 [1]   ISSCC'19 [2]   ASSCC'19 [3]   ASSCC'19 [4]    VLSI'18 [5]    ISSCC'20 [6]   This Work
Sparsity Support    ×             √              ×              ×               ×              √              √
Dynamic Reconfig.   ×             ×              ×              ×               ×              ×              √
Technology          65nm          65nm           65nm           40nm            14nm           65nm           28nm
Die Area (mm²)      5.76          16             10.24          5               9              32.4           5.64
Data Precision      INT13/16      FP8/16         INT8           INT5/10, FP16   INT2/3, FP16   FP8/16         INT2/4/8
Frequency (MHz)     50-200        50-200         5-160          82-232          750-1500       25-200         120-268
Supply Voltage (V)  0.78-1.1      0.78-1.1       0.63-1.0       0.6-0.9         0.58-0.9       0.70-1.1       0.75-1.1
Power (mW)          36.8-252.4    43.1-367       9.55-120.5     18.7-64.5       -              58-647         6.75-36

Power conditions: [1] 36.8 @0.78V/50MHz, 252.4 @1.1V/200MHz; [2] 43.1 @0.78V/50MHz, 367 @1.1V/200MHz; [3] 9.55 @0.63V/5MHz, 120.5 @1.0V/80MHz; [4] 18.7 @0.6V/82MHz (INFER, INT5/10), 64.5 @0.6V/82MHz (TRAIN, FP16); [6] 58 @0.70V/25MHz, 647 @1.1V/200MHz; This Work: 6.75 @0.75V/120MHz, 36 @1.1V/268MHz.

Peak Performance (TOPS or TFLOPS): [1] 0.155 @INT13/16; [2] >0.3 @FP16, >0.6 @FP8; [3] 0.13 @INT8; [4] -; [5] 1.5 @FP16, 12 @INT3, 24 @INT2; [6] 14.03 @FP16, 24.13 @FP8; This Work: 0.137 @INT8 1), 2.195 @INT2 1).

Energy Efficiency (TOPS/W or TFLOPS/W): [1] 1.32 @INT13/16; [2] 15.6 @FP16, 25.3 @FP8; [3] 1.03 @INT8; [4] 2.25 (INFER), 0.65 (TRAIN), normalized to INT16; [5] -; [6] 1.81-75.68 @FP16, 3.62-135.10 @FP8; This Work: 32.9 @INT8 2), 130.6 @INT4 2), 172.8 @INT2 2).

1) @1.1V, 268MHz; 2) @0.8V, 232MHz, VGG8 retraining.
References
[1] D. Han, J. Lee, J. Lee, and H.-J. Yoo, "A 1.32 TOPS/W energy efficient deep neural network learning processor with direct feedback alignment based heterogeneous core architecture," in VLSI, 2019.
[2] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, "LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16," in ISSCC, 2019.
[3] S. Choi, J. Sim, M. Kang, Y. Choi, H. Kim, and L.-S. Kim, "A 47.4 µJ/epoch trainable deep convolutional neural network accelerator for in-situ personalization on smart devices," in ASSCC, 2019.
[4] C.-H. Lu, Y.-C. Wu, and C.-H. Yang, "A 2.25 TOPS/W fully-integrated deep CNN learning processor with on-chip training," in ASSCC, 2019.
[5] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C. Chen, P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber, D. Lee, S. Lo, G. Maier, M. Scheuermann, S. Venkataramani, C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P. Lu, B. Curran, L. Chang, and K. Gopalakrishnan, "A scalable multi-TeraOPS deep learning processor core for AI training and inference," in VLSI, 2018.
[6] S. Kang, D. Han, J. Lee, D. Im, S. Kim, and H.-J. Yoo, "GANPU: A 135 TFLOPS/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation," in ISSCC, 2020.