ELearn: Edge Learning Processor with Bidirectional Speculation and Sparsity & Mixed-Precision Aware Dataflow Parallelism Reconfiguration
Fengbin Tu1, Weiwei Wu1, Yang Wang1, Hongjiang Chen1, Feng Xiong1, Man Shi1, Ning Li1,
Jinyi Deng1, Tianbao Chen2, Leibo Liu1, Shaojun Wei1, Shouyi Yin1
1Institute of Microelectronics, Tsinghua University, Beijing, China
2TsingMicro Tech, Beijing, China
Abstract
To achieve higher accuracy on local devices, retraining DNN models on the edge is increasingly necessary. This paper proposes ELearn, a sparsity- and mixed-precision-aware edge learning processor. With bidirectional speculation for runtime data sparsity in training's feedforward (FF) and backpropagation (BP) passes, ELearn reduces computation by 63.3% and memory access by 85.5%. A parallelism-reconfigurable engine and a runtime parallelism optimizer are designed to match the time-varying workload parallelism caused by sparsity and mixed-precision. Our techniques achieve 3.60x higher energy efficiency and 1.44x higher PE utilization. ELearn is fabricated in 28nm CMOS and achieves an energy efficiency of 3.8-to-172.8 TOPS/W.
Edge Learning
• A mixed-precision DNN model is usually pretrained on the cloud and then deployed on edge devices, which causes accuracy degradation in complex local environments. However, sending local data to the cloud for retraining incurs high transmission cost and privacy issues.
• Retraining mixed-precision models directly on the edge is therefore essential to adapt DNNs to local environments.
Time-Varying Workload Parallelism
• We exploit training's runtime data sparsity (RTDS) and mixed-precision computation for efficient edge learning.
• RTDS and mixed-precision cause time-varying workload parallelism.
• Dynamic hardware reconfigurability is required to match these runtime changes, as illustrated by the sketch following the figure below.
(IP: Input Parallelism, OP: Output Parallelism, PP: Precision Parallelism)
[Figure: Layer-wise mixed-precision and feature/error RTDS (varying non-zero count) lead to tile-wise dataflow parallelism reconfiguration, shown as two PE array configurations with different IP/OP/PP partitions.]
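To make the effect concrete, here is a minimal Python sketch of how a tile's RTDS mask and precision determine the parallelism it can exploit. The tile shapes, masks, and bitwidths below are illustrative values, not measurements from the chip.

    # Illustrative sketch: RTDS and mixed-precision make the parallelism a
    # tile demands vary at runtime. Values below are made up for clarity.

    def tile_parallelism(rtds_mask, input_channels, bits_in, bits_w):
        """Return the (IP, OP, PP) demand of one tile."""
        op = sum(rtds_mask)          # only non-zero output channels need compute
        ip = input_channels          # inputs reduced per output channel
        pp = bits_in * bits_w        # PP: product of input and weight precision
        return ip, op, pp

    # Two tiles of the same network can demand very different configurations:
    print(tile_parallelism([1, 0, 1, 0, 0, 1], input_channels=64, bits_in=8, bits_w=4))
    print(tile_parallelism([1, 1, 1, 1, 1, 1], input_channels=64, bits_in=2, bits_w=2))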
ELearn’s Overall Architecture
[Figure: Block diagram. A controller (CTRL) and the Runtime Parallelism Optimizer (RPO) manage a 152KB input buffer and a 256KB weight buffer. The Bidirectional Speculation Unit (BSU) holds the NZ kernel extractor, comparator, NZ position encoder, and NZ psum buffer; it exchanges the RTDS mask, NZ-related psums, and NZ-related kernel addresses with the rest of the chip. The Parallelism-Reconfigurable Engine (PRE) contains four PE arrays (1024b x4 datapaths) with an input allocator for input sharing across PE0-PE7 in each row, adder trees for psum vertical accumulation, and an output aggregator, all controlled by the configuration word.]
Three Design Features
1. A Bidirectional Speculation Unit (BSU) that predicts RTDS and skips
zero-output computation in both FF and BP passes of training.
2. A Parallelism-Reconfigurable Engine (PRE) that enables dataflow with
flexible input, output and precision parallelism.
3. A Runtime Parallelism Optimizer (RPO) that dynamically optimizes
PRE's dataflow parallelism to match the time-varying workload
parallelism caused by RTDS and mixed-precision.
Bidirectional Speculation Mechanism
[Figure: Bidirectional speculation dataflows. The input X0 is split into a high-bit part (X0,HB) and a low-bit part (X0,LB). FF (high-bit) speculation first computes the high-bit psum X0,HB x W1, shifted left by the width of the low bits. If X0,HB x W1 <= 0, the output X1 is predicted to be zeroed by ReLU and the low-bit computation is skipped; if X0,HB x W1 > 0, the low-bit psum X0,LB x W1,m is computed with the masked weights and accumulated. The per-output predictions over a tile of the feature map form the RTDS mask (e.g., 101001). BP speculation reuses the mask over the error map: where X1 = 0 (so dY1 = 0), the dX1 computation is skipped; elsewhere dX1 = dY2 x W2,m^T is computed with the masked transposed weights. A NumPy sketch of the FF flow follows.]
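The FF half of the mechanism can be sketched in a few lines of NumPy. This is a behavioral illustration under assumed shapes and an assumed 4b high/low split; the real unit is a fixed-point hardware datapath, and the sign of the dominant high-bit psum is only a prediction of the full psum's sign.

    import numpy as np

    def ff_speculate(X0, W1, low_bits=4):
        """FF speculation: predict which ReLU outputs are zero from the
        high-bit psum, then run the low-bit pass only where needed."""
        X0_HB = X0 >> low_bits                    # high-bit part of X0 (X0,HB)
        X0_LB = X0 & ((1 << low_bits) - 1)        # low-bit part of X0 (X0,LB)
        psum_hb = (X0_HB << low_bits) @ W1        # high-bit psum, pre-shifted
        mask = psum_hb > 0                        # RTDS mask (e.g., 101001)
        X1 = np.zeros_like(psum_hb)
        # Low-bit computation X0,LB x W1,m only for predicted non-zero outputs
        X1[mask] = np.maximum(psum_hb[mask] + X0_LB @ W1[:, mask], 0)
        return X1, mask

    X0 = np.random.randint(0, 256, size=64)       # unsigned INT8 features
    W1 = np.random.randint(-8, 8, size=(64, 32))  # INT4 weights
    X1, mask = ff_speculate(X0, W1)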
Same Zero Distribution in X and dY
[Figure: Two-layer example. FF: Y1 = CONV(X0, W1), X1 = ReLU(Y1); Y2 = CONV(X1, W2), X2 = ReLU(Y2). BP: errors dX2, dX1 propagate back through CONV with W2^T, W1^T against the ground truth, producing weight gradients dW2, dW1. The RTDS mask generated in FF is reused in BP.]

In BP, dY1 = dX1 x ReLU'(Y1), where

ReLU'(Y1) = 1 if Y1 > 0 (i.e., X1 > 0),
ReLU'(Y1) = 0 if Y1 <= 0 (i.e., X1 = 0).

Therefore, if X1 = 0, then dY1 = 0 and the dX1 computation at that position can be skipped: the features X and the errors dY share the same zero distribution, so the FF RTDS mask is directly reusable in BP (see the sketch below).
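The BP half then becomes a masked matrix product. A minimal NumPy sketch of reusing the FF RTDS mask to skip dX1; the names (dY2, W2, mask) follow the slide, while the shapes are assumed for illustration.

    import numpy as np

    def bp_speculate(dY2, W2, mask):
        """BP speculation: dX1 = dY2 x W2^T, computed only where the FF
        RTDS mask says X1 > 0; elsewhere dY1 = 0 anyway, so dX1 is skipped."""
        dX1 = np.zeros(W2.shape[0], dtype=dY2.dtype)
        dX1[mask] = dY2 @ W2[mask, :].T    # masked weights W2,m: zero rows skipped
        # dY1 = dX1 * ReLU'(Y1); ReLU'(Y1) equals the mask itself, so the
        # zeros of dY1 line up with the zeros of X1 (same zero distribution).
        dY1 = dX1
        return dY1

    W2 = np.random.randint(-8, 8, size=(32, 16))   # layer-2 weights
    dY2 = np.random.randint(-4, 4, size=16)        # layer-2 error
    mask = np.random.rand(32) > 0.5                # RTDS mask from FF
    dY1 = bp_speculate(dY2, W2, mask)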
Bidirectional Speculation Unit (BSU)
[Figure: BSU microarchitecture. The NZ position encoder compares each of the 32 speculative psums against 0 and concatenates the 0/1 results into the RTDS mask (e.g., 101001). Dynamic zero compression then stores only the non-zero data together with their 5b positions in the NZ psum buffer (Index[0:31]). The NZ kernel extractor translates each non-zero position (NZP) into an NZ-related kernel address using shifts and adds:

Bank Addr = NZP mod 32, i.e., NZP - ((NZP >> 5) << 5)
Row Addr = floor(NZP / 32) x Offset

A behavioral sketch follows.]
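Behaviorally, the compression and address generation reduce to the following sketch. The example vector and the Offset value are illustrative; the shift-based mod/divide matches the >>5/<<5 logic in the figure.

    def dynamic_zero_compress(psums):
        """Keep only non-zero psums plus their 5b positions; emit the RTDS mask."""
        mask = [1 if p != 0 else 0 for p in psums]
        data = [p for p in psums if p != 0]
        positions = [i for i, p in enumerate(psums) if p != 0]
        return mask, data, positions

    def nz_kernel_addr(nzp, offset):
        """Translate a non-zero position (NZP) into an NZ-related kernel address."""
        bank_addr = nzp - ((nzp >> 5) << 5)   # NZP mod 32
        row_addr = (nzp >> 5) * offset        # floor(NZP / 32) x Offset
        return bank_addr, row_addr

    mask, data, pos = dynamic_zero_compress([-1, 4, 3, 0, -2, 0])
    # mask = [1, 1, 1, 0, 1, 0]; data = [-1, 4, 3, -2]; pos = [0, 1, 2, 4]
    addrs = [nz_kernel_addr(p, offset=8) for p in pos]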
Parallelism-Reconfigurable Engine (PRE)
• IP: The input allocator (IA) multicasts input segments to PEs.
• OP: The output aggregator (OA) merges partial sums.
• PP: The PE supports INT2/4/8 input-weight multiplication.
Input Allocator (IA)
[Figure: The IA reads 1024b words from the 152KB input buffer. Its bitwidth-adaptive serializer splits each word into 1024/(IP x PPi) segments of IP x PPi bits; each segment is divided into IProw subwords of IPpe x PPi bits and multicast to the PE rows (configurations shown for IProw = 1, 16, and 32).]

Output Aggregator (OA)
[Figure: The OA merges PE array outputs through vertical accumulation, configured by a chain of 256-to-128, 128-to-64, and 64-to-32 selectors followed by 32-to-16 and 16-to-8 adders, supporting OP = 128, OP = 64, and OP = 32/16/8.]

Precision-Fusible PE
[Figure: Each PE contains 16 multipliers whose fusion is set by the configuration word from the RPO, according to PPi x PPw: independent MULs for 2x2; fuse every 2 MULs for 2x4 and 4x2; fuse every 4 MULs for 2x8, 4x4, and 8x2; fuse every 8 MULs for 4x8 and 8x4; fuse all 16 MULs for 8x8. A sketch of this arithmetic follows.]
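The IA/OA/PE arithmetic above can be summarized in a short sketch, assuming the 1024b input word and the 16 multipliers per PE shown in the figures; the fusion formula is inferred from the listed precision groupings.

    def ia_segmentation(ip, pp_i, ip_pe, word_bits=1024):
        """Bitwidth-adaptive serializer: segments and subwords per input word."""
        segment_bits = ip * pp_i                  # segment length = IP x PPi bits
        num_segments = word_bits // segment_bits  # 1024 / (IP x PPi) segments
        subword_bits = ip_pe * pp_i               # subword length = IPpe x PPi bits
        ip_row = segment_bits // subword_bits     # IProw subwords per segment
        return num_segments, ip_row

    def pe_fusion(pp_i, pp_w):
        """Multipliers fused per product in the precision-fusible PE."""
        return (pp_i // 2) * (pp_w // 2)   # 2x2 -> independent, ..., 8x8 -> all 16

    print(ia_segmentation(ip=32, pp_i=8, ip_pe=2))             # (4, 16): IProw = 16
    print(pe_fusion(2, 4), pe_fusion(4, 4), pe_fusion(8, 8))   # 2, 4, 16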
RTDS & Mixed-Precision Aware Reconfiguration
Tile-wise configuration word:
• IP: Number of input data simultaneously fed to the PEs.
• OP: Number of output data concurrently produced by the PEs.
• PP: Product of input and weight precision.
[Figure: The tile-wise configuration word packs the fields PPi, PPw, IPpe, IProw, and OP. It is set per tile according to the varying RTDS along the input channels (Ci), the NZ output channel count (Co), and the layer-wise mixed-precision. A minimal field record follows.]
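As a reading aid, the configuration word's fields can be modeled as a plain record; field widths and encodings are not given on the slide and are omitted here.

    from dataclasses import dataclass

    @dataclass
    class TileConfigWord:
        """Tile-wise configuration word fields, as listed on the slide."""
        pp_i: int    # PPi: input precision (INT2/4/8)
        pp_w: int    # PPw: weight precision (INT2/4/8)
        ip_pe: int   # IPpe: input parallelism inside one PE
        ip_row: int  # IProw: input parallelism across a PE row
        op: int      # OP: output parallelism

    cfg = TileConfigWord(pp_i=8, pp_w=4, ip_pe=2, ip_row=16, op=64)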
Runtime Parallelism Optimizer (RPO)
The RPO finds the optimal IP-OP pair that maximizes PE utilization, based on the input channel count Ci, the NZ output channel count Co, and the current precision PP.
[Figure: RPO microarchitecture. A LUT maps the precision PPi x PPw to the per-PE input parallelism IPpe (2x2 -> 16, 2x4 -> 8, 8x2 -> 4, 8x8 -> 1). The OP exploration logic starts from the default OP = 256 and repeatedly halves OP (OP = OP >> 1), evaluating each candidate with shift/add/compare logic over Ci, the NZ output channel count Co derived from the RTDS mask, and IProw, until utilization stops improving or OP = 8. The optimal OP and the corresponding IP are written into the PRE configuration word.]

PE Utilization = (Ci x Co) / (ceil(Ci/IP) x ceil(Co/OP) x IP x OP)

A software sketch of this search follows.
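This is a minimal software sketch of the search the flowchart implies; the total IP x OP budget of 2048 is a placeholder assumption, and the hardware evaluates the same utilization expression with shift/compare logic rather than a loop.

    from math import ceil

    def utilization(ci, co, ip, op):
        """PE Utilization = (Ci x Co) / (ceil(Ci/IP) x ceil(Co/OP) x IP x OP)."""
        return (ci * co) / (ceil(ci / ip) * ceil(co / op) * ip * op)

    def rpo_search(ci, co, total=2048):
        """Explore OP from the default 256 down to 8, halving each step,
        and return the IP-OP pair with the highest PE utilization."""
        best = None
        op = 256
        while op >= 8:
            ip = total // op                 # remaining parallelism goes to IP
            u = utilization(ci, co, ip, op)
            if best is None or u > best[2]:
                best = (ip, op, u)
            op >>= 1                         # OP = OP >> 1
        return best

    # e.g., Ci = 96 input channels, RTDS leaves Co = 41 NZ output channels
    ip, op, u = rpo_search(ci=96, co=41)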
ELearn’s Reconfiguration Time Chart
Reconfigure once for each tile.
• FF: RPO generates a configuration word for PRE's low-bit computation.
• BP: The FF configurations are directly reused for each tile.
[Figure: Reconfiguration time chart over Tile0/Tile1 in FF and BP. In FF, the BSU performs high-bit speculation (32 points) per tile and emits the tile-wise RTDS mask; the PRE then runs FF high-bit and low-bit computation tile by tile, while the RPO issues the FF tile-wise reconfiguration. In BP, the PRE runs non-zero computation per tile, and the BP tile-wise reconfiguration reuses the FF configurations and the RTDS mask (RTDS reuse).]
ELearn’s Chip Micrograph and Summary
[Figure: chip micrograph and implementation summary table.]
Retraining Results on Private CIFAR100
ELearn improves the Top-1 accuracy of VGG8 and AlexNet by 21.90% and 11.45%, respectively, after retraining for 500 iterations on the private CIFAR100 dataset.

Accuracy   Cloud (Original)   Edge (Before Retrain)   Edge (After Retrain)
VGG8       68.93%             46.71%                  68.61% (+21.90%)
AlexNet    68.34%             56.46%                  67.91% (+11.45%)
Experiment Setup
• Private dataset: 5000+1000
• Batch size: 100
• 500 iterations = 10 epochs
• Weight precision: VGG8, INT2-4-8-8-4-8-4-8; AlexNet, INT8-4-8-4-8-8-8-8
• Feature/error precision: INT8
• Gradient precision: INT16
Energy Efficiency and PE Utilization Analysis
[Figure: measured energy efficiency and PE utilization; the proposed techniques achieve 3.60x higher energy efficiency and 1.44x higher PE utilization.]
Comparison with State-of-the-art DNN Training Processors

                    VLSI'19 [1]   ISSCC'19 [2]   ASSCC'19 [3]   ASSCC'19 [4]    VLSI'18 [5]    ISSCC'20 [6]   This Work
Sparsity Support    ×             √              ×              ×               ×              √              √
Dynamic Reconfig.   ×             ×              ×              ×               ×              ×              √
Technology          65nm          65nm           65nm           40nm            14nm           65nm           28nm
Die Area (mm²)      5.76          16             10.24          5               9              32.4           5.64
Data Precision      INT13/16      FP8/16         INT8           INT5/10, FP16   INT2/3, FP16   FP8/16         INT2/4/8
Frequency (MHz)     50-200        50-200         5-160          82-232          750-1500       25-200         120-268
Supply Voltage (V)  0.78-1.1      0.78-1.1       0.63-1.0       0.6-0.9         0.58-0.9       0.70-1.1       0.75-1.1
Power (mW)          36.8-252.4    43.1-367       9.55-120.5     18.7-64.5       -              58-647         6.75-36

Power conditions: [1] 36.8 @0.78V/50MHz, 252.4 @1.1V/200MHz; [2] 43.1 @0.78V/50MHz, 367 @1.1V/200MHz; [3] 9.55 @0.63V/5MHz, 120.5 @1.0V/80MHz; [4] 18.7 @0.6V/82MHz (INFER, INT5/10), 64.5 @0.6V/82MHz (TRAIN, FP16); [6] 58 @0.70V/25MHz, 647 @1.1V/200MHz; This Work: 6.75 @0.75V/120MHz, 36 @1.1V/268MHz.

Peak Performance (TOPS or TFLOPS): [1] 0.155 @INT13/16; [2] >0.3 @FP16, >0.6 @FP8; [3] 0.13 @INT8; [4] -; [5] 1.5 @FP16, 12 @INT3, 24 @INT2; [6] 14.03 @FP16, 24.13 @FP8; This Work: 0.137 @INT8 1), 2.195 @INT2 1).

Energy Efficiency (TOPS/W or TFLOPS/W): [1] 1.32 @INT13/16; [2] 15.6 @FP16, 25.3 @FP8; [3] 1.03 @INT8; [4] 2.25 (INFER), 0.65 (TRAIN), normalized to INT16; [5] -; [6] 1.81-75.68 @FP16, 3.62-135.10 @FP8; This Work: 32.9 @INT8 2), 130.6 @INT4 2), 172.8 @INT2 2).

1) @1.1V, 268MHz; 2) @0.8V, 232MHz, VGG8 retraining.
References
[1] D. Han, J. Lee, J. Lee, and H.-J. Yoo, "A 1.32 TOPS/W energy efficient deep neural network learning processor with direct feedback alignment based heterogeneous core architecture," in VLSI, 2019.
[2] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, "LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16," in ISSCC, 2019.
[3] S. Choi, J. Sim, M. Kang, Y. Choi, H. Kim, and L.-S. Kim, "A 47.4 µJ/epoch trainable deep convolutional neural network accelerator for in-situ personalization on smart devices," in ASSCC, 2019.
[4] C.-H. Lu, Y.-C. Wu, and C.-H. Yang, "A 2.25 TOPS/W fully-integrated deep CNN learning processor with on-chip training," in ASSCC, 2019.
[5] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C. Chen, P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber, D. Lee, S. Lo, G. Maier, M. Scheuermann, S. Venkataramani, C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P. Lu, B. Curran, L. Chang, and K. Gopalakrishnan, "A scalable multi-TeraOPS deep learning processor core for AI training and inference," in VLSI, 2018.
[6] S. Kang, D. Han, J. Lee, D. Im, S. Kim, and H.-J. Yoo, "GANPU: A 135 TFLOPS/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation," in ISSCC, 2020.