S8822 – OPTIMIZING NMT WITH TENSORRT
Micah Villmow, Senior TensorRT Software Engineer
on-demand.gputechconf.com/gtc/2018/presentation/...


TRANSCRIPT

Page 1: S8822 OPTIMIZING NMT WITH TENSORRT

Micah Villmow

Senior TensorRT Software Engineer

S8822 – OPTIMIZING NMT WITH TENSORRT

Page 2

2

Over 100x faster, is it really possible?

Page 3

3

Neural Machine Translation Unit

DOUGLAS ADAMS – BABEL FISH

Page 4

4

OVER 100X FASTER, IS IT REALLY POSSIBLE?

Over 200 years

Page 5

5

NVIDIA TENSORRT: Programmable Inference Accelerator

developer.nvidia.com/tensorrt

GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

[Diagram] TensorRT, an optimizer and a runtime, sits between the frameworks and the GPU platforms.

Page 6

6

TENSORRT LAYERS

Built-in Layer Support:

• Convolution

• LSTM and GRU

• Activation: ReLU, tanh, sigmoid

• Pooling: max and average

• Scaling

• Element-wise operations

• LRN

• Fully-connected

• SoftMax

• Deconvolution

Custom Layer API: a custom layer plugs into the TensorRT runtime alongside the built-in layers; the deployed application runs on the TensorRT runtime on top of the CUDA runtime.

Page 7

7

TENSORRT OPTIMIZATIONS

• Kernel Auto-Tuning

• Layer & Tensor Fusion

• Dynamic Tensor Memory

• Weights & Activation Precision Calibration

[Chart: ResNet50 inference throughput (images/sec) and latency (ms). CPU-Only: 140 images/sec, 14 ms. V100 + TensorFlow: 305 images/sec, 6.67 ms. V100 + TensorRT: 5,700 images/sec, 6.83 ms.]

40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.

Page 8

8

Agenda

• What is NMT?

• What is the current state?

• What are the problems?

• How did we solve it?

• What perf is possible?

Page 9

9

ACRONYMS AND DEFINITIONS

NMT: Neural Machine Translation

OpenNMT: Open source NMT project for academia and industry

Token: The minimum representation used for encoding (symbol, word, character, or subword)

Sequence: A number of tokens wrapped by special start and end sequence tokens.

Beam Search: A directed, partial breadth-first tree search algorithm

TopK: A partial sort resulting in the N min/max elements

Unk: Special token that represents unknown translations.

Page 10

10

OPENNMT INFERENCE

[Diagram] Encoder: Input → Input Setup → Encoder RNN. Decoder: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) → Output.

Page 11

11

DECODER EXAMPLE

[Diagram] Input Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) → Output Embedding.

Iteration 0: the decoder starts from the <S> token and proposes one candidate per beam: "This", "The", "He", "What", "The".

Iteration 1: each surviving beam is extended with its next token: "is", "house", "ran", "time", "cow".

Page 12

12

TRAINING VS INFERENCE

[Diagram] Training: Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → Output. No beam search is needed.

[Diagram] Inference: the same encoder/decoder pipeline, plus TopK and the Beam Search stages (Beam Scoring, Beam Shuffle, Batch Reduction) after the Projection.

Page 13

13

Agenda

• What is NMT?

• What is the current state?

• What are the problems?

• How did we solve it?

• What perf is possible?

Page 14

14

INFERENCE TIME IS BEAM SEARCH TIME

• Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144

• Sharan Narang, June 2017, Baidu's DeepBench - https://github.com/baidu-research/DeepBench

• Rui Zhao, December 2017, "Why does inference run 20x slower than training?" - https://github.com/tensorflow/nmt/issues/204

• David Levinthal, Ph.D., January 2018, "Evaluating RNN performance across hardware platforms"

Page 15

15

Agenda

• What is NMT?

• What is the current state?

• What are the problems?

• How did we solve it?

• What perf is possible?

Page 16

16

PERF ANALYSIS

Page 17

17

KERNEL ANALYSIS

Page 18

18

Agenda

• What is NMT?

• What is the current state?

• What are the problems?

• How did we solve it?

• What perf is possible?

Page 19

19

ENCODER

[Diagram] Encoder: Input → Input Setup → Encoder RNN.

Page 20

20

[Diagram] Input Setup: Input "Hello. This is a test. Bye." → Tokenization: "Hello .", "This is a test .", "Bye ." → Gather → Encoder Input.

Encoder Input (token ids, right-padded):
42 23  0  0  0 0
73  3  8 19 23 0
98 23  0  0  0 0

Sequence Length Buffer (via the PrefixSum plugin): 2, 5, 2

The setup also emits the decoder start tokens (a constant) and a zero-initialized state buffer.
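The padding and length bookkeeping above can be sketched in a few lines of NumPy. This is an illustrative sketch, not TensorRT code: `VOCAB` and `make_encoder_input` are hypothetical names, with token ids chosen to match the slide's example.

```python
import numpy as np

# Hypothetical token ids; a real system would use the trained vocabulary.
VOCAB = {"Hello": 42, ".": 23, "This": 73, "is": 3, "a": 8, "test": 19, "Bye": 98}

def make_encoder_input(sentences, max_len=6):
    """Tokenize, map to ids, and right-pad each sequence to max_len.

    Returns the padded id matrix and the per-sentence lengths that the
    sequence-length buffer holds downstream.
    """
    ids = [[VOCAB[tok] for tok in s.split()] for s in sentences]
    lengths = np.array([len(seq) for seq in ids], dtype=np.int32)
    batch = np.zeros((len(ids), max_len), dtype=np.int32)
    for row, seq in enumerate(ids):
        batch[row, : len(seq)] = seq
    return batch, lengths

batch, lengths = make_encoder_input(["Hello .", "This is a test .", "Bye ."])
```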

Page 21

21

[Diagram] Encoder: the Encoder Input (token ids) and Sequence Lengths (2, 5, 2) feed an Embedding plugin, whose output

.1   .35  0   0  0   0
.123 .93  1.4 1  .01 0
.42  .20  0   0  0   0

goes into a Packed RNN initialized with the trained hidden and cell states. The encoder produces its hidden state, cell state, and the context vector.

Page 22

22

DECODER

[Diagram] Decoder: Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) → Output.

Page 23

23

Decoder, 1st Iteration

[Diagram] The start-of-sentence token <S> for each batch element passes through the Embedding plugin into the RNN, which is seeded with the encoder hidden and cell states. The RNN emits the decoder hidden and cell states plus the decoder output (Batch0: .124, ..., BatchN: .912).

Page 24

24

Decoder, 2nd+ Iteration

[Diagram] From the second step on, the decoder input is one token per (batch, beam):

Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | こ    ん    に    ち    は
N     | さ    よ    う    な    ら

The Embedding plugin feeds the RNN together with the previous hidden and cell states; the RNN emits the next states and the decoder output:

Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | .18   .32   .85   .39   .75
N     | .79   .27   .81   .93   .73

Page 25

25

Global Attention Model

[Diagram] Inputs: the decoder output (one value per batch x beam), the context vector, the sequence-length buffer, and the fully-connected weights.

Decoder output:
Batch | Beam0 Beam1 Beam2 Beam3 Beam4
0     | .18   .32   .85   .39   .75
N     | .79   .27   .81   .93   .73

Pipeline: BatchedGemm (scores against the context vector) → RaggedSoftmax (masked by sequence length) → BatchedGemm (attention-weighted context) → Concat → FullyConnected → TanH.
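The attention pipeline above (BatchedGemm → RaggedSoftmax → BatchedGemm → Concat → FullyConnected → TanH) can be sketched in NumPy. This is a minimal illustration of Luong-style global attention under assumed single-layer shapes; `global_attention` and its arguments are hypothetical names, not the TensorRT API.

```python
import numpy as np

def global_attention(dec_out, context, seq_lens, W_c):
    """Luong-style global attention, illustrative only.

    dec_out:  (B, H)    decoder RNN output for the current step
    context:  (B, T, H) encoder outputs (the context vector buffer)
    seq_lens: (B,)      valid source lengths for the ragged softmax
    W_c:      (2H, H)   weights of the final FullyConnected layer
    """
    # BatchedGemm: alignment score for every source position
    scores = np.einsum("bh,bth->bt", dec_out, context)
    # RaggedSoftmax: mask padded positions, then normalize
    T = context.shape[1]
    mask = np.arange(T)[None, :] < seq_lens[:, None]
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)
    # BatchedGemm: attention-weighted sum of the context vectors
    ctx = np.einsum("bt,bth->bh", attn, context)
    # Concat + FullyConnected + TanH
    return np.tanh(np.concatenate([ctx, dec_out], axis=1) @ W_c)

# Toy shapes: batch 2, source length 3, hidden size 4
out = global_attention(np.ones((2, 4)), np.ones((2, 3, 4)),
                       np.array([2, 3]), np.eye(8, 4))
```

The ragged softmax is the point of interest: padded source positions are masked to negative infinity so each row normalizes only over its valid length.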

Page 26

26

Projection

[Diagram] Attention output (a vector per batch x beam):

Batch | Beam0     Beam1     Beam2    Beam3     Beam4
0     | [.9,…,.1] [0,…,.3]  [.1,…,0] [.6,…,.8] [.3,…,.2]
N     | [.4,…,.9] [.5,…,.2] [0,…,.7] [0,…,2]   [.1,…,.9]

Pipeline: FullyConnected (with the projection weights) → Softmax → Log, producing the projection output (log-probabilities per batch x beam; values omitted on the slide).
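The Projection → Softmax → Log chain is equivalent to a log-softmax over the vocabulary logits. A hedged NumPy sketch with hypothetical names, not the TensorRT API:

```python
import numpy as np

def project_and_log_softmax(attn_out, W_proj):
    """Projection -> Softmax -> Log as one numerically stable log-softmax.

    attn_out: (B*K, H) attention output rows (batch x beam)
    W_proj:   (H, V)   projection (FullyConnected) weights
    """
    logits = attn_out @ W_proj                     # FullyConnected
    m = logits.max(axis=1, keepdims=True)          # stabilization shift
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return logits - log_z                          # log-probabilities

# Toy shapes: 2 (batch x beam) rows, hidden size 3, vocab size 5
log_probs = project_and_log_softmax(np.ones((2, 3)), np.ones((3, 5)))
```

Working in log space lets the later beam-scoring stage accumulate sentence scores by addition instead of repeated multiplication.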

Page 27

27

TopK Part 1

[Diagram] The projection output (per batch x beam) goes through an intra-beam TopK, keeping the 2 best tokens inside each beam:

      | Beam0   Beam1    Beam2   Beam3    Beam4
Index | [1,3]   [2,4]    [9,0]   [5,0]    [7,6]
Prob  | [.9,.8] [.99,.5] [.3,.8] [.1,.93] [.85,.99]

A Gather then flattens the winners.
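The intra-beam stage can be sketched as a partial sort per beam, mirroring the table above. `np.argpartition` returns the k winners in unspecified order, which matches the unordered pairs on the slide (e.g. [.1,.93]). `intra_beam_topk` is a hypothetical helper, not TensorRT code:

```python
import numpy as np

def intra_beam_topk(log_probs, k=2):
    """First TopK stage: the k best tokens inside each beam.

    log_probs: (beams, vocab) projection output for one batch element
    Returns (indices, probs) per beam; order within each beam's k
    winners is unspecified (a partial sort, not a full sort).
    """
    idx = np.argpartition(-log_probs, k - 1, axis=1)[:, :k]
    probs = np.take_along_axis(log_probs, idx, axis=1)
    return idx, probs

idx, probs = intra_beam_topk(
    np.array([[0.1, 0.9, 0.2, 0.8],
              [0.5, 0.99, 0.3, 0.0]]))
```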

Page 28

28

TopK Part 2

[Diagram] Gather output:
Prob [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99]

Inter-beam TopK over the flattened candidates:
Indices [2,9,7,0,8]
Prob [.99,.99,.93,.9,.85]

The Beam Mapping plugin maps each winner back to its intra-beam origin:
Output: Beam1,Idx2 · Beam4,Idx6 · Beam3,Idx0 · Beam0,Idx1 · Beam4,Idx7
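The inter-beam stage plus the beam-mapping step can be sketched as a flatten, a global top-k, and an index division. Fed the intra-beam winners from the previous slide, this reproduces the output shown above. Again a hypothetical helper, not the plugin's actual code:

```python
import numpy as np

def inter_beam_topk(beam_idx, beam_probs, k=5):
    """Second TopK stage plus beam mapping: pick the k best candidates
    across all beams, then map each winner back to (beam, token index).

    beam_idx, beam_probs: (beams, k0) output of the intra-beam TopK
    """
    flat_probs = beam_probs.ravel()
    flat_idx = beam_idx.ravel()
    # Stable sort so ties resolve in beam order, as on the slide
    order = np.argsort(-flat_probs, kind="stable")[:k]
    beams = order // beam_idx.shape[1]   # which beam each winner came from
    return [(int(b), int(flat_idx[o]), float(flat_probs[o]))
            for b, o in zip(beams, order)]

# Intra-beam winners from the previous slide
winners = inter_beam_topk(
    np.array([[1, 3], [2, 4], [9, 0], [5, 0], [7, 6]]),
    np.array([[.9, .8], [.99, .5], [.3, .8], [.1, .93], [.85, .99]]))
```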

Page 29

29

Beam Search – Beam Shuffle

[Diagram] Given the winners (Beam1,Idx2 · Beam4,Idx6 · Beam3,Idx0 · Beam0,Idx1 · Beam4,Idx7), the Beam Shuffle plugin reorders the per-beam state: slots [Beam0 … Beam4 State] become [Beam1, Beam4, Beam3, Beam0, Beam4 State+1], duplicating Beam4 because it produced two winners.
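The shuffle itself is just a gather along the beam axis, keyed by each winner's source beam; a beam with two winners (Beam4 here) is copied twice. A sketch with hypothetical names:

```python
import numpy as np

def beam_shuffle(states, winner_beams):
    """Reorder per-beam state so slot i holds the state of the beam
    that produced winner i.

    states:       (beams, H) hidden/cell state, one row per beam
    winner_beams: (beams,)   source beam of each surviving candidate
    """
    return states[winner_beams]  # a gather along the beam axis

states = np.arange(5)[:, None] * np.ones((1, 3))    # beam i holds value i
shuffled = beam_shuffle(states, np.array([1, 4, 3, 0, 4]))
```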

Page 30

30

Beam Search – Beam Scoring

[Diagram] The Beam Scoring plugin takes the shuffled states (Beam0 … Beam4 State+1) and performs:

• EOS Detection

• Sentence Probability Update

• Backtrack State Storage

• Sequence Length Increment

• End of Beam/Batch Heuristic

It emits a batch-finished bitmap: [0001100011…010]
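Two of the plugin's bookkeeping steps, EOS detection and sequence-length increment with the finished bitmap, can be sketched as below; probability update, backtrack storage, and the end-of-beam/batch heuristic are omitted, and `EOS`/`score_step` are hypothetical names:

```python
import numpy as np

EOS = 3  # hypothetical end-of-sequence token id

def score_step(tokens, seq_lens, finished):
    """EOS detection, sequence-length increment, and bitmap update.

    tokens:   (B,) token chosen this step for each batch element
    seq_lens: (B,) decoded lengths so far
    finished: (B,) bool bitmap of completed batch elements
    """
    newly_done = (tokens == EOS) & ~finished
    seq_lens = seq_lens + (~finished).astype(np.int32)  # live beams grow
    finished = finished | newly_done                    # update bitmap
    return seq_lens, finished

lens, done = score_step(np.array([3, 7]), np.array([2, 2]),
                        np.array([False, False]))
```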

Page 31

31

Beam Search – Batch Reduction

[Diagram] The batch-finished bitmap [0001100011…010] goes through a Reduce(Sum) operation; the 32-bit result is transferred to the host as the new batch size. The Encoder/State Reduction plugin then uses a TopK and Gather to compact the encoder output and beam state down to the surviving batch elements.
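The reduction can be sketched as a sum over the live mask plus the gather indices the compaction would use; `reduce_batch` is a hypothetical helper name:

```python
import numpy as np

def reduce_batch(finished):
    """Sum the finished bitmap; the live count becomes the new batch
    size sent to the host, and the surviving row indices drive the
    gather that compacts encoder output and beam state.
    """
    live = ~finished
    new_batch = int(live.sum())        # Reduce(Sum), transferred to host
    keep = np.nonzero(live)[0]         # rows to gather into the new batch
    return new_batch, keep

new_batch, keep = reduce_batch(np.array([False, True, False, True, True]))
```

Shrinking the batch this way avoids spending decoder iterations on sentences that have already emitted EOS.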

Page 32

32

Output

[Diagram] If not all sequences are done, the winners (Beam1,Idx2 · Beam4,Idx6 · Beam3,Idx0 · Beam0,Idx1 · Beam4,Idx7) become the next decoder input. Once all are done, the beam state is copied device-to-host and backtracked on the host to produce the output:

こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Goodbye.")
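The host-side backtrack walks the stored per-step records from the best final beam back to the start of the sentence. A sketch assuming a hypothetical `history` layout (one list per time step, indexed by beam, holding (parent beam, token) pairs):

```python
def backtrack(history, final_beam):
    """Recover the output token sequence by walking the stored
    (parent beam, token) records backwards from the best final beam.
    """
    tokens, beam = [], final_beam
    for step in reversed(history):
        parent, token = step[beam]
        tokens.append(token)
        beam = parent
    return list(reversed(tokens))

# Two steps, two beams: each entry is (parent beam, token id)
history = [[(0, 7), (0, 5)],
           [(1, 9), (0, 2)]]
```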

Page 33

33

TENSORRT ANALYSIS

Page 34

34

TENSORRT KERNEL ANALYSIS

Page 35

35

Agenda

• What is NMT?

• What is the current state?

• What are the problems?

• How did we solve it?

• What perf is possible?

Page 36

36

RESULTS

[Chart: OpenNMT inference throughput (sentences/sec) and latency (ms). CPU-Only + Torch: 280 ms. V100 + Torch: 425 sentences/sec, 153 ms. V100 + TensorRT: 550 sentences/sec, 117 ms.]

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On.

Page 37

37

SUMMARY

• Showed that TopK no longer dominates sequence inference time.

• Showed that RNN inference is compute bound, not memory bound.

• TensorRT accelerates sequence inferencing.

• Over two orders of magnitude higher throughput than CPU.

• Latency reduced by more than half versus CPU.

developer.nvidia.com/tensorrt

PRODUCT PAGE

Page 38

38

LEARN MORE

developer.nvidia.com/tensorrt

PRODUCT PAGE

docs.nvidia.com/deeplearning/sdk

DOCUMENTATION

nvidia.com/dli

TRAINING

Page 39

39

Q&A

Page 40: S8822 OPTIMIZING NMT WITH TENSORRTon-demand.gputechconf.com/gtc/2018/presentation/... · Batch Beam 0 Beam1 Beam2 Beam3 Beam4 0 こ ん に ち は N さ よ う な ら Batch0 .124