Inference Optimization Using TensorRT with Use Cases · 2018. 11. 26.


Page 1

Jack Han / 한재근 | Solutions Architect | NVIDIA

Inference Optimization Using TensorRT with Use Cases

Page 3

TensorRT 4 Adoption

Use Cases: Video, Maps, Image, NLP, Speech, Search

Page 4

AI Inference is exploding

SPEECH: 1 Billion Voice Searches Per Day (Google, Bing, etc.)

LIVE VIDEO: 1 Billion Videos Watched Per Day (Facebook)

RECOMMENDATIONS: 1 Trillion Ads/Rankings Per Day (Impressions)

Page 5

Bigger and More Compute Intensive in quest for accuracy

[Figure: three panels plotting GOP * Bandwidth by year]

Image, 2011-2017: ResNet-50, Inception-v2, Inception-v4, AlexNet, GoogleNet (350X)

Speech, 2013-2018: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X)

Translation, 2014-2018: MoE, OpenNMT, GNMT (10X)

Page 6

Cambrian Explosion

Convolutional Networks: ReLu, Encoder/Decoder, BatchNorm, Dropout, Pooling, Concat

Recurrent Networks: GRU, LSTM, CTC, Beam Search, WaveNet, Attention

Generative Adversarial Networks: Speech Enhancement GAN, Coupled GAN, Conditional GAN, MedGAN, 3D-GAN

Reinforcement Learning: DQN, Simulation, DDPG

New Species: Neural Collaborative Filtering, Mixture of Experts, Block Sparse LSTM, Capsule Nets

Page 7

Inefficiency Limits Innovation

Custom Development: Developers need to reinvent the plumbing for every application

Single Model Only: Some systems are overused while others are underutilized (ASR, NLP, Recommender)

Single Framework Only: Solutions can only support models from one framework

Page 8

More Accuracy = Time

Speed/accuracy trade-offs for modern convolutional object detectors, Google Research

Page 9

Inference Performance

Latency, Efficiency, Throughput

Easy Deploy

Page 10

Inference Optimization

Lightweight Model

Small Computation

Suitable for Lightweight Hardware

E.g.: MobileNet, SqueezeNet, SEP-Nets

Deep Compression

No Network Modification

Continue the Innovation

E.g.: Quantization and Binarization, Network Pruning and Sharing, Distillation / Factorization

Page 11

TensorRT: From framework to target

FRAMEWORKS -> TensorRT (Optimizer, Runtime) -> GPU PLATFORMS

GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA T4, TESLA V100

Page 12

Speed-up

[Figure: ResNet-50 on V100. Throughput (images/sec, 0 to 5000) versus Batch Response time (ms, 0 to 60) for TensorRT INT8, TensorRT FP16, TensorRT FP32, GPU Native FP32 and CPU Native FP32]

Page 13

TensorRT 5

More Layers / Plugin / APIs

Inference Server Integration

Page 14

TensorRT Workflow

Step 1. Optimize trained model

Model -> Optimize -> Plan

Page 15

TensorRT Workflow

Step 2. Deploy optimized plan

Plan -> Runtime Engine -> Embedded, Automotive, Data center

Page 16

Step-by-Step TensorRT Plan Build

Network (C++/Python API or Model Parser) -> Network Definitions -> TensorRT Builder -> Engine

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#support_op

Page 17

7 Steps to Deployment with TensorRT

Step 1: Convert trained model into TensorRT format

Step 2: Create a model parser

Step 3: Register inputs and outputs

Step 4: Optimize model and create a runtime engine

Step 5: Serialize optimized engine

Step 6: De-serialize engine

Step 7: Perform inference

Page 18

Custom Layer Support using Plugin Layer

Network (C++/Python API or Model Parser) -> Network Definitions -> TensorRT Builder -> Engine

Plugin Factory: Plugin A, Plugin B

Page 19

Plugin Layer Procedure

Components: Plugin Layer, ICudaEngine, Plugin Factory

Phase 1. Initialize: Layer? Initialize (missing)

Phase 2. Inference: enqueue

Phase 3. Store/Load: Serialize, Deserialize
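The three phases can be pictured with a small stand-alone sketch. This is not the TensorRT plugin API: `ToyLeakyReluPlugin` and every method on it are hypothetical names chosen only to mirror the phase labels on the slide, and the "layer" is a leaky ReLU over a plain Python list.

```python
import struct


class ToyLeakyReluPlugin:
    """Hypothetical stand-in for a custom layer; not a TensorRT class."""

    def __init__(self, negative_slope):
        self.negative_slope = negative_slope
        self.ready = False

    def initialize(self):
        # Phase 1: acquire whatever the layer needs before inference.
        self.ready = True

    def enqueue(self, inputs):
        # Phase 2: run the layer on one batch (here a list of floats).
        if not self.ready:
            raise RuntimeError("initialize() must be called before enqueue()")
        return [x if x >= 0.0 else x * self.negative_slope for x in inputs]

    def serialize(self):
        # Phase 3 (store): pack the layer parameters into bytes.
        return struct.pack("<d", self.negative_slope)

    @classmethod
    def deserialize(cls, blob):
        # Phase 3 (load): rebuild the layer from the stored bytes.
        (negative_slope,) = struct.unpack("<d", blob)
        return cls(negative_slope)


plugin = ToyLeakyReluPlugin(0.25)
plugin.initialize()
outputs = plugin.enqueue([-2.0, 0.0, 4.0])

# Store the layer, load it back, and run the same batch again.
restored = ToyLeakyReluPlugin.deserialize(plugin.serialize())
restored.initialize()
restored_outputs = restored.enqueue([-2.0, 0.0, 4.0])
```

The point of the round trip is that everything the layer needs at inference time has to survive serialization, since a deployed plan is loaded without the original model definition.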

Page 20

Supporting Features

Feature             C++   Python (x86 only)   NvCaffeParser   NvUffParser
CNNs                Yes   Yes                 Yes             Yes
RNNs                Yes   Yes                 No              No
INT8 Calibration    Yes   Yes                 N/A             N/A
Asymmetric Padding  Yes   Yes                 No              No

Page 21

TensorRT API: Define your network using C/C++ or Python

INetworkDefinition* network = builder->createNetwork();

convertRNNWeights, convertRNNBias

setWeightsForGate, setBiasForGate

Page 22

TensorRT RNN Programming

Dump Weights -> Load Weights -> Convert Weights -> Set Weights

ckpt -> wts (TF format) -> TRT format -> TensorRT RNN Layer

Dump script: /usr/src/tensorrt/samples/common/dumpTFWts.py

Get started with sampleCharRNN in documentation

Page 23

NMT with Transformer Inference

Running 3 bound engines (Encoder, Generator, Shuffle Engine)

[Diagram blocks: Encoder, Decoder, Beam Search; Input, Input Setup, Encoder RNN, Decoder RNN, Attention Model, Projection, TopK, Beam Scoring, Beam Shuffle, Batch Reduction, Output. Legend: Runs on CPU / GPU-Accelerated]

* CPU-Only: TensorFlow on SKL 6140 18 core, FP32; GPU: V100, TensorRT 5, FP16; Sorted data, Batch=128, English to German

Support NMT layers such as Gather, Softmax, Batch GEMM and Top K

Modular Network Merge

Deploy highly-optimized language translation apps in production environments

Get started with NMT sample in documentation
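To make the Softmax and Top K blocks concrete, here is a minimal pure-Python sketch of one beam-search step. It is an illustration only, not code from the NMT sample: the function names and all numbers are made up, and it runs on plain lists rather than on TensorRT layers.

```python
import heapq
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def beam_step(beam_scores, beam_logits, beam_width):
    """Expand every beam by every token and keep the best candidates.

    Returns (score, parent_beam, token) tuples, best first. The
    parent_beam index is what a gather/shuffle step would use to
    reorder the decoder state for the next step.
    """
    candidates = []
    for beam, (score, logits) in enumerate(zip(beam_scores, beam_logits)):
        for token, logp in enumerate(log_softmax(logits)):
            candidates.append((score + logp, beam, token))
    return heapq.nlargest(beam_width, candidates)


# Two live beams over a 3-token vocabulary (made-up numbers).
scores = [0.0, -1.0]
logits = [[2.0, 0.0, -2.0], [0.0, 3.0, 0.0]]
best = beam_step(scores, logits, beam_width=2)
parents = [beam for _, beam, _ in best]
tokens = [token for _, _, token in best]
```

With these inputs each beam keeps its own most likely token, so `parents` is `[0, 1]` and `tokens` is `[0, 1]`.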

Page 24

Recommendation Inference

[Diagram: User, Context, Item 1, Item 2, ..., Item N -> Embedding (xN) -> Concat -> Fused MLP Kernel (Fully Connected, Bias, Activation, three times) -> TopK -> Output Top Scores. Legend: Runs on CPU / GPU-Accelerated]

* CPU-Only: TensorFlow on Intel Xeon E5-2698 v4 CPU at 2.20 GHz; GPU: V100 with custom networks

Deploy multi-layer perceptron (MLP) based recommendation apps in production

Predict events (click, purchase, interest) accurately based on input items (user, query, observed activity)

Get started with examples on MNIST character recognition and movie recommender system in the documentation
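The data flow in the diagram (embedding lookup, concat, fully connected layers with bias and activation, then top-K) can be sketched in a few lines of plain Python. Everything below is illustrative: the table contents, weights, IDs and function names are made up, ReLU is assumed as the activation, and only two layers are used instead of the three shown on the slide.

```python
import heapq


def dense_relu(x, weights, bias):
    """One fully connected layer followed by bias add and ReLU."""
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
            for row, b in zip(weights, bias)]


def score_item(user_vec, context_vec, item_vec, layers):
    """Concatenate the embeddings and run them through the MLP stack."""
    x = user_vec + context_vec + item_vec  # list concatenation = Concat
    for weights, bias in layers:
        x = dense_relu(x, weights, bias)
    return x[0]  # the last layer has a single output unit: the score


# Made-up embedding tables (id -> vector) and MLP weights.
user_table = {7: [1.0, 0.0]}
context_table = {"mobile": [0.0, 1.0]}
item_table = {0: [1.0, 1.0], 1: [0.0, 2.0], 2: [3.0, 0.0]}
layers = [
    ([[1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0, 0.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 0.5]], [0.0]),
]

# Score every candidate item for one user/context pair, then keep the top 2.
scored = [(score_item(user_table[7], context_table["mobile"], vec, layers), item_id)
          for item_id, vec in item_table.items()]
top_items = heapq.nlargest(2, scored)
```

Here the three items score 2.5, 2.0 and 4.0, so `top_items` is `[(4.0, 2), (2.5, 0)]`. In a fused kernel the three per-layer operations would run as one launch; the sketch keeps them separate for readability.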

Page 25

Inferencing with Low precision

[Figure: Peak Performance (TFLOPS / TOPS), axis 0 to 300]

P4: Float 5.5, INT8 22

T4: Float 65, INT8 130, INT4 260

Page 26

Post Training Quantization

Minimize information loss between FP32 and INT8 inference on calibration dataset

100 mini-batches are sufficient to determine quantization parameter (~500 samples)
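As a rough illustration of what "determining the quantization parameter" means, the sketch below quantizes calibration values to INT8 with a symmetric scale and picks the clipping threshold that reconstructs them best. It is not TensorRT's calibrator: mean squared error is used as a simpler stand-in for the information-loss measure the slide refers to, and the function names, candidate thresholds and synthetic data are all assumptions.

```python
def quantize(values, threshold):
    """Map floats to INT8 codes in [-127, 127] with a symmetric scale."""
    scale = threshold / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


def reconstruction_error(values, threshold):
    """Mean squared error between the values and their INT8 round trip."""
    codes, scale = quantize(values, threshold)
    restored = dequantize(codes, scale)
    return sum((v - r) ** 2 for v, r in zip(values, restored)) / len(values)


def pick_threshold(values, candidates):
    """Return the candidate clipping threshold with the lowest error."""
    return min(candidates, key=lambda t: reconstruction_error(values, t))


# Synthetic "activations": about 500 values in [-1, 1] plus two outliers.
values = [((i * 37) % 101 - 50) / 50.0 for i in range(500)] + [4.0, -3.5]
candidates = [1.0, 2.0, 4.0]
best = pick_threshold(values, candidates)
codes, scale = quantize(values, best)
```

A small threshold gives fine resolution for typical values but clips outliers; a large one keeps outliers but spends the 255 available codes on a wider range. Calibration is the search for the threshold that balances the two on representative data.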

Page 27

New INT8 APIs and Optimizations

Auto Calibration: FP32 Training -> FP32 weights + Calibration Data, O(1000) Images -> Optimize to INT8

Custom Calibration: FP32 Training -> FP32 weights -> Custom Calibration -> FP32 or INT8 weights + Custom Activation Ranges -> Optimize to INT8

Quantization Aware Training: Quantization Aware Training in FP32 -> FP32 or accurate INT8 weights + Custom Activation Ranges -> Optimize to INT8

Maximize throughput at low latency with mixed precision compute in production

Apply INT8 quantization aware training or custom calibration algorithms with new APIs

Control precision per-layer with new APIs

Optimizations for depthwise convolution operation

Get started with the INT8 sample in the documentation

Page 28

Inference Performance

[Figure: two panels, Throughput (axis 0 to 3000) and Performance/WATT (axis 0 to 25), each with x-axis values 1, 2, 4, 8 and series P4 - INT8 and V100 - Mixed. Higher is better.]

Page 29

WIDELY ADOPTED

Page 30

NVIDIA INFERENCE MOMENTUM

Image Tagging, Video Analysis, Advertising Impact, Video Captioning, Visual Search, Finding Music, Sports Performance, Customer Service, Visual Search, Industrial Inspection, Voice Recognition, Cybersecurity

Page 31

"Using GPUs made it possible to enable media understanding of the Twitter platform, by drastically reducing media deep learning model training time and by allowing us to derive real-time understanding of live videos at inference time."

- Nicolas Koumchatzky, Head of Twitter Cortex

Page 32

"At Microsoft, we are working hard to deliver the most advanced AI-powered services to our customers. Using NVIDIA GPUs in real-time inference workloads has improved Bing's advanced search offerings, enabling us to reduce object detection latency for images. We look forward to working with NVIDIA's next generation inference hardware and software to expand the way people benefit from AI products and services."

- Jordi Ribas, CVP, Bing and AI Products, Microsoft

"AI is becoming increasingly pervasive, and inference is a critical capability customers need to successfully deploy their AI models, so we're excited to support NVIDIA's Turing Tesla T4 GPUs on Google Cloud Platform soon."

- Chris Kleban, product manager at Google Cloud

Page 33

SEOUL | NOVEMBER 7 - 8, 2018

www.nvidia.com/ko-kr/ai-conference/