Page 1

정소영 (Soyoung Jung), Executive Director ([email protected]) / July 2, 2019

MAXIMIZING UTILIZATION FOR DATA CENTER INFERENCE WITH TENSORRT INFERENCE SERVER

Page 2

WORLD’S MOST ADVANCED SCALE-OUT GPU

INTEGRATED INTO TENSORFLOW & ONNX SUPPORT

TENSORRT HYPERSCALE INFERENCE PLATFORM

TENSORRT INFERENCE SERVER

Page 3

TENSORRT INFERENCE SERVER

A Software Application for Deploying AI Models At Scale

▪ Maximum GPU Utilization

▪ Mechanisms for Large-Scale Inference Service

▪ Optimized for Management & Monitoring

▪ GitHub: https://github.com/NVIDIA/tensorrt-inference-server
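The server ships as a container on NGC and is started by pointing it at a model repository. A minimal sketch of pulling and launching it (the image tag, ports, paths and flags below are illustrative; check the release notes for current values):

  # Pull the inference server container (tag is an example)
  docker pull nvcr.io/nvidia/tensorrtserver:19.06-py3

  # Start the server: 8000 = HTTP, 8001 = GRPC, 8002 = Prometheus metrics
  docker run --runtime=nvidia --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/model_repository:/models \
    nvcr.io/nvidia/tensorrtserver:19.06-py3 \
    trtserver --model-store=/models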

Page 4

TENSORRT INFERENCE SERVER
Architected for Maximum Datacenter Utilization

Support a variety of model frameworks

TensorRT, TensorFlow, Caffe2, custom

Support concurrent model execution, one or multiple models

Multi-model, multi-GPU and asynchronous HTTP and GRPC request handling

Support many model types: CNN, RNN, “stateless”, “stateful”

Multiple scheduling and batching algorithms

Enable both “online” and “offline” inference use cases

Batch 1, batch n, dynamic batching

Enable scalable, reliable deployment

Prometheus metrics, live/ready endpoints, Kubernetes integration
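As a sketch of the live/ready endpoints mentioned above (paths follow the server's HTTP API and should be verified against the documentation for your release), these map naturally onto Kubernetes liveness and readiness probes:

  curl -i localhost:8000/api/health/live    # liveness
  curl -i localhost:8000/api/health/ready   # readiness (models loaded and able to serve)
  curl localhost:8000/api/status            # server and per-model status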

Page 5

EXTENSIBLE ARCHITECTURE

Extensible backend architecture allows multiple framework and custom support

Extensible scheduler architecture allows support for different model types and different batching strategies

Leverage CUDA to support model concurrency and multi-GPU

Page 6

AVAILABLE METRICS

Category            | Name             | Use Case                                                                                              | Granularity | Frequency
GPU Utilization     | Power usage      | Proxy for load on the GPU                                                                             | Per GPU     | Per second
GPU Utilization     | Power limit      | Maximum GPU power limit                                                                               | Per GPU     | Per second
GPU Utilization     | GPU utilization  | GPU utilization rate [0.0 - 1.0)                                                                      | Per GPU     | Per second
GPU Memory          | GPU total memory | Total GPU memory, in bytes                                                                            | Per GPU     | Per second
GPU Memory          | GPU used memory  | Used GPU memory, in bytes                                                                             | Per GPU     | Per second
Count (GPU & CPU)   | Request count    | Number of inference requests                                                                          | Per model   | Per request
Count (GPU & CPU)   | Execution count  | Number of model inference executions (request count / execution count = avg dynamic request batching) | Per model   | Per request
Count (GPU & CPU)   | Inference count  | Number of inferences performed (one request counts as "batch size" inferences)                        | Per model   | Per request
Latency (GPU & CPU) | Request time     | End-to-end inference request handling time                                                            | Per model   | Per request
Latency (GPU & CPU) | Compute time     | Time a request spends executing the inference model (in the appropriate framework)                    | Per model   | Per request
Latency (GPU & CPU) | Queue time       | Time a request spends waiting in the queue before being executed                                      | Per model   | Per request
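All of the above are exported in Prometheus text format on the metrics port (8002 by default). The metric names and labels below are illustrative examples, not an authoritative list:

  curl localhost:8002/metrics

  # example output lines
  # nv_gpu_utilization{gpu_uuid="GPU-..."} 0.60
  # nv_inference_request_count{model="modelx",version="1"} 1024
  # nv_inference_queue_duration_us{model="modelx",version="1"} 53400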

Page 7

MODEL REPOSITORY

File-system based repository of the models loaded and served by the inference server

Model metadata describes framework, scheduling, batching, concurrency and other aspects of each model

ModelX (platform: TensorRT,   scheduler: default,          concurrency: …)
ModelY (platform: TensorRT,   scheduler: dynamic-batcher,  concurrency: …)
ModelZ (platform: TensorFlow, scheduler: sequence-batcher, concurrency: ...)
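A minimal sketch of one entry in the repository: a versioned model directory plus a config.pbtxt holding the metadata described above. The model name, tensor names, shapes and values here are placeholders:

  models/
    modely/
      config.pbtxt
      1/
        model.plan          # TensorRT engine for version 1

  # config.pbtxt
  name: "modely"
  platform: "tensorrt_plan"
  max_batch_size: 8
  input [ { name: "INPUT0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]
  output [ { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
  instance_group [ { count: 2, kind: KIND_GPU } ]
  dynamic_batching { preferred_batch_size: [ 4 ] }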

Page 8

BACKEND ARCHITECTURE

Backend acts as interface between inference requests and a standard or custom framework

Supported standard frameworks: TensorRT, TensorFlow, Caffe2

Providers efficiently communicate inference request inputs and outputs (HTTP or GRPC)

Efficient data movement, no additional copies

(Diagram: an inference request arrives at the ModelX Backend; Providers hand the input tensors to the Default Scheduler and the TensorRT Runtime executing ModelX, and return the output tensors)

Page 9

CUSTOM FRAMEWORK
Integrate Custom Logic Into Inference Server

Provide implementation of your “framework”/”runtime” as shared library

Implement simple API: Initialize, Finalize, Execute

All inference server features are available: multi-model, multi-GPU, concurrent execution, scheduling and batching algorithms, etc.

(Diagram: an inference request arrives at the ModelCustom Backend; Providers hand the input tensors through the Default Scheduler to a Custom Wrapper around the Custom Runtime loaded from libcustom.so, which returns the output tensors)
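A rough sketch of the model configuration for such a custom backend: the shared library named here is the one implementing the Initialize/Finalize/Execute API described above. Model name, tensor shapes and the library filename are placeholders:

  name: "modelcustom"
  platform: "custom"
  default_model_filename: "libcustom.so"
  max_batch_size: 8
  input [ { name: "INPUT0", data_type: TYPE_FP32, dims: [ 16 ] } ]
  output [ { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 16 ] } ]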

Page 10

MULTIPLE MODELS

(Diagram: three backends serving inference requests side by side: ModelX Backend (Default Scheduler, TensorRT Runtime), ModelY Backend (Dynamic Batcher, TensorRT Runtime) and ModelZ Backend (Sequence Batcher, TensorFlow Runtime))

Page 11

MODEL CONCURRENCY
Multiple Models Sharing a GPU

By default each model gets one instance on each available GPU (or 1 CPU instance if no GPUs)

Each instance has an execution context that encapsulates the state needed by the runtime to execute the model

(Diagram: ModelX, ModelY and ModelZ Backends share one GPU; ModelX Backend shown with its Default Scheduler, TensorRT Runtime and a single execution context)

Page 12

MODEL CONCURRENCY
Multiple Instances of the Same Model

Model metadata allows multiple instances to be configured for each model

Multiple model instances allow multiple inference requests to be executed simultaneously

(Diagram: ModelX Backend with Default Scheduler and TensorRT Runtime holding three execution contexts on one GPU)
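The number of execution contexts per model is set in the model's configuration. A minimal sketch matching the three-context picture above (count and placement are illustrative):

  instance_group [
    {
      count: 3        # three execution contexts of this model
      kind: KIND_GPU  # place them on GPU; KIND_CPU is also possible
    }
  ]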

Page 13

MODEL CONCURRENCY
Multiple Instances of Multiple Models

(Diagram: on one GPU, ModelX Backend (Default Scheduler, TensorRT Runtime) holds three contexts, ModelY Backend (Dynamic Batcher, TensorRT Runtime) holds two contexts, and ModelZ Backend (Sequence Batcher, TensorFlow Runtime) holds two contexts)

Page 14

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: a timeline of incoming inference requests: three for ModelX and three for ModelY arrive over time; nothing is executing yet)

Page 15

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: the first ModelX request begins executing on the GPU)

Page 16

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: two ModelX requests execute concurrently)

Page 17

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: all three ModelX requests execute concurrently)

Page 18

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: three ModelX requests and the first ModelY request execute concurrently)

Page 19

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: three ModelX requests and two ModelY requests execute concurrently)

Page 20

CONCURRENT EXECUTION TIMELINE
GPU Activity Over Time

(Diagram: all three ModelX requests and all three ModelY requests execute concurrently on the GPU)

Page 21

SHARING A GPU
CUDA Enables Multiple Model Execution on a GPU

(Diagram: ModelX Backend (Default Scheduler, TensorRT Runtime, three contexts) and ModelY Backend (Dynamic Batcher, TensorRT Runtime, two contexts) issue work through CUDA streams to the GPU's hardware scheduler)

Page 22

MULTI-GPU
Execution Contexts Can Target Multiple GPUs

(Diagram: ModelX Backend and ModelY Backend contexts issue work through CUDA streams to the hardware schedulers of two separate GPUs)

Page 23

BATCHING VS NON-BATCHING

Batch size = 1

• Run a single inference task on a GPU

• Low-latency, but the GPU is underutilized

Batch size = N

• Group inference instances together

• High throughput and GPU utilization

• Allows employing Tensor Cores in Volta and Turing

Batching: Grouping Inference Requests Together


Page 24

SCHEDULER ARCHITECTURE

Scheduler responsible for managing all inference requests to a given model

Distribute requests to the available execution contexts

Each model can configure the type of scheduler appropriate for the model

(Diagram: a Model Backend in which the Scheduler feeds two Runtime contexts)

Page 25

DEFAULT SCHEDULER
Distribute Individual Requests Across Available Contexts

(Diagram: ModelX Backend with Default Scheduler and two Runtime contexts; legend shows batch-1 and batch-4 requests)

Page 26

DEFAULT SCHEDULER
Distribute Individual Requests Across Available Contexts

(Diagram: incoming requests to ModelX queued in the scheduler, ahead of the two Runtime contexts)

Page 27

DEFAULT SCHEDULER
Distribute Individual Requests Across Available Contexts

(Diagram: queued requests are assigned in order to the ready contexts)

Assuming the GPU is fully utilized by executing 2 batch-4 inferences at the same time:

Utilization = 3/8 = 37.5%

Page 28

DEFAULT SCHEDULER
Distribute Individual Requests Across Available Contexts

(Diagram: when a context completes, a new request is assigned to it)

Assuming the GPU is fully utilized by executing 2 batch-4 inferences at the same time:

Utilization = 2/8 = 25%

Page 29

DEFAULT SCHEDULER
Distribute Individual Requests Across Available Contexts

(Diagram: when a context completes, a new request is assigned to it)

Assuming the GPU is fully utilized by executing 2 batch-4 inferences at the same time:

Utilization = 4/8 = 50%

Page 30

DYNAMIC BATCHING SCHEDULER

Default scheduler takes advantage of multiple model instances

But GPU utilization depends on the batch size of each inference request

Batching is often one of the best ways to increase GPU utilization

Dynamic batching scheduler (aka dynamic batcher) forms larger batches by combining multiple inference requests

Group Requests To Form Larger Batches, Increase GPU Utilization
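Enabling the dynamic batcher is likewise a per-model configuration choice. A minimal sketch (the preferred batch sizes and queue delay are placeholder values; the delay bounds how long the batcher may hold a request while trying to form a larger batch):

  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }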

Page 31

DYNAMIC BATCHING SCHEDULER
Group Requests To Form Larger Batches, Increase GPU Utilization

(Diagram: ModelY Backend with Dynamic Batcher and two Runtime contexts; legend shows batch-1 and batch-4 requests)

Page 32

DYNAMIC BATCHING SCHEDULER
Group Requests To Form Larger Batches, Increase GPU Utilization

(Diagram: incoming requests to ModelY queued in the scheduler)

Page 33

DYNAMIC BATCHING SCHEDULER
Group Requests To Form Larger Batches, Increase GPU Utilization

(Diagram: ModelY Backend with Dynamic Batcher and two Runtime contexts)

Dynamic batcher configuration for ModelY can specify preferred batch-size. Assume 4 gives best utilization.

Dynamic batcher groups requests to give 100% utilization

Page 34

SEQUENCE BATCHING SCHEDULER

Default and dynamic-batching schedulers work with stateless models; each request is scheduled and executed independently

Some models are stateful: a sequence of inference requests must be routed to the same model instance

“Online” ASR, TTS, and similar models

Models that use LSTM, GRU, etc. to maintain state across inference requests

Multi-instance execution and batching are still required by these models to maximize GPU utilization

Sequence-batching scheduler provides dynamic batching for stateful models

Dynamic Batching for Stateful Models
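A rough sketch of a sequence-batcher configuration for such a stateful model: the control inputs tell the model when a new sequence starts and which batch slots carry valid requests. Tensor names and the idle timeout are placeholders, and the exact field names should be checked against the model-configuration documentation for your release:

  sequence_batching {
    max_sequence_idle_microseconds: 5000000
    control_input [
      { name: "START", control [ { kind: CONTROL_SEQUENCE_START, int32_false_true: [ 0, 1 ] } ] },
      { name: "READY", control [ { kind: CONTROL_SEQUENCE_READY, int32_false_true: [ 0, 1 ] } ] }
    ]
  }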

Page 35

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: ModelZ Backend with Sequence Batcher and two Runtime contexts; two incoming sequences, one of 3 inference requests and one of 5 inference requests)

Page 36

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: the inference requests of the two sequences arrive in arbitrary order and are queued by the sequence batcher)

Page 37

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: the sequence batcher allocates a context slot to each sequence and routes all of that sequence's requests to that slot)

(A context may still have available slots; they are not used for waiting requests because of the stateful-model requirement)

Page 38

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: the sequence batcher continues routing each sequence's remaining requests to its allocated context slot; the first request of each sequence has completed)

Page 39

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: later requests of each sequence continue to execute in their allocated context slots)

Page 40

SEQUENCE BATCHING SCHEDULER
Dynamic Batching for Stateful Models

(Diagram: the shorter sequence has completed; the remaining requests of the longer sequence finish in its context slot)

On a fully-loaded server, all context slots would be occupied by sequences.

As soon as one sequence ends another is allocated to the slot.

Page 41

ENSEMBLE MODELS
A way of pipelining models in TRTIS

(Diagram: a single inference request for an ensemble model is pipelined through ModelX Backend (Default Scheduler, TensorRT Runtime), ModelY Backend (Dynamic Batcher, TensorRT Runtime) and ModelZ Backend (Sequence Batcher, TensorFlow Runtime))
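A minimal sketch of how such a pipeline is declared: the ensemble is its own entry in the model repository, and its scheduling section wires one model's output tensor to the next model's input. Model names and tensor names are placeholders, and the ensemble's own input/output declarations are omitted for brevity:

  name: "my_pipeline"
  platform: "ensemble"
  ensemble_scheduling {
    step [
      {
        model_name: "modelx"
        model_version: -1
        input_map  { key: "INPUT0"  value: "PIPELINE_INPUT" }
        output_map { key: "OUTPUT0" value: "preprocessed" }
      },
      {
        model_name: "modely"
        model_version: -1
        input_map  { key: "INPUT0"  value: "preprocessed" }
        output_map { key: "OUTPUT0" value: "PIPELINE_OUTPUT" }
      }
    ]
  }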

Page 42

RECAP

Concurrent Model Execution: Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously

CPU Model Inference Execution: Framework-native models can execute inference requests on the CPU

Metrics: Utilization, count, and latency

Custom Backend: A custom backend allows the user more flexibility by providing their own implementation of an execution engine through the use of a shared library

Stateless / Stateful Inference: Supports many model types including CNN, RNN, etc.

Dynamic Batching: Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA

Multiple Model Format Support: TensorFlow GraphDef/SavedModel, TensorFlow and TensorRT GraphDef, TensorRT Plans, Caffe2 NetDef (ONNX import path)

Ensemble Model Support: An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models

Multi-GPU Support: The server can distribute inferencing across all system GPUs