
Page 1

DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Applications in Text Understanding

Taifeng Wang

Lead Researcher

Machine Learning Group, MSRA

2016 GTC China

Page 2

Microsoft Research Lab Locations

• Redmond, Washington, USA (Sep 1991)
• Cambridge, UK (July 1997)
• Beijing, China (Nov 1998)
• Bangalore, India (Jan 2005)
• Cambridge, Massachusetts, USA (2008)
• New York, USA (May 2012)

Page 3

Microsoft Research Asia

Technologies transferred into all major Microsoft products

Page 4

About DMTK (Distributed Machine Learning Toolkit)

http://dmtk.io

Released to GitHub on 2015.11.09 by the Machine Learning Group of MSRA.

https://github.com/Microsoft/DMTK

We focus on providing distributed machine learning infrastructure and algorithms to handle big-data and big-model learning tasks.

Page 5

DMTK User Engagement

• Within just one week after release (2015.11.10):
  • 1000+ stars and 200+ forks on GitHub
  • 1M+ visits to the DMTK homepage (www.dmtk.io)
  • 300K+ downloads of binary executables
• Major upgrade (2016.9)

Page 6

About DMTK: Roadmap

• 2015.11: Parameter Server 1.0; Distributed LightLDA; Distributed Word2Vec
• 2016.3: Parameter Server 2.0, with simpler SDK usage and system performance enhancements (e.g., memory & network cost reduction)
• 2016.6: Richer programming-language support for the parameter server (Python, Lua); connectors for Torch/Caffe/Theano
• 2016.9: Deep integration with CNTK; distributed logistic regression with online updates (FTRL); distributed gradient boosting decision tree (GBDT)
• 2016.12: Innovations in distributed optimization (DC-ASGD, ensemble models, accelerated optimization methods); model parallelism for deep learning models; 2C-RNN for text understanding; graph embedding

Page 7

Microsoft Distributed Machine Learning Toolkit (DMTK)

[Architecture overview:]
• Execution engines: YARN
• Multiverso Parameter Server:
  • Rich communication interfaces: MPI, ZeroMQ, RDMA, GPU Direct
  • Distributed synchronization mechanisms: MA / ADMM / BMUF, ASGD / DC-ASGD
  • Hybrid model store (matrix/tensor, hash table, tree) with model slicing
• Distributed machine learning algorithms: 2C-RNN, LightGBM, LightLDA, Distributed Word Embedding, Logistic Regression
• Parallelizes different machine learning toolkits: AzureML, CNTK, and other single-node DNN tools (Theano/Caffe/Torch)

Page 8

Workload supported

Workload | Model | Data | Training time
LightLDA | 20M vocab, 1M topics (largest topic model) | 200B tokens (Bing web chunk) | 60 hrs on 24 machines (nearly linear speed-up)
Word2Vec | 10M vocab, 1000 dims (largest word embedding) | 200B samples (Bing web chunk) | 40 hrs on 8 machines (nearly linear speed-up)
GBDT | 3000 trees (120 nodes each) | 7M records (Bing HRS data) | 3 hrs on 8 machines (4x speed-up)
LSTM | 20M parameters (4 hidden layers) | 1570 hrs of speech data (Windows Phone data) | 1 day on 16 GPUs (15.9x speed-up)
CNN | 41M parameters (GoogLeNet) | 2M images (ImageNet 1K dataset) | 30 hrs on 16 GPUs (10x speed-up)
Online FTRL | 800M parameters (logistic regression) | 6.4B impressions (Bing Ads click log) | 2400 s on 24 machines (12x speed-up)

Page 9

Forward looking - Microsoft Cognitive Toolkits

Page 10

How to advance large-scale machine learning

Algorithmic Innovation

• Machine learning algorithms themselves need to have sufficiently high efficiency and throughput.

• Existing designs/implementations of machine learning algorithms might not have considered this requirement; redesign/re-implementation might be needed.

System Innovation

• One needs to leverage the full power of distributed systems and pursue almost linear scale-out/speed-up.

• New distributed training paradigms need to be invented in order to resolve the bottlenecks of existing distributed machine learning systems.

Page 11

Evolution of Distributed ML

[Diagram: Iterative MapReduce (LDA, LR) → Parameter Server (deep learning, LDA, GBDT, LR) → Dataflow (deep learning), spanning synchronous to asynchronous updates, and data parallelism, model parallelism, and irregular parallelism.]

Page 12

Evolution of Distributed Machine Learning

Iterative MapReduce

• Uses MapReduce / AllReduce to synchronize parameters among workers
• Supports only synchronous updates
• Examples: Spark and other derived systems

[Diagram: local computation on each worker, followed by a synchronous update]

Page 13

Evolution of Distributed Machine Learning

Iterative MapReduce → Parameter Server

• Parameter-server (PS) based solutions were proposed to support:
  • Asynchronous updates
  • Different mechanisms for model aggregation, especially in an asynchronous manner
  • Model parallelism
• Examples: Google's DistBelief; Petuum; Multiverso PS

References: NIPS'12 DistBelief (Google), NIPS'13 Petuum (Eric Xing), OSDI'14 Parameter Server (Mu Li), Multiverso PS, etc. A minimal sketch of the pattern follows.
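To make the parameter-server pattern concrete, here is a minimal, self-contained Python sketch (illustrative only, not the Multiverso API; all names are hypothetical). Workers pull the latest parameters, compute gradients on local data, and push updates back with no barrier between them:

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global model; workers pull/push asynchronously."""
    def __init__(self, dim, eta=0.05):
        self.w = np.zeros(dim)
        self.eta = eta
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):          # async update: applied whenever it arrives,
        with self.lock:            # with no barrier across workers
            self.w -= self.eta * grad

def worker(ps, X, y, steps=200):
    for _ in range(steps):
        w = ps.pull()                          # may be stale by the time we push
        grad = X.T @ (X @ w - y) / len(y)      # toy least-squares gradient
        ps.push(grad)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(256, 10)), rng.normal(size=10)
y = X @ true_w
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=worker, args=(ps, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(ps.w - true_w))           # should be close to zero
```

Each worker races ahead on its own data; the staleness this introduces is exactly the "delayed gradient" problem addressed by DC-ASGD below.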

Page 14

Evolution of Distributed Machine Learning

Iterative MapReduce → Parameter Server → Dataflow

• Dataflow-based solutions were proposed to support:
  • Irregular parallelism (e.g., hybrid data and model parallelism), particularly in deep learning
  • Both high-level abstraction and low-level flexibility in implementation
• Example: Google's TensorFlow
• Task scheduling & execution are based on: 1. data dependency; 2. resource availability

References: TensorFlow; EuroSys'07 Dryad (Microsoft); NSDI'12 Spark (AMP Lab)

Page 15

Delay-Compensated ASGD: our work on system innovation

Page 16

Delayed Gradients

• Sequential SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_{t+τ})
• Async SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_t)

The delayed gradient can be Taylor-expanded around the stale parameters:

g(w_{t+τ}) = g(w_t) + ∇g(w_t) · (w_{t+τ} − w_t) + O(‖w_{t+τ} − w_t‖²)

where ∇g(w_t) corresponds to the Hessian matrix.

Page 17

Unbiased Efficient Approximation of Hessian Matrix

Theorem: Assume that Y is a discrete random variable with P(Y = k | X = x, w) = σ_k(x; w), where σ_k(x; w) < 1, ∀x, w, k = 1, …, K. Let L(x, y; w) = −Σ_k I_{y=k} log σ_k(x; w). Then we can prove that there exists a function φ such that

E_{Y|x,w}[ ∂²L(X, Y; w) / ∂w² ] = E_{Y|x,w}[ φ( ∂L(X, Y; w) / ∂w ) ]

For cross-entropy loss, the second-order derivatives can be derived from first-order derivatives in an unbiased manner.
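The claim can be checked numerically in a toy setting. In the sketch below the K logits are treated directly as the parameters w, in which case φ can be taken to be the outer product φ(g) = g g^T; this is a minimal verification under that assumption, not the paper's general construction:

```python
import numpy as np

# Tiny softmax model: sigma_k(w) = softmax(w)_k, L(y; w) = -log softmax(w)_y.
# Check: E_{Y|w}[Hessian of L] == E_{Y|w}[grad L . grad L^T].
K = 5
rng = np.random.default_rng(0)
w = rng.normal(size=K)
p = np.exp(w - w.max()); p /= p.sum()      # softmax probabilities

hess = np.diag(p) - np.outer(p, p)         # Hessian of -log p_y (y-independent)

# grad L(y; w) = p - e_y; take its expected outer product under Y ~ p
expected_ggT = sum(p[y] * np.outer(p - np.eye(K)[y], p - np.eye(K)[y])
                   for y in range(K))

print(np.allclose(hess, expected_ggT))     # True
```

This is what makes the compensation cheap: first-order gradients, which workers already compute, suffice to approximate the Hessian in expectation.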

Page 18

Delay Compensated ASGD (DC-ASGD)

DC-ASGD: w_{t+τ+1} = w_{t+τ} − η · [ g(w_t) + λ · φ(g(w_t)) ⊙ (w_{t+τ} − w_t) ]

ASGD: w_{t+τ+1} = w_{t+τ} − η · g(w_t)
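A minimal NumPy sketch of the server-side DC-ASGD step, assuming the cheap element-wise approximation φ(g) = g ⊙ g (one practical instantiation of the φ from the previous slide); variable names are illustrative:

```python
import numpy as np

def dc_asgd_update(w_server, g_stale, w_backup, eta=0.1, lam=0.04):
    """One server-side DC-ASGD update (sketch).

    w_server: current global parameters w_{t+tau}
    g_stale:  gradient g(w_t) a worker computed on stale parameters
    w_backup: the stale parameters w_t that worker pulled earlier
    The delay is compensated with phi(g) = g * g (element-wise):
    w <- w - eta * (g + lam * g * g * (w_server - w_backup)).
    """
    compensated = g_stale + lam * g_stale * g_stale * (w_server - w_backup)
    return w_server - eta * compensated

# Toy usage: the server has moved on since the worker pulled w_backup.
w_backup = np.zeros(3)
w_server = np.array([0.5, -0.2, 0.1])   # already updated by other workers
g_stale = np.array([1.0, 2.0, -1.0])    # gradient computed at w_backup
print(dc_asgd_update(w_server, g_stale, w_backup))
```

Note the server must remember (or the worker must send back) the parameter snapshot w_t each gradient was computed against; that drift term (w_{t+τ} − w_t) is what the plain ASGD update ignores.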

Page 19

Experimental Results (based on ResNet)

[Figure: convergence curves on CIFAR and ImageNet]

Page 20

2C-RNN: A Super-Efficient and Scalable Deep Algorithm for Text Understanding

Our work on algorithm innovation, published at NIPS 2016

Page 21

Recurrent Neural Networks for text applications

• A widely used model for sequence representation and learning:
  • Language modeling
  • Machine translation
  • Conversation bots
  • Image/video captioning

Major challenges: efficiency and scalability

Page 22

Language modeling

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

Symbol | Definition | Dimension
x_t (input) | embedding vector of the word at position t | w
U | parameter matrix: input → hidden state | h * w
W | parameter matrix: hidden state → hidden state | h * h
V | output embedding matrix: hidden state → output | |V| * h
y_t | predicted probability for each word | |V|

A toy forward pass is sketched below.
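The following NumPy sketch implements one forward step of this model, with σ taken to be tanh and sizes shrunk to toy values; at the slide's real scale (|V| ≈ 10M), the o_t = V h_t product is the bottleneck discussed on the next pages:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, w_dim, h_dim = 10_000, 256, 256         # toy vocabulary/embedding/hidden sizes

X = rng.normal(0, 0.01, (V_size, w_dim))        # input word embeddings
U = rng.normal(0, 0.01, (h_dim, w_dim))         # input -> hidden
W = rng.normal(0, 0.01, (h_dim, h_dim))         # hidden -> hidden
Vo = rng.normal(0, 0.01, (V_size, h_dim))       # hidden -> output (output embeddings)
b = np.zeros(h_dim)

def step(word_id, h_prev):
    x_t = X[word_id]                            # embedding of the current word
    h_t = np.tanh(U @ x_t + W @ h_prev + b)     # h_t = sigma(U x_t + W h_{t-1} + b)
    o_t = Vo @ h_t                              # o_t = V h_t: |V| x h multiply per token
    y_t = np.exp(o_t - o_t.max())
    return h_t, y_t / y_t.sum()                 # y_t = softmax(o_t)

h = np.zeros(h_dim)
for wid in [12, 345, 6789]:                     # a toy 3-word sequence
    h, y = step(wid, h)
```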

Page 23

Challenge in text applications: model size

• Large scale
• The model is too large for current GPU memory to hold

http://www.dmtk.io/word2vec.html

Symbol | Definition | Dimension | Memory size
x (input) | embedding vectors of all words | |V| * w | 10M * 1024 * 4B = 40 GB
U | parameter matrix: input → hidden state | h * w | 1024 * 1024 * 4B = 4 MB
W | parameter matrix: hidden state → hidden state | h * h | 1024 * 1024 * 4B = 4 MB
V | output embedding matrix: hidden state → output | |V| * h | 10M * 1024 * 4B = 40 GB
y_t | predicted probability for each word | |V| | 10M * 4B = 40 MB

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180

Page 24

Challenge in text applications: running time

• Large scale
• Huge computational complexity: to predict one word, we must score every word in the vocabulary

http://www.dmtk.io/word2vec.html

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

#operations per token | Operation unit
h_t: ~2 million | float operations
o_t: ~10 billion (|V| * h ≈ 10M * 1024) | float operations

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180

Page 25

Training time estimation on mainstream hardware
Dataset: ClueWeb09 (en), 143,820,387,816 tokens; vocabulary 10,784,180

Device | Computation (#cores, FLOPS) | Global memory capacity / bandwidth | Running time
Xeon Broadwell (14nm) | 2 x 20 cores, 0.736 TFLOPS | 8 x 32 GB DDR4 (256 GB) / 95 GB/s | 0.143T * 10G * 10 * 2 / 0.736T / 3600 / 24 / 365 ≈ 1232 years
GPU K40 (28nm) | 2880 cores, 5.0 TFLOPS (float32) | 12 GB GDDR5 / 288 GB/s | 0.143T * 10G * 10 * 2 / 5T / 3600 / 24 / 365 ≈ 181 years
GPU M40 (28nm) | 3072 cores, 6.8 TFLOPS (float32) | 24 GB GDDR5 / 288 GB/s | 0.143T * 10G * 10 * 2 / 6.8T / 3600 / 24 / 365 ≈ 133 years
GPU P100 (16nm) | 10.6 TFLOPS (float32), 21.2 TFLOPS (float16) | 16 GB HBM2 / 720 GB/s | 0.143T * 10G * 10 * 2 / 10.6T / 3600 / 24 / 365 ≈ 85 years

The estimate is (#tokens ≈ 0.143T) x (#operations per token ≈ 10G) x (#epochs = 10) x (2, for forward and backward propagation) / #FLOPS.
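The table's estimates can be reproduced with a few lines of Python; the results agree with the slide within rounding (the slide rounds the token count to 0.143T):

```python
# years = tokens * ops_per_token * epochs * 2 (fwd+bwd) / flops / seconds_per_year
def training_years(tokens, ops_per_token, epochs, flops):
    return tokens * ops_per_token * epochs * 2 / flops / (3600 * 24 * 365)

tokens = 143_820_387_816                            # ClueWeb09 (en)
print(training_years(tokens, 10e9, 10, 5.0e12))     # K40:  ~182 years
print(training_years(tokens, 10e9, 10, 10.6e12))    # P100: ~86 years
print(training_years(tokens, 10e6, 10, 5.0e12))     # 2C-RNN (Page 30) on K40: ~0.18 years
```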

Page 26

Big challenge for algorithm innovation and hardware manufacturing

• Key problem: huge vocabulary

(The memory-size table from Page 23 applies here unchanged: the |V| * w input embedding and |V| * h output embedding each take 40 GB at |V| = 10M.)

Page 27

Our proposal: 2-Component shared embedding (accepted to NIPS 2016)

Current practice: one embedding vector per word, e.g., x_1 = January, x_2 = February, …, x_15 = one, x_16 = two. This requires |V| vectors.

Our approach: arrange the vocabulary in a 2D table with row vectors x_1, x_2, … and column vectors y_1, y_2, …. Each word is identified by its cell: January = (x_1, y_1), February = (x_1, y_2), one = (x_2, y_1), two = (x_2, y_2).

2C: each word is partitioned and represented by two vectors (x, y). Shared embedding: x is shared by all words in the same row, and y by all words in the same column. This requires only 2 * sqrt(|V|) vectors. A toy lookup is sketched below.
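A toy sketch of the 2C lookup; the word-to-cell allocation shown here is illustrative (Page 31 describes how the real allocation is chosen):

```python
import numpy as np

V_size = 10_000_000
table_dim = int(np.ceil(np.sqrt(V_size)))      # ~3163 rows/columns
emb_dim = 1024

# 2 * sqrt(|V|) shared vectors instead of |V| per-word vectors.
row_emb = np.random.randn(table_dim, emb_dim).astype(np.float32) * 0.01
col_emb = np.random.randn(table_dim, emb_dim).astype(np.float32) * 0.01

# Each word is identified by its (row, column) position in the 2D table.
word_to_cell = {"January": (0, 0), "February": (0, 1),
                "one": (1, 0), "two": (1, 1)}  # illustrative allocation

def embed(word):
    r, c = word_to_cell[word]
    return row_emb[r], col_emb[c]              # the two components (x, y)

x, y = embed("February")   # shares x with "January", shares y with "two"
```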

Page 28

2C-RNN

[Figure: a standard RNN language model (bottom) reads one embedding x_t per word w_t from table X and predicts P(w_{t+1}) through output matrix Y. The 2C-RNN (top) alternates the two components of each word: the row input x_t^r (from table X_r) yields the column distribution P_c(w_t) through Y_c, and the column input x_t^c (from table X_c) yields the row distribution P_r(w_{t+1}) through Y_r, with the same U and W shared across all steps. Sequence shown: previous word → predicted current word → current word → predicted next word.]

Page 29

Analysis of model size

Symbol | Definition | Dimension | Memory size
x, y (input) | the two embedding components of the word at position t | 2 * sqrt(|V|) * w | 2 * (10M)^(1/2) * 1024 * 4B ≈ 25 MB
U_x, U_y | parameter matrices: input → hidden state | 2 * h * w | 2 * 1024 * 1024 * 4B = 8 MB
W_x, W_y | parameter matrices: hidden state → hidden state | 2 * h * h | 2 * 1024 * 1024 * 4B = 8 MB
V_x, V_y | output embedding matrices: hidden state → output | 2 * sqrt(|V|) * h | 2 * (10M)^(1/2) * 1024 * 4B ≈ 25 MB
y_t | predicted row/column probabilities | 2 * sqrt(|V|) | < 1 MB

Total model size: ~80 GB → ~70 MB. The recurrence is unchanged:

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)
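The "80 GB → 70 MB" totals can be reproduced from the two tables (a back-of-the-envelope sketch; byte counts use decimal GB/MB as on the slides):

```python
def mem(rows, cols, bytes_per_float=4):
    """Memory footprint of a rows x cols float32 matrix, in bytes."""
    return rows * cols * bytes_per_float

V, h, w = 10_000_000, 1024, 1024
sqrtV = round(V ** 0.5)                                  # ~3163

# Standard RNN LM: input + output embeddings over |V|, plus U, W, y_t.
big = 2 * mem(V, w) + mem(h, w) + mem(h, h) + mem(V, 1)
# 2C-RNN: row+column input and output embeddings over sqrt(|V|),
# doubled U and W, and the two sqrt(|V|)-sized output distributions.
small = 4 * mem(sqrtV, w) + 2 * mem(h, w) + 2 * mem(h, h) + mem(2 * sqrtV, 1)

print(f"standard RNN LM: {big / 1e9:.0f} GB")            # ~82 GB ("80G" on the slide)
print(f"2C-RNN:          {small / 1e6:.0f} MB")          # ~69 MB ("70M" on the slide)
```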

Page 30

Analysis of computational complexity: operations per token drop from ~10G to ~10M.

#operations per token | Operation unit
h_x^t, h_y^t: ~4M | float operations
o_x^t, o_y^t: 2 * sqrt(10M) * 1024 ≈ 6M | float operations

On a GPU K40 (28nm, 2880 cores, 5.0 TFLOPS float32, 12 GB GDDR5 / 288 GB/s):
0.143T * 10M * 10 * 2 / 5T / 3600 / 24 / 365 ≈ 0.18 years

Training time estimate: 181 years → 0.18 years.

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

Now we can easily parallelize training with the parameter server framework: with 20 machines, training takes 2 to 3 days.

Page 31

How to allocate words into the 2D table

• Cold start:
  • Partition rows according to word prefixes
  • Partition columns according to word suffixes
• Bootstrap:
  • Train with the current partition for several iterations
  • Adjust the partition based on training loss (a toy sketch of this reallocation follows)
  • Continue training

Examples: Billion, Million, Trillion (pattern xxxllion) go in the same column; sure, prepare, gre (pattern xxxxre) go in the same column; react, real, return (pattern rexxxxx) go in the same row.
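A toy sketch of the reallocation step, cast as the bipartite matching mentioned on Page 36. The cost matrix here is random stand-in data for the per-cell training losses; at |V| ≈ 10M an exact Hungarian solver would be infeasible, so in practice an approximate assignment would be needed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[w, p]: training loss word w would incur at table position p
# (row/column pair), accumulated over the last few training iterations.
n_words = 16                              # toy vocabulary: a 4x4 table
cost = np.random.rand(n_words, n_words)   # stand-in for measured losses

words, positions = linear_sum_assignment(cost)   # min-total-loss assignment
side = 4
allocation = {w: divmod(p, side) for w, p in zip(words, positions)}
# allocation[w] -> (row, column) of word w in the sqrt(|V|) x sqrt(|V|) table
```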

Page 32

Experimental results

Middle-sized dataset: 2013 ACL Workshop dataset

Model | PPL on test (ACLW-Spanish) | PPL on test (ACLW-French) | Model size
KN-4 [1] | 219 | 243 | --
MLBL [1] | 203 | 227 | --
LSTM word-in/word-out | 186 | 202 | 61 M
LSTM char-cnn-in/word-out [2] | 169 | 190 | 45 M
Our 2C-RNN [cold start] | 184 | 210 | 17 M
Our 2C-RNN [bootstrap] | 157 | 181 | 17 M

[1] Non-LSTM RNN baselines: http://jmlr.org/proceedings/papers/v32/botha14.pdf
[2] Previous state-of-the-art method using character-CNN input, Character-Aware Neural Language Models: http://arxiv.org/abs/1508.06615

Runtime to reach the same PPL as the HSM baseline:
Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 82 | 0.19%

Page 33

Experimental results

Large-scale dataset: One Billion Word benchmark

Model | PPL on test | Model size
KN-5 [1] | 68 | 2 G
HSM [2] | 85 | 1.6 G
Blackout-RNN [3] | 68 | 4.1 G
Our 2C-RNN [cold start] | 78 | 41 M
Our 2C-RNN [bootstrap] | 66 | 41 M
KN + HSM [2] | 56 | --
KN + Blackout-RNN [3] | 47 | --
KN + 2C-RNN | 43 | --

Runtime to reach the same PPL as the HSM baseline:
Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 70 | 2.36%

[1] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling: https://arxiv.org/abs/1312.3005
[2] Strategies for Training Large Vocabulary Neural Language Models: http://arxiv.org/abs/1512.04906
[3] BlackOut: Speeding Up Recurrent Neural Network Language Models with Very Large Vocabularies: https://arxiv.org/pdf/1511.06909v7.pdf

Page 34

Summary and forward looking

• DMTK includes innovations from both the system side and the algorithm side:
  • Excellent speed-up and widely available system integration
  • Advanced distributed optimization methods
  • Many world-leading algorithms
• Machine learning for distributed deep learning:
  • Learning how to acquire, select, and partition the data
  • Learning the optimal network structure
  • Learning how to perform model updates
  • Learning how to tune the hyper-parameters
  • Learning how to aggregate local models
• Create an AI that can automatically create new AI!

Page 35

Thanks! [email protected]

https://www.microsoft.com/en-us/research/people/taifengw/

DMTK materials: [email protected], http://www.dmtk.io, https://github.com/Microsoft/multiverso/wiki

Welcome to join our WeChat group, 分布式机器学习联盟 (Distributed Machine Learning Alliance), to discuss big data and artificial intelligence together.

Page 36

Bootstrap: bipartite graph matching

Page 37

Comparison

[Figure: class-based softmax. Words w_{1,1} … w_{k,k} are grouped under classes c_1 … c_k; a word is predicted by first choosing its class, then the word within the class.]

Method | Model size | Training time | Test time | Generation time
Standard softmax | O(|V| * w) | O(|V| * w) | O(|V| * w) | O(|V| * w)
Tree-based softmax | O(|V| * w) | O(log|V| * w) | O(log|V| * w) | O(|V| * w)
Class-based softmax | O(|V| * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(|V| * w)
Our 2C | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w)