
Page 1

DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Applications in Text Understanding

Taifeng Wang

Lead Researcher

Machine Learning Group, MSRA

2016 GTC China

Page 2

Microsoft Research Lab Locations

• Redmond, Washington, USA (Sep 1991)
• Cambridge, UK (July 1997)
• Beijing, China (Nov 1998)
• Bangalore, India (Jan 2005)
• Cambridge, Massachusetts, USA (2008)
• New York, USA (May 2012)

Page 3

Microsoft Research Asia

Technologies transferred into all major Microsoft products

Page 4

About DMTK (Distributed Machine Learning Toolkit)

http://dmtk.io

Released to GitHub on 2015.11.09 by the Machine Learning Group of MSRA.

https://github.com/Microsoft/DMTK

We focus on providing distributed machine learning infrastructure and algorithms to handle big-data and big-model learning tasks.

Page 5

DMTK User Engagement

• Within just one week after release (2015.11.10):
  • 1000+ stars and 200+ forks on GitHub
  • 1M+ visits to the DMTK homepage (www.dmtk.io)
  • 300K+ downloads of binary executables
• Major upgrade (2016.9)

Page 6

About DMTK: Roadmap

• 2015.11: Parameter Server 1.0; Distributed LightLDA; Distributed Word2Vec
• 2016.3: Parameter Server 2.0, with simpler SDK usage and system performance enhancements (e.g., memory & network cost reduction)
• 2016.6: Richer programming-language support for the parameter server (Python, Lua); connectors for Torch/Caffe/Theano
• 2016.9: Deep integration with CNTK; distributed logistic regression with online updates (FTRL); distributed gradient boosting decision tree (GBDT)
• 2016.12: Innovations in distributed optimization (DC-ASGD, ensemble models, accelerated optimization methods); model parallelism for deep learning models; 2C-RNN for text understanding; graph embedding

Page 7

Microsoft Distributed Machine Learning Toolkit (DMTK)

[Architecture overview:]
• Execution engines: YARN
• Multiverso Parameter Server:
  • Rich communication interfaces: MPI, ZeroMQ, RDMA, GPU Direct
  • Distributed synchronization mechanisms: MA / ADMM / BMUF, ASGD / DC-ASGD
  • Hybrid model store (matrix/tensor, hash table, tree) with model slicing
• Distributed machine learning algorithms: 2C-RNN, LightGBM, LightLDA, Distributed Word Embedding, Logistic Regression
• Parallelizes different machine learning toolkits: AzureML, CNTK, and other single-node DNN tools (Theano/Caffe/Torch)

Page 8

Workload supported

Workload | Model | Data | Training time
LightLDA | 20M vocab, 1M topics (largest topic model) | 200B tokens (Bing web chunk) | 60 hrs on 24 machines (nearly linear speed-up)
Word2Vec | 10M vocab, 1000 dims (largest word embedding) | 200B samples (Bing web chunk) | 40 hrs on 8 machines (nearly linear speed-up)
GBDT | 3000 trees (120 nodes each) | 7M records (Bing HRS data) | 3 hrs on 8 machines (4x speed-up)
LSTM | 20M parameters (4 hidden layers) | 1570 hrs of speech data (Windows Phone data) | 1 day on 16 GPUs (15.9x speed-up)
CNN | 41M parameters (GoogLeNet) | 2M images (ImageNet 1K dataset) | 30 hrs on 16 GPUs (10x speed-up)
Online FTRL | 800M parameters (logistic regression) | 6.4B impressions (Bing Ads click log) | 2400 s on 24 machines (12x speed-up)

Page 9

Forward looking - Microsoft Cognitive Toolkits

Page 10

How to advance large-scale machine learning

Algorithmic Innovation

• Machine learning algorithms themselves need to have sufficiently high efficiency and throughput.

• Existing designs/implementations of machine learning algorithms might not have considered this requirement; redesign/re-implementation might be needed.

System Innovation

• One needs to leverage the full power of distributed systems and pursue almost linear scale-out/speed-up.

• New distributed training paradigms need to be invented in order to resolve the bottlenecks of existing distributed machine learning systems.

Page 11

Evolution of Distributed ML

[Diagram: Iterative MapReduce (LDA, LR) → Parameter Server (deep learning, LDA, GBDT, LR) → Dataflow (deep learning), spanning synchronous to asynchronous updates, and data parallelism, model parallelism, and irregular parallelism.]

Page 12

Evolution of Distributed Machine Learning

Iterative MapReduce

• Uses MapReduce / AllReduce to synchronize parameters among workers
• Supports only synchronous updates
• Examples: Spark and other derived systems

[Diagram: local computation on each worker, followed by a synchronous update]

Page 13

Evolution of Distributed Machine Learning

Iterative MapReduce → Parameter Server

• Parameter-server (PS) based solutions were proposed to support:
  • Asynchronous updates
  • Different mechanisms for model aggregation, especially in an asynchronous manner
  • Model parallelism
• Examples: Google's DistBelief; Petuum; Multiverso PS

References: NIPS'12 DistBelief (Google), NIPS'13 Petuum (Eric Xing), OSDI'14 Parameter Server (Mu Li), Multiverso PS, etc. A minimal sketch of the pattern follows.
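To make the parameter-server pattern concrete, here is a minimal, self-contained Python sketch (illustrative only, not the Multiverso API; all names are hypothetical). Workers pull the latest parameters, compute gradients on local data, and push updates back with no barrier between them:

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global model; workers pull/push asynchronously."""
    def __init__(self, dim, eta=0.05):
        self.w = np.zeros(dim)
        self.eta = eta
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):          # async update: applied whenever it arrives,
        with self.lock:            # with no barrier across workers
            self.w -= self.eta * grad

def worker(ps, X, y, steps=200):
    for _ in range(steps):
        w = ps.pull()                          # may be stale by the time we push
        grad = X.T @ (X @ w - y) / len(y)      # toy least-squares gradient
        ps.push(grad)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(256, 10)), rng.normal(size=10)
y = X @ true_w
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=worker, args=(ps, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(ps.w - true_w))           # should be close to zero
```

Each worker races ahead on its own data; the staleness this introduces is exactly the "delayed gradient" problem addressed by DC-ASGD below.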

Page 14

Evolution of Distributed Machine Learning

Iterative MapReduce → Parameter Server → Dataflow

• Dataflow-based solutions were proposed to support:
  • Irregular parallelism (e.g., hybrid data and model parallelism), particularly in deep learning
  • Both high-level abstraction and low-level flexibility in implementation
• Example: Google's TensorFlow
• Task scheduling & execution are based on: 1. data dependency; 2. resource availability

References: TensorFlow; EuroSys'07 Dryad (Microsoft); NSDI'12 Spark (AMP Lab)

Page 15

Delay-Compensated ASGD: our work on system innovation

Page 16

Delayed Gradients

• Sequential SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_{t+τ})
• Async SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_t)

The delayed gradient can be Taylor-expanded around the stale parameters:

g(w_{t+τ}) = g(w_t) + ∇g(w_t) · (w_{t+τ} − w_t) + O(‖w_{t+τ} − w_t‖²)

where ∇g(w_t) corresponds to the Hessian matrix.

Page 17

Unbiased Efficient Approximation of Hessian Matrix

Theorem: Assume that Y is a discrete random variable with P(Y = k | X = x, w) = σ_k(x; w), where σ_k(x; w) < 1, ∀x, w, k = 1, …, K. Let L(x, y; w) = −Σ_k I_{y=k} log σ_k(x; w). Then we can prove that there exists a function φ such that

E_{Y|x,w}[ ∂²L(X, Y; w) / ∂w² ] = E_{Y|x,w}[ φ( ∂L(X, Y; w) / ∂w ) ]

For cross-entropy loss, the second-order derivatives can be derived from first-order derivatives in an unbiased manner.
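The claim can be checked numerically in a toy setting. In the sketch below the K logits are treated directly as the parameters w, in which case φ can be taken to be the outer product φ(g) = g g^T; this is a minimal verification under that assumption, not the paper's general construction:

```python
import numpy as np

# Tiny softmax model: sigma_k(w) = softmax(w)_k, L(y; w) = -log softmax(w)_y.
# Check: E_{Y|w}[Hessian of L] == E_{Y|w}[grad L . grad L^T].
K = 5
rng = np.random.default_rng(0)
w = rng.normal(size=K)
p = np.exp(w - w.max()); p /= p.sum()      # softmax probabilities

hess = np.diag(p) - np.outer(p, p)         # Hessian of -log p_y (y-independent)

# grad L(y; w) = p - e_y; take its expected outer product under Y ~ p
expected_ggT = sum(p[y] * np.outer(p - np.eye(K)[y], p - np.eye(K)[y])
                   for y in range(K))

print(np.allclose(hess, expected_ggT))     # True
```

This is what makes the compensation cheap: first-order gradients, which workers already compute, suffice to approximate the Hessian in expectation.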

Page 18

Delay Compensated ASGD (DC-ASGD)

DC-ASGD: w_{t+τ+1} = w_{t+τ} − η · [ g(w_t) + λ · φ(g(w_t)) ⊙ (w_{t+τ} − w_t) ]

ASGD: w_{t+τ+1} = w_{t+τ} − η · g(w_t)
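A minimal NumPy sketch of the server-side DC-ASGD step, assuming the cheap element-wise approximation φ(g) = g ⊙ g (one practical instantiation of the φ from the previous slide); variable names are illustrative:

```python
import numpy as np

def dc_asgd_update(w_server, g_stale, w_backup, eta=0.1, lam=0.04):
    """One server-side DC-ASGD update (sketch).

    w_server: current global parameters w_{t+tau}
    g_stale:  gradient g(w_t) a worker computed on stale parameters
    w_backup: the stale parameters w_t that worker pulled earlier
    The delay is compensated with phi(g) = g * g (element-wise):
    w <- w - eta * (g + lam * g * g * (w_server - w_backup)).
    """
    compensated = g_stale + lam * g_stale * g_stale * (w_server - w_backup)
    return w_server - eta * compensated

# Toy usage: the server has moved on since the worker pulled w_backup.
w_backup = np.zeros(3)
w_server = np.array([0.5, -0.2, 0.1])   # already updated by other workers
g_stale = np.array([1.0, 2.0, -1.0])    # gradient computed at w_backup
print(dc_asgd_update(w_server, g_stale, w_backup))
```

Note the server must remember (or the worker must send back) the parameter snapshot w_t each gradient was computed against; that drift term (w_{t+τ} − w_t) is what the plain ASGD update ignores.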

Page 19

Experimental Results (based on ResNet)

[Figure: convergence curves on CIFAR and ImageNet]

Page 20

2C-RNN: A Super-Efficient and Scalable Deep Algorithm for Text Understanding

Our work on algorithm innovation, published at NIPS 2016

Page 21

Recurrent Neural Networks for text applications

• A widely used model for sequence representation and learning:
  • Language modeling
  • Machine translation
  • Conversation bots
  • Image/video captioning

Major challenges: efficiency and scalability

Page 22

Language modeling

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

Symbol | Definition | Dimension
x_t (input) | embedding vector of the word at position t | w
U | parameter matrix: input → hidden state | h * w
W | parameter matrix: hidden state → hidden state | h * h
V | output embedding matrix: hidden state → output | |V| * h
y_t | predicted probability for each word | |V|

A toy forward pass is sketched below.
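The following NumPy sketch implements one forward step of this model, with σ taken to be tanh and sizes shrunk to toy values; at the slide's real scale (|V| ≈ 10M), the o_t = V h_t product is the bottleneck discussed on the next pages:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, w_dim, h_dim = 10_000, 256, 256         # toy vocabulary/embedding/hidden sizes

X = rng.normal(0, 0.01, (V_size, w_dim))        # input word embeddings
U = rng.normal(0, 0.01, (h_dim, w_dim))         # input -> hidden
W = rng.normal(0, 0.01, (h_dim, h_dim))         # hidden -> hidden
Vo = rng.normal(0, 0.01, (V_size, h_dim))       # hidden -> output (output embeddings)
b = np.zeros(h_dim)

def step(word_id, h_prev):
    x_t = X[word_id]                            # embedding of the current word
    h_t = np.tanh(U @ x_t + W @ h_prev + b)     # h_t = sigma(U x_t + W h_{t-1} + b)
    o_t = Vo @ h_t                              # o_t = V h_t: |V| x h multiply per token
    y_t = np.exp(o_t - o_t.max())
    return h_t, y_t / y_t.sum()                 # y_t = softmax(o_t)

h = np.zeros(h_dim)
for wid in [12, 345, 6789]:                     # a toy 3-word sequence
    h, y = step(wid, h)
```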

Page 23

Challenge in text applications: model size

• Large scale
• The model is too large for current GPU memory to hold

http://www.dmtk.io/word2vec.html

Symbol | Definition | Dimension | Memory size
x (input) | embedding vectors of all words | |V| * w | 10M * 1024 * 4B = 40 GB
U | parameter matrix: input → hidden state | h * w | 1024 * 1024 * 4B = 4 MB
W | parameter matrix: hidden state → hidden state | h * h | 1024 * 1024 * 4B = 4 MB
V | output embedding matrix: hidden state → output | |V| * h | 10M * 1024 * 4B = 40 GB
y_t | predicted probability for each word | |V| | 10M * 4B = 40 MB

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180

Page 24

Challenge in text applications: running time

• Large scale
• Huge computational complexity: to predict one word, we must score every word in the vocabulary

http://www.dmtk.io/word2vec.html

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

#operations per token | Operation unit
h_t: ~2 million | float operations
o_t: ~10 billion (|V| * h ≈ 10M * 1024) | float operations

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180

Page 25

Training time estimation on mainstream hardware
Dataset: ClueWeb09 (en), 143,820,387,816 tokens; vocabulary 10,784,180

Device | Computation (#cores, FLOPS) | Global memory capacity / bandwidth | Running time
Xeon Broadwell (14nm) | 2 x 20 cores, 0.736 TFLOPS | 8 x 32 GB DDR4 (256 GB) / 95 GB/s | 0.143T * 10G * 10 * 2 / 0.736T / 3600 / 24 / 365 ≈ 1232 years
GPU K40 (28nm) | 2880 cores, 5.0 TFLOPS (float32) | 12 GB GDDR5 / 288 GB/s | 0.143T * 10G * 10 * 2 / 5T / 3600 / 24 / 365 ≈ 181 years
GPU M40 (28nm) | 3072 cores, 6.8 TFLOPS (float32) | 24 GB GDDR5 / 288 GB/s | 0.143T * 10G * 10 * 2 / 6.8T / 3600 / 24 / 365 ≈ 133 years
GPU P100 (16nm) | 10.6 TFLOPS (float32), 21.2 TFLOPS (float16) | 16 GB HBM2 / 720 GB/s | 0.143T * 10G * 10 * 2 / 10.6T / 3600 / 24 / 365 ≈ 85 years

The estimate is (#tokens ≈ 0.143T) x (#operations per token ≈ 10G) x (#epochs = 10) x (2, for forward and backward propagation) / #FLOPS.
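The table's estimates can be reproduced with a few lines of Python; the results agree with the slide within rounding (the slide rounds the token count to 0.143T):

```python
# years = tokens * ops_per_token * epochs * 2 (fwd+bwd) / flops / seconds_per_year
def training_years(tokens, ops_per_token, epochs, flops):
    return tokens * ops_per_token * epochs * 2 / flops / (3600 * 24 * 365)

tokens = 143_820_387_816                            # ClueWeb09 (en)
print(training_years(tokens, 10e9, 10, 5.0e12))     # K40:  ~182 years
print(training_years(tokens, 10e9, 10, 10.6e12))    # P100: ~86 years
print(training_years(tokens, 10e6, 10, 5.0e12))     # 2C-RNN (Page 30) on K40: ~0.18 years
```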

Page 26

Big challenge for algorithm innovation and hardware manufacturing

• Key problem: huge vocabulary

(The memory-size table from Page 23 applies here unchanged: the |V| * w input embedding and |V| * h output embedding each take 40 GB at |V| = 10M.)

Page 27

Our proposal: 2-Component shared embedding (accepted to NIPS 2016)

Current practice: one embedding vector per word, e.g., x_1 = January, x_2 = February, …, x_15 = one, x_16 = two. This requires |V| vectors.

Our approach: arrange the vocabulary in a 2D table with row vectors x_1, x_2, … and column vectors y_1, y_2, …. Each word is identified by its cell: January = (x_1, y_1), February = (x_1, y_2), one = (x_2, y_1), two = (x_2, y_2).

2C: each word is partitioned and represented by two vectors (x, y). Shared embedding: x is shared by all words in the same row, and y by all words in the same column. This requires only 2 * sqrt(|V|) vectors. A toy lookup is sketched below.
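A toy sketch of the 2C lookup; the word-to-cell allocation shown here is illustrative (Page 31 describes how the real allocation is chosen):

```python
import numpy as np

V_size = 10_000_000
table_dim = int(np.ceil(np.sqrt(V_size)))      # ~3163 rows/columns
emb_dim = 1024

# 2 * sqrt(|V|) shared vectors instead of |V| per-word vectors.
row_emb = np.random.randn(table_dim, emb_dim).astype(np.float32) * 0.01
col_emb = np.random.randn(table_dim, emb_dim).astype(np.float32) * 0.01

# Each word is identified by its (row, column) position in the 2D table.
word_to_cell = {"January": (0, 0), "February": (0, 1),
                "one": (1, 0), "two": (1, 1)}  # illustrative allocation

def embed(word):
    r, c = word_to_cell[word]
    return row_emb[r], col_emb[c]              # the two components (x, y)

x, y = embed("February")   # shares x with "January", shares y with "two"
```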

Page 28

2C-RNN

[Figure: a standard RNN language model (bottom) reads one embedding x_t per word w_t from table X and predicts P(w_{t+1}) through output matrix Y. The 2C-RNN (top) alternates the two components of each word: the row input x_t^r (from table X_r) yields the column distribution P_c(w_t) through Y_c, and the column input x_t^c (from table X_c) yields the row distribution P_r(w_{t+1}) through Y_r, with the same U and W shared across all steps. Sequence shown: previous word → predicted current word → current word → predicted next word.]

Page 29

Analysis of model size

Symbol | Definition | Dimension | Memory size
x, y (input) | the two embedding components of the word at position t | 2 * sqrt(|V|) * w | 2 * (10M)^(1/2) * 1024 * 4B ≈ 25 MB
U_x, U_y | parameter matrices: input → hidden state | 2 * h * w | 2 * 1024 * 1024 * 4B = 8 MB
W_x, W_y | parameter matrices: hidden state → hidden state | 2 * h * h | 2 * 1024 * 1024 * 4B = 8 MB
V_x, V_y | output embedding matrices: hidden state → output | 2 * sqrt(|V|) * h | 2 * (10M)^(1/2) * 1024 * 4B ≈ 25 MB
y_t | predicted row/column probabilities | 2 * sqrt(|V|) | < 1 MB

Total model size: ~80 GB → ~70 MB. The recurrence is unchanged:

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)
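The "80 GB → 70 MB" totals can be reproduced from the two tables (a back-of-the-envelope sketch; byte counts use decimal GB/MB as on the slides):

```python
def mem(rows, cols, bytes_per_float=4):
    """Memory footprint of a rows x cols float32 matrix, in bytes."""
    return rows * cols * bytes_per_float

V, h, w = 10_000_000, 1024, 1024
sqrtV = round(V ** 0.5)                                  # ~3163

# Standard RNN LM: input + output embeddings over |V|, plus U, W, y_t.
big = 2 * mem(V, w) + mem(h, w) + mem(h, h) + mem(V, 1)
# 2C-RNN: row+column input and output embeddings over sqrt(|V|),
# doubled U and W, and the two sqrt(|V|)-sized output distributions.
small = 4 * mem(sqrtV, w) + 2 * mem(h, w) + 2 * mem(h, h) + mem(2 * sqrtV, 1)

print(f"standard RNN LM: {big / 1e9:.0f} GB")            # ~82 GB ("80G" on the slide)
print(f"2C-RNN:          {small / 1e6:.0f} MB")          # ~69 MB ("70M" on the slide)
```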

Page 30

Analysis of computational complexity: operations per token drop from ~10G to ~10M.

#operations per token | Operation unit
h_x^t, h_y^t: ~4M | float operations
o_x^t, o_y^t: 2 * sqrt(10M) * 1024 ≈ 6M | float operations

On a GPU K40 (28nm, 2880 cores, 5.0 TFLOPS float32, 12 GB GDDR5 / 288 GB/s):
0.143T * 10M * 10 * 2 / 5T / 3600 / 24 / 365 ≈ 0.18 years

Training time estimate: 181 years → 0.18 years.

h_t = σ(U x_t + W h_{t−1} + b)
o_t = V h_t
y_t = softmax(o_t)

Now we can easily parallelize training with the parameter server framework: with 20 machines, training takes 2 to 3 days.

Page 31

How to allocate words into the 2D table

• Cold start:
  • Partition rows according to word prefixes
  • Partition columns according to word suffixes
• Bootstrap:
  • Train with the current partition for several iterations
  • Adjust the partition based on training loss (a toy sketch of this reallocation follows)
  • Continue training

Examples: Billion, Million, Trillion (pattern xxxllion) go in the same column; sure, prepare, gre (pattern xxxxre) go in the same column; react, real, return (pattern rexxxxx) go in the same row.
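A toy sketch of the reallocation step, cast as the bipartite matching mentioned on Page 36. The cost matrix here is random stand-in data for the per-cell training losses; at |V| ≈ 10M an exact Hungarian solver would be infeasible, so in practice an approximate assignment would be needed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[w, p]: training loss word w would incur at table position p
# (row/column pair), accumulated over the last few training iterations.
n_words = 16                              # toy vocabulary: a 4x4 table
cost = np.random.rand(n_words, n_words)   # stand-in for measured losses

words, positions = linear_sum_assignment(cost)   # min-total-loss assignment
side = 4
allocation = {w: divmod(p, side) for w, p in zip(words, positions)}
# allocation[w] -> (row, column) of word w in the sqrt(|V|) x sqrt(|V|) table
```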

Page 32

Experimental results

Middle-sized dataset: 2013 ACL Workshop dataset

Model | PPL on test (ACLW-Spanish) | PPL on test (ACLW-French) | Model size
KN-4 [1] | 219 | 243 | --
MLBL [1] | 203 | 227 | --
LSTM word-in/word-out | 186 | 202 | 61 M
LSTM char-cnn-in/word-out [2] | 169 | 190 | 45 M
Our 2C-RNN [cold start] | 184 | 210 | 17 M
Our 2C-RNN [bootstrap] | 157 | 181 | 17 M

[1] Non-LSTM RNN baselines: http://jmlr.org/proceedings/papers/v32/botha14.pdf
[2] Previous state-of-the-art method using character-CNN input, Character-Aware Neural Language Models: http://arxiv.org/abs/1508.06615

Runtime to reach the same PPL as the HSM baseline:
Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 82 | 0.19%

Page 33

Experimental results

Large-scale dataset: One Billion Word benchmark

Model | PPL on test | Model size
KN-5 [1] | 68 | 2 G
HSM [2] | 85 | 1.6 G
Blackout-RNN [3] | 68 | 4.1 G
Our 2C-RNN [cold start] | 78 | 41 M
Our 2C-RNN [bootstrap] | 66 | 41 M
KN + HSM [2] | 56 | --
KN + Blackout-RNN [3] | 47 | --
KN + 2C-RNN | 43 | --

Runtime to reach the same PPL as the HSM baseline:
Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 70 | 2.36%

[1] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling: https://arxiv.org/abs/1312.3005
[2] Strategies for Training Large Vocabulary Neural Language Models: http://arxiv.org/abs/1512.04906
[3] BlackOut: Speeding Up Recurrent Neural Network Language Models with Very Large Vocabularies: https://arxiv.org/pdf/1511.06909v7.pdf

Page 34

Summary and forward looking

• DMTK includes innovations from both the system side and the algorithm side:
  • Excellent speed-up and widely available system integration
  • Advanced distributed optimization methods
  • Many world-leading algorithms
• Machine learning for distributed deep learning:
  • Learning how to acquire, select, and partition the data
  • Learning the optimal network structure
  • Learning how to perform model updates
  • Learning how to tune the hyper-parameters
  • Learning how to aggregate local models
• Create an AI that can automatically create new AI!

Page 35

Thanks! [email protected]

https://www.microsoft.com/en-us/research/people/taifengw/

DMTK materials: [email protected], http://www.dmtk.io, https://github.com/Microsoft/multiverso/wiki

Welcome to join our WeChat group, 分布式机器学习联盟 (Distributed Machine Learning Alliance), to discuss big data and artificial intelligence together.

Page 36

Bootstrap: bipartite graph matching

Page 37

Comparison

[Figure: class-based softmax. Words w_{1,1} … w_{k,k} are grouped under classes c_1 … c_k; a word is predicted by first choosing its class, then the word within the class.]

Method | Model size | Training time | Test time | Generation time
Standard softmax | O(|V| * w) | O(|V| * w) | O(|V| * w) | O(|V| * w)
Tree-based softmax | O(|V| * w) | O(log|V| * w) | O(log|V| * w) | O(|V| * w)
Class-based softmax | O(|V| * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(|V| * w)
Our 2C | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w) | O(sqrt(|V|) * w)