
Electrical and Computer Engineering, Seoul National University

http://data.snu.ac.kr

Sungroh Yoon

Toward Scalable Deep Learning

Korean Institute of Information Scientists and Engineers | Artificial Intelligence Society | Machine Learning Research Group, 2nd Deep Learning Workshop | 2015.10.16

Breakthrough: Big Data + Machine Learning

Andrew Ng

Daphne Koller

• Training challenges

• T. Chilimbi et al. (OSDI 2014)

Shadow beyond the Revolution

http://blogs.nvidia.com/blog/2015/03/17/digits-devbox/

< MNIST test >

Ciresan, Dan Claudiu, et al. "Deep, big, simple neural nets for handwritten digit recognition." Neural computation 22.12 (2010): 3207-3220.

Machine Learning

Representations

Training

Graphical models | Nonparametric Bayesian models | Sparse structured input/output regression | Deep learning | ···

• Representation + Training

• Basic form of ML

• Iterative-convergent

Parallelism in Machine Learning

F(D, θ) = L(D, θ) + r(θ)

θ_{t+1} = θ_t + Δθ(D)

<Figure: data parallelism computes Δθ(D_1), Δθ(D_2), ..., Δθ(D_n) on data partitions; model parallelism computes Δθ_1(D), Δθ_2(D), ..., Δθ_m(D) on model partitions (see the sketch below)>

E. Xing & Q. Ho., 2015 KDD Tutorial
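A minimal sketch of the two parallelization styles above, assuming a toy least-squares objective; all function names and data are illustrative, not from the tutorial.

```python
# Minimal sketch: one iterative-convergent update θ_{t+1} = θ_t + Δθ(D),
# expressed in a data-parallel and a model-parallel way. Uses numpy only.
import numpy as np

def delta(theta, D):
    """Toy least-squares gradient step Δθ(D) for a data shard D = (X, y)."""
    X, y = D
    return -0.01 * X.T @ (X @ theta - y) / len(y)

def data_parallel_step(theta, shards):
    # Each shard D_i yields Δθ(D_i); the partial updates are averaged globally.
    updates = [delta(theta, D_i) for D_i in shards]      # done on n workers
    return theta + np.mean(updates, axis=0)              # θ_{t+1} = θ_t + Δθ(D)

def model_parallel_step(theta, D, blocks):
    # Each worker owns a slice of θ and applies only Δθ_j(D) for its block.
    full = delta(theta, D)
    for idx in blocks:                                   # done on m workers
        theta[idx] = theta[idx] + full[idx]
    return theta

# Usage: 4 data shards vs. 2 parameter blocks over the same toy problem.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 10)), rng.normal(size=400)
theta = np.zeros(10)
shards = [(X[i::4], y[i::4]) for i in range(4)]
theta = data_parallel_step(theta, shards)
theta = model_parallel_step(theta, (X, y), [np.arange(0, 5), np.arange(5, 10)])
```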

• Far more complex and larger than conventional ML models

– A large number of model parameters to learn

– Many (mostly simple) computations with latent variables

– Needs scaling up/out computation & numerical optimization

Deep Learning: A New Learning Paradigm

• Minimize computation [Bengio, 2014]

– Improve (reduce) the ratio of

• # OF COMPUTATIONS / # OF PARAMETERS

– Extreme success story (but poor generalization): decision trees

• O(n) computations for O(2^n) parameters

– Extreme unlucky case: deep neural nets

• O(n) computations for O(n) parameters

• Example

– Conditional computation (Bengio, 2014)

Dealing with the Challenges (1)

• Scale-up approaches

– Enhanced single machine performance

– Organized in SIMD blocks

• 10-fold to 100-fold speed-up

– Stuck with memory constraints!

Dealing with the Challenges (2)

Co-processor/Accelerator (SIMD, GPGPU, …)

Learning workload

• Scale-out approaches

– Can handle enormous size of data or model

– Split the entire workload using

• Data parallelism

• Model parallelism

– Parameter communication issues!

Dealing with the Challenges (3)

Learning workload

Distributed System (GraphLab, Hadoop, Spark, ...)

• Spark

– RDD-based programming model

– ML library (includes deep learning)

• GraphLab

– Gather-Apply-Scatter programming model

– Large-scale graph mining

• Petuum

– Key-value store + scheduler

– General-purpose large-scale ML

Notable ML Platforms

• GPU-based (scale-up)

• Distributed (scale-out)

Notable ML Platforms

DistBelief [J. Dean et al., "Large scale distributed deep networks," NIPS 2012]

Project Adam [T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014]

Keras

• DistBelief: supports both data and model parallelism

Recent Technological Trends

J. Dean et al. "Large scale distributed deep networks," NIPS 2012

• GPU-accelerated library of primitives for DNN

• Used by frameworks such as Caffe, Theano, …

– Ex) cuDNN (v3) vs. cuDNN (v2) on Caffe

Recent Technological Trends

https://developer.nvidia.com/cudnn

• Open-source, distributed, commercial-grade DL framework

– DeepLearning4j

– ND4J (Scientific computing library for JVM)

• Scalable backend

– Apache Hadoop and Spark

– GPUs

• Partners

Recent Technological Trends

• Large-scale distributed machine learning

– Considers both data and model parallelism

– Key-value store + dynamic scheduler

Recent Technological Trends

http://petuum.github.io/

• REEF (Retainable Evaluator Execution Framework)

• An Apache incubator project

• Package a variety of data-processing libraries in a reusable form

– MapReduce, query, graph processing and stream data processing

Recent Technological Trends

REEF introduction (http://www.reef-project.org/welcome/)

Scalable Deep Learning Techniques

Examples of distributed schemes

1) Data parallelism

• Hogwild! (B. Recht et al., NIPS 2011)

• Downpour SGD (J. Dean et al., NIPS 2012), Dogwild! (C. Noel et al., 2014)

2) Parameter Server (M. Li, et al., NIPS 2013)

3) Model parallelism (STRADS) (S. Lee, et al., NIPS 2014)

4) Acceleration with GPUs (CUDA convnet)

https://singa.incubator.apache.org/docs/frameworks.html

• Based on the independence between data samples

– Leads to concurrent execution on each data partition, speeding up training

Data Parallelism

<Figure: training data (samples × attributes) split across Worker 1, Worker 2, and Worker 3, followed by model aggregation>

• Asynchronous running; Don’t Lock! Don’t Communicate!

– For each processor, calculate gradients independently

• Processors can overwrite each other's work

Data Parallelism : Hogwild!

Y. Nishioka, “Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures”, IPDPS 2015

• Guarantees a reasonable convergence rate

– Exploits sparsity

• Outperforms traditional synchronized techniques (e.g., for SVMs) even on non-sparse problems (see the sketch below)

Data Parallelism : Hogwild!

B. Recht et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011
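A minimal Hogwild!-style sketch in Python: several threads update a shared weight vector with no lock, so they may overwrite each other's work. Python's GIL means this only illustrates the idea rather than delivering real parallel speed-up; the data and learning rate are made up.

```python
# Minimal Hogwild!-style sketch (illustrative, not the authors' code):
# several threads update a shared weight vector without any lock.
import threading
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
w = np.zeros(20)            # shared parameters, no lock around them
LR = 0.01

def worker(indices):
    for i in indices:
        grad = (X[i] @ w - y[i]) * X[i]   # per-sample gradient
        w[:] = w - LR * grad              # racy in-place update (overwrites allowed)

threads = [threading.Thread(target=worker, args=(range(t, 1000, 4),))
           for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```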

• Hogwild: designed for shared-memory machines

– Limited scalability

• Extends the concept of Hogwild! to distributed systems

– Asynchronously update gradients at a master or parameter server

– Ex) Downpour SGD & Dogwild! (= distributed Hogwild!)

Data Parallelism : Downpour SGD, Dogwild!

J. Dean et al., "Large scale distributed deep networks," NIPS 2012

• Parameter server

– Widely used concept for distributed machine learning

– Separate servers for parameters in the model

• Key features (Li et al., 2013)

– Efficient communication

– Flexible consistency models

– Elastic scalability

– Fault tolerance and durability

– Ease of Use

Parameter Server

M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013

• Model

– Usually expressed as a vector or an array

– Sparse data & linear model

• Not all parameters are used to calculate gradients

• Key–value vector

– (w_1, w_2, ⋯, w_n) → {(i, w_i) | i ∈ Feature, w_i ∈ Weight}

– Used to transmit only the parameters that workers need

– Example (see also the sketch below):

Parameter Server : Key–Value Vector

(w_1, w_2, w_3, w_4) → (1, w_1), (2, w_2), (3, w_3), (4, w_4)
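A minimal sketch (assumed dict-based layout, not the paper's implementation) of the key-value view: only the (feature, weight) pairs a worker actually needs are pulled and pushed.

```python
# Minimal sketch (illustrative): a dense weight vector as a key-value vector,
# so a worker only pulls/pushes the features that appear in its data shard.
dense = [0.5, -1.2, 0.0, 3.4]                       # (w_1, w_2, w_3, w_4)

kv = {i + 1: w for i, w in enumerate(dense)}        # {1: w_1, 2: w_2, 3: w_3, 4: w_4}

def pull(keys):
    """Return only the requested (feature, weight) pairs."""
    return {k: kv[k] for k in keys}

def push(updates, lr=1.0):
    """Apply sparse updates sent back by a worker."""
    for k, g in updates.items():
        kv[k] += lr * g

worker_view = pull([1, 4])        # this worker's shard only uses features 1 and 4
push({1: -0.1, 4: 0.2})
```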

• Server node

– Data: a partition of the globally shared parameters

• Worker node

– Data: a portion of the training data

– Task: local computation (e.g., computing gradients on its data portion)

• Push

– Direction: Worker → Server

– Data: calculated update values

• Pull

– Direction: Server → Worker

– Data: updated parameters (interface sketched below)

Parameter Server : Interface

M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013
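A minimal sketch of the interface above, with an assumed ServerNode/WorkerNode split rather than the actual API of Li et al.: the server holds a parameter partition, the worker pulls the keys it needs, computes an update on its data portion, and pushes the update back.

```python
# Minimal sketch (illustrative) of the parameter-server push/pull interface.
import numpy as np

class ServerNode:
    def __init__(self, n_params):
        self.params = np.zeros(n_params)      # partition of globally shared parameters

    def pull(self, keys):                     # Server -> Worker: updated parameters
        return {k: self.params[k] for k in keys}

    def push(self, updates):                  # Worker -> Server: calculated update values
        for k, delta in updates.items():
            self.params[k] += delta

class WorkerNode:
    def __init__(self, X, y, server):
        self.X, self.y, self.server = X, y, server   # portion of the training data

    def step(self, lr=0.01):
        keys = list(range(self.X.shape[1]))
        pulled = self.server.pull(keys)
        w = np.array([pulled[k] for k in keys])
        grad = self.X.T @ (self.X @ w - self.y) / len(self.y)
        self.server.push({k: -lr * grad[k] for k in keys})

rng = np.random.default_rng(0)
server = ServerNode(5)
workers = [WorkerNode(rng.normal(size=(50, 5)), rng.normal(size=50), server)
           for _ in range(3)]
for wk in workers:
    wk.step()
```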

• Partition

Parameter Server : Data & Model Partition

<Figure: training data and the model are partitioned along the sample and dimension axes; server nodes hold model partitions, worker nodes hold data partitions and exchange updates via push/pull>

• STRADS (Lee et al., 2014)

– STRucture-Aware Dynamic Scheduler

– Parameter server with dynamic scheduler

• Chooses a set of parameters which can be updated in parallel

• Parameters are not transmitted between masters and workers

STRADS

S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014

• Basic execution order: Schedule → Push → Pull (see the sketch below)

• Schedule

– Subject: Master

– Task

• Pick sets of model parameters that can be safely updated in parallel

• Push

– Subject: Master

– Tasks

• Dispatch computation jobs via the coordinator to the workers

• Execute push to compute partial updates for each parameter

• Pull

– Subject: key-value store

– Tasks

• Aggregate the partial updates

• Keep newly updated parameters

STRADS : Execution
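A minimal single-process sketch of the Schedule → Push → Pull cycle; the scheduler here just alternates between two disjoint parameter blocks, standing in for STRADS's structure-aware dependency checking, and the worker updates are random placeholders.

```python
# Minimal sketch (illustrative) of a STRADS-style Schedule -> Push -> Pull cycle.
import numpy as np

rng = np.random.default_rng(0)
params = {i: 0.0 for i in range(8)}          # key-value store

def schedule(round_idx):
    # Master: pick a set of parameters that can be safely updated in parallel
    # (here simply alternating disjoint blocks, not real dependency analysis).
    return [k for k in params if k % 2 == round_idx % 2]

def push(keys):
    # Workers: compute partial updates for each scheduled parameter.
    return [{k: rng.normal(scale=0.1) for k in keys} for _ in range(3)]  # 3 workers

def pull(partial_updates):
    # Key-value store: aggregate the partial updates and keep the new parameters.
    for update in partial_updates:
        for k, delta in update.items():
            params[k] += delta / len(partial_updates)

for t in range(4):
    pull(push(schedule(t)))
```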

• Performance advantages of STRADS

– Faster convergence

– Larger model size

STRADS : Performance

<Latent Dirichlet Allocation> <Matrix Factorization>

S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014

• Fast C++/CUDA implementation of convolutional neural networks

• Supports multiple-GPU training

CUDA-convnet

A. Krizhevsky, 2012

<Figure: convolutional layers and fully-connected layers partitioned across GPU1 and GPU2>

• New features (wrt cuda-convnet):

– Improved training time

– Enhanced data parallelism, model parallelism, and hybrids

CUDA-convnet2

Possible parallelizing schemes:

(a) Computing fully-connected activities after assembling a big batch from the last-stage conv-layer activities.

(b) Each worker sends its last-stage conv-layer activities to all the other workers in turn. In parallel with the feedforward & backprop computation, the next worker transfers its activities.

(c) All of the workers send #examples/K of their conv-layer activities to all other workers; the workers then proceed as in (b).

A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, 2014.

CUDA-convnet2 : Model Parallelism (Fully Connected Layers)

A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, 2014
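A minimal sketch of the fully-connected model parallelism described above, with two simulated workers instead of real GPUs: the FC weight matrix is split column-wise so each worker computes activations for its own slice of output units.

```python
# Minimal sketch (illustrative, not cuda-convnet2 code): model parallelism for a
# fully-connected layer. Each simulated "GPU" owns half of the output units.
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(128, 512))            # conv-layer activities assembled into a big batch
W = rng.normal(size=(512, 256)) * 0.01         # full FC weight matrix
W_gpu1, W_gpu2 = W[:, :128], W[:, 128:]        # column-wise split across two workers

def fc_forward(x, w_slice):
    return np.maximum(x @ w_slice, 0.0)        # ReLU on this worker's output slice

out = np.concatenate([fc_forward(batch, W_gpu1),   # would run on GPU 1
                      fc_forward(batch, W_gpu2)],  # would run on GPU 2
                     axis=1)                       # (128, 256) full activations
```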

• Open framework, models, and worked examples for deep learning

– Pure C++/CUDA architecture for deep learning (Python, Matlab interfaces)

– Fast, well-tested code

– Tools, reference models, demos, and recipes

– Seamless switch between CPU and GPU

• Applications

– Object classification

– Learning semantic features

– Object detection

– Sequences

– Reinforcement learning

– Speech + text

Caffe

https://docs.google.com/presentation/d/1UeKXVgRvvxg9OUdh_UiC5G71UMscNPlvArsWER41PsU/edit#slide=id.gc2fcdcce7_216_0

• LeNet

– A network is a set of layers and their connections

– Caffe creates and checks the net from the definition

Caffe : Example

Layers: defined in a plain-text schema, not code

LeNet

• Performance

– 2 ms / image on K40 GPU

– <1 ms inference with Caffe + cuDNN v2 on Titan X

– 72 million images per day with batched IO

• Pros

– Fast way to apply deep neural networks

– Supports GPUs

– Many common and new functions are supported

– Python and Matlab binding

• Cons

– Only a few input formats and only one output format (HDF5)

Caffe : Pros and Cons

• Introduced by the Google Brain research team

– J. Dean et al., "Large scale distributed deep networks," NIPS 2012

• Uses large-scale clusters to distribute training and inference

– Exploits both data & model parallelism

– Distributed optimization algorithms using a parameter server

• Downpour SGD

• Sandblaster L-BFGS

• Trains a DNN w/ billions of params using tens of thousands of CPU cores

– Capable of training a deep network 30x larger than previously reported

– State-of-the-art performance on ImageNet^1) (as of 2012)

– Faster than a GPU on modestly sized deep networks

DistBelief

1) An image database w/ 16M images and 20K categories; the trained network had over 1B params

DistBelief : Partition Model Across Machines

J. Dean et al., "Large scale distributed deep networks," NIPS 2012

DistBelief : Asynchronous Distributed SGD

• Asynchronous communication on partitioned data

• Utilization of parameter server

temp_j^(3) = ∂/∂w_j Σ_{i=201}^{300} (h_w(x^(i)) − y^(i))^2

Computes the gradient on partial data (x^(201), y^(201)), ..., (x^(300), y^(300)) (see the numpy sketch below)

J. Dean, et al. "Large scale distributed deep networks." NIPS 2012.
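A small numpy sketch of the equation above, assuming a linear hypothesis h_w(x) = w·x: model replica 3 computes its gradient contribution only on examples 201-300 of the data.

```python
# Minimal numpy sketch (illustrative) of the partial-data gradient above:
# temp_j = d/dw_j of sum_{i=201}^{300} (h_w(x^(i)) - y^(i))^2 on replica 3's shard.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
w = np.zeros(10)

def h(w, X):                      # assumed linear hypothesis h_w(x)
    return X @ w

shard = slice(200, 300)           # examples x^(201) ... x^(300) (0-based indexing)
residual = h(w, X[shard]) - y[shard]
temp = 2.0 * X[shard].T @ residual   # vector of temp_j, one entry per weight w_j
```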

DistBelief : Downpour SGD

• Asynchronous distributed SGD

• Robust to machine failures

• Introduces additional stochasticity

• Adagrad

– Adaptive learning rate

– Improve robustness and scalability

1. Each model replica asynchronously fetches parameters from the parameter server

2. Runs SGD on its local data shard

3. Asynchronously pushes gradients to the parameter server (see the sketch below)

J. Dean et al., "Large scale distributed deep networks," NIPS 2012
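A minimal, single-process sketch of a Downpour-style replica loop under assumed names (ParamServer, replica): parameters are fetched every n_fetch steps, gradients are accumulated and pushed every n_push steps, and the server applies an Adagrad-scaled update.

```python
# Minimal sketch (illustrative, serialized in one process) of a Downpour-SGD-style loop.
import numpy as np

rng = np.random.default_rng(0)

class ParamServer:
    def __init__(self, dim, lr=0.1, eps=1e-8):
        self.w, self.g2, self.lr, self.eps = np.zeros(dim), np.zeros(dim), lr, eps

    def fetch(self):
        return self.w.copy()

    def push(self, grad):                      # Adagrad: per-coordinate adaptive rate
        self.g2 += grad ** 2
        self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)

def replica(server, X, y, steps=100, n_fetch=5, n_push=5):
    w, acc = server.fetch(), np.zeros_like(server.w)
    for t in range(steps):
        i = rng.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]        # local SGD on this replica's shard
        w -= 0.01 * grad
        acc += grad
        if (t + 1) % n_push == 0:
            server.push(acc); acc[:] = 0.0     # asynchronous push (serialized here)
        if (t + 1) % n_fetch == 0:
            w = server.fetch()                 # asynchronous fetch

server = ParamServer(dim=10)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
replica(server, X, y)
```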

• Optimizes and balances computation and communication

– Exploits model parallelism

– Minimized memory bandwidth and communication overhead

• Achieves high performance and scalability

– Also with accuracy improvement

– Multi-threaded model parameter updates without locks

– Asynchronous batched parameter updates

• Supports training any combination of

– Stacked convolutional and fully-connected network layers

Adam

• On a single machine: multi-threaded training

– Fast weight updates without locks (similar to Hogwild!)

• Multiple machines: model partitioning

– Reduces memory copies (= data transfer) using its own network library

– Optimizes the memory system: L3 cache, cache locality

– Uses vector processing units for matrix multiplication

– Asynchrony mitigates speed variance across machines

– Asynchronous updates with a global parameter server (see the sketch below)

Adam : Architecture

T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014
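A minimal sketch of the asynchronous batched-update idea, not Project Adam's actual system: a worker accumulates weight updates locally and applies them to the shared model only once per batch, cutting update traffic.

```python
# Minimal sketch (illustrative, single process) of asynchronous batched parameter
# updates: gradients are accumulated locally and applied to the shared model in batches.
import numpy as np

rng = np.random.default_rng(0)
shared_w = np.zeros(10)                         # globally shared parameters
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

def worker(indices, batch=20, lr=0.01):
    global shared_w
    local_update = np.zeros_like(shared_w)
    for n, i in enumerate(indices, start=1):
        grad = (X[i] @ shared_w - y[i]) * X[i]
        local_update -= lr * grad               # accumulate instead of pushing each step
        if n % batch == 0:
            shared_w = shared_w + local_update  # one batched parameter update
            local_update[:] = 0.0

worker(range(200))
```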

• Applications: MNIST / ImageNet

• 120 machines: 90 (training) + 20 (parameter server) + 10 (image server)

Adam : Results

<Performance of training nodes> <Scaling model size with more workers>

<Accuracy of two applications: ~2x accuracy improvement using 30x fewer machines>

• Data & model parallel approach

– Considers the three properties of ML stated below

• Three properties of general ML

– Error tolerance

• Robustness against limited errors in the middle of calculation

– Dynamic structural dependency

• Changes in correlation between parameters

– Non-uniform convergence

• Differences in convergence speed across parameters

Petuum

• Scheduler

– The core of the model-parallelism support

– User can schedule which parameters are updated by schedule( )

– Partial updates are aggregated by pull( )

• Worker

– Parameters are received by schedule( )

– Updates are computed by push( )

– Any data storage system can be used

• Parameter server

– Uses the Stale Synchronous Parallel (SSP) consistency model

– Table-based or key-value stores

Petuum : Architecture

• A parallel consistency model

– Limits the difference in the number of iterations progressed between workers (see the sketch below)

– Reduces network synchronization and communication costs, relying on error-tolerant convergence

Petuum : Stale Synchronous Parallel (SSP)

Ho et al., "More effective distributed ML via a stale synchronous parallel parameter server," NIPS 2013
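A minimal sketch of SSP-style bounded staleness with assumed names: each worker advances its own clock, but blocks whenever advancing would put it more than STALENESS iterations ahead of the slowest worker.

```python
# Minimal sketch (illustrative) of Stale Synchronous Parallel bounded staleness:
# a fast worker may run ahead, but only within the staleness bound.
import threading

STALENESS = 2
clocks = [0, 0, 0]                      # per-worker iteration counters
cv = threading.Condition()

def worker(wid, iters=10):
    for t in range(iters):
        with cv:
            # Block until the gap to the slowest worker is below the staleness bound.
            while clocks[wid] - min(clocks) >= STALENESS:
                cv.wait()
            clocks[wid] += 1            # finish one iteration ("clock tick")
            cv.notify_all()
        # ... compute and apply parameter updates here (omitted) ...

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```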

• High relative speedup compared to other implementations

• Near-linear speed-up by increasing machines

Petuum : Performance

• Distributed deep learning platform for big data analytics

– Supports CNNs, RBMs, RNNs, and others

– Flexible enough to run synchronous, asynchronous, and hybrid frameworks

– Supports various neural-net partitioning schemes

• Design goals

– Generality

• Different categories of models

• Different training frameworks

– Scalability

• Scalable to large models and training datasets

– ex) Trained with 1 billion parameters and 10M images

– Ease of use

• Provides a simple programming model

• Supports built-in models, Python binding, and a web interface

• Usable without much awareness of the underlying distributed platform

SINGA

• Worker Group

– Loads a subset of training data and computes gradients for a model replica

– Workers within a group run synchronously

– Different worker groups run asynchronously

SINGA : Distributed Training

• Server Group

– Maintains one ParamShard

– Handles requests from multiple worker groups for parameter updates

– Synchronizes with neighboring groups

https://singa.incubator.apache.org/docs/architecture.html

1 server group & 1 worker group (synchronous frameworks) vs. 1 server group & ≥ 1 worker groups (asynchronous frameworks):

– Co-locate worker and server: AllReduce (Baidu's DeepImage) | Dogwild (distributed Hogwild!)

– Separate worker and server groups: Sandblaster | Downpour

SINGA : Configurations

SINGA : Pros and Cons

• Pros

– Easy to use; supports programming without much awareness of the underlying distributed platform

– Distributed architecture supporting synchronous, asynchronous, and hybrid updates

• Cons

– Limited scale-up support (e.g., no support for GPUs)

• In the era of big data, deep learning techniques show higher accuracy than traditional machine learning algorithms.

• However, deep learning often requires a huge amount of resources to achieve state-of-the-art performance on large-scale data.

• This talk provides a survey of recent proposals for alleviating the computational challenges involved in training large-scale deep neural networks.

– With emphasis on examples of scale-up and scale-out techniques

Summary