AI&BigData Lab 2016. Peter Rudenko: Specifics of training, ...

Training, tuning, selecting & serving of machine learning models at scale

Peter Rudenko, @peter_rud


Typical machine learning workflow

● Input data
● ETL
● Preprocessing, feature engineering
● Data partitioning
● Model tuning (selecting the best hyperparameters)
● Model training (optimising model parameters)
● Prediction (low latency or batch)

Automatic Machine Learning

FBLearner Flow

Deep Feature Synthesis: Towards Automating Data Science Endeavors (MIT)

Datarobot.com

Test data

https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/

http://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uploads/Site/DSAA_DSM_2015.pdf

http://datarobot.com


Input data

Balanced vs skewed target distribution

The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size

http://blog.mrtz.org/2015/03/09/competition.html
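
One way to handle partitioning with a skewed target is a stratified split, so the test partition keeps the same class ratio as the full data. The sketch below is a minimal illustration, not the speaker's method; the function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class keeps its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, at least one example.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical skewed target: 5% positives.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because train and test indices are disjoint, no row can leak across the partition boundary; leakage through features (for example, a column derived from the label) still has to be checked separately.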

In [41]: import numpy

In [42]: ar2d = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='C')  # row-major (C) layout

In [43]: ' '.join(str(b) for b in ar2d.tobytes(order='A'))  # bytes in memory order

Out[43]: '1 2 3 11 12 13 10 20 40'

In [44]: ar2df = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='F')  # column-major (Fortran) layout

In [45]: ' '.join(str(b) for b in ar2df.tobytes(order='A'))  # bytes in memory order

Out[45]: '1 11 10 2 12 20 3 13 40'



Big Data?

Criteo 1 TB data:

Data size:
● ~46 GB/day
● ~180,000,000/day
● ~3.5% event rate

Raw data: [email protected]%

Data: [email protected]% (189 GB in columnar Parquet format)

Balanced classes: 70 GB (12 GB Parquet)
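
The slide does not say how the classes were balanced. A common approach, assumed here, is to keep every positive row and sample an equal number of negatives. With a ~3.5% event rate that keeps about 7% of the rows, which would be consistent with 70 GB if the raw data is roughly 1 TB. A minimal sketch on hypothetical in-memory rows (`undersample_majority` is an illustrative name):

```python
import random


def undersample_majority(rows, label_of, seed=0):
    """Keep all minority rows plus an equally sized random sample of the majority."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted((positives, negatives), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid all minority rows coming first
    return balanced


# Hypothetical rows (id, label) with a 3.5% positive rate.
rows = [(i, 1 if i % 200 < 7 else 0) for i in range(10_000)]
balanced = undersample_majority(rows, label_of=lambda r: r[1])
```

A model trained on balanced data overestimates the positive rate, so predicted probabilities usually need recalibration before serving.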

Scalability! But at what COST?

“You can have a second computer once you’ve shown you know how to use the first one.” – Paul Barham

https://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/


50 shades of machine learning

● Supervised
● Unsupervised
● Semi-supervised
● Classification
● Regression
● Sequence prediction
● Structure prediction
● Reinforcement learning
● Time series forecasting
● Clustering
● Dimensionality reduction
● Topic modeling
● Recommendation
● Online/Streaming ML
● Ranking
● Survival Analysis
● Anomaly detection

Buzzword maker: REALTIME + BIGDATA + 1 or 2 boxes above = Profit

Model state (knowledge) vs hyperparameters

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

* Pedro Domingos, A few useful things to know about machine learning, 2012.

Evaluation = LossFunction(Prediction, True label)
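
As a small illustration of Evaluation = LossFunction(Prediction, True label), the sketch below averages a per-example loss over a dataset. Log loss and squared error are chosen only as familiar examples; the function names are assumptions for this sketch.

```python
import math


def log_loss(prediction, true_label, eps=1e-15):
    """Negative log-likelihood of a binary label under a predicted probability."""
    p = min(max(prediction, eps), 1 - eps)  # clip to avoid log(0)
    return -(true_label * math.log(p) + (1 - true_label) * math.log(1 - p))


def squared_error(prediction, true_label):
    return (prediction - true_label) ** 2


def evaluate(predictions, true_labels, loss=log_loss):
    """Mean loss over all (prediction, true label) pairs."""
    total = sum(loss(p, y) for p, y in zip(predictions, true_labels))
    return total / len(true_labels)


uninformed = evaluate([0.5, 0.5], [1, 0])   # equals ln 2
confident = evaluate([0.9, 0.1], [1, 0])    # lower is better
```

Swapping the `loss` argument changes what "best model" means without touching the representation or the optimizer.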

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf


Optimization

Model parameters

Combinatorial optimization:
● Greedy search
● Beam search
● Branch-and-bound

Continuous optimization:
❖ Unconstrained
  ❏ Gradient descent
  ❏ Conjugate gradient
  ❏ Quasi-Newton methods
❖ Constrained
  ❏ Linear programming
  ❏ Quadratic programming

Hyperparameters

● Grid search
● Random search
● Bayesian optimization
● Tree of Parzen Estimators (TPE)
● Gradient-based optimization
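
The two levels of optimization can be sketched together: gradient descent fits the model parameter `w`, and random search picks the hyperparameter (the learning rate) by validation loss. The toy model y = w * x, the search range, and all names are assumptions made for this illustration.

```python
import random


def train(xs, ys, learning_rate, steps=200):
    """Inner loop: gradient descent on mean squared error for y = w * x."""
    w = 0.0  # model parameter, learned from data
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def random_search(train_set, valid_set, trials=20, seed=0):
    """Outer loop: sample learning rates, keep the one with the best validation loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, 0)  # log-uniform in [1e-4, 1]
        w = train(*train_set, learning_rate=lr)
        loss = mse(w, *valid_set)
        if best is None or loss < best[0]:
            best = (loss, lr, w)
    return best


# Hypothetical noise-free data generated with w = 3.
train_x = [i / 10 for i in range(-10, 11)]
train_set = (train_x, [3 * x for x in train_x])
valid_x = [0.25, -0.75, 0.55]
valid_set = (valid_x, [3 * x for x in valid_x])

best_loss, best_lr, best_w = random_search(train_set, valid_set)
```

Each trial is independent, so the outer loop parallelises trivially; Bayesian optimization and TPE instead use earlier trials to choose the next candidate.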

Distributed Machine Learning

                           Model fits in memory: Yes         Model fits in memory: No
Data fits in memory: Yes
Data fits in memory: No    Distributed data (HDFS, Spark)    Distributed data, distributed models

Distributed Machine Learning

Data 1, Model 1 ... Data N, Model N

Model / data parallelism

http://parameterserver.org/

https://github.com/intel-machine-learning/DistML

http://www.dmtk.io/

https://petuum.github.io/bosen.html
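
A rough sketch of synchronous data parallelism: the data is split into shards, each worker computes a gradient on its own shard against the shared model, and the gradients are averaged before the update. Workers are simulated sequentially here; this does not use the API of any of the systems linked above, and the toy model y = w * x and all names are assumptions.

```python
def shard(data, n_workers):
    """Round-robin split of the rows into one shard per worker."""
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w, rows):
    """Gradient of mean squared error for y = w * x on one shard."""
    return sum(2 * x * (w * x - y) for x, y in rows) / len(rows)


def data_parallel_fit(data, n_workers=4, learning_rate=0.5, steps=100):
    shards = shard(data, n_workers)
    w = 0.0  # shared model, as a parameter server would hold it
    for _ in range(steps):
        # In a real system these calls run on different machines.
        grads = [local_gradient(w, rows) for rows in shards]
        w -= learning_rate * sum(grads) / len(grads)
    return w


# Hypothetical noise-free data generated with w = 2.
data = [(i / 10, 2.0 * i / 10) for i in range(-10, 11)]
w = data_parallel_fit(data)
```

Only the gradients and the current `w` cross worker boundaries, which is what makes this layout practical when the data does not fit on one machine but the model does.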

Speed up distributed machine learning

● Approximate all the things
● Update asynchronously
● Early stopping
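
Early stopping can be sketched as a training loop that remembers the best validation loss and gives up after a fixed number of steps without improvement. The `patience` parameter and the toy loss curve are assumptions for this illustration.

```python
def train_with_early_stopping(step, validation_loss, max_steps=1000, patience=5):
    """Run step() until validation loss has not improved for `patience` steps."""
    best_loss = float("inf")
    best_step = 0
    since_best = 0
    stopped_at = 0
    for t in range(1, max_steps + 1):
        step()
        stopped_at = t
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_step, since_best = loss, t, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_step, best_loss, stopped_at


# Hypothetical validation curve: improves until step 20, then overfits.
state = {"t": 0}


def step():
    state["t"] += 1


def validation_loss():
    return (state["t"] - 20) ** 2 / 100 + 1.0


best_step, best_loss, stopped_at = train_with_early_stopping(step, validation_loss)
```

In practice the model state at `best_step` would be checkpointed so it can be restored after the loop ends.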

"We draw inspiration from the high-level programming models of dataflow systems, and the low-level efficiency of parameter servers." (from "TensorFlow: A system for large-scale machine learning")

A better model when time is the constraint

https://arxiv.org/pdf/1605.08695v1.pdf


Cost-based optimization

Automating Model Search for Large Scale Machine Learning

Apache SystemML: Automatic Optimization

Algorithms specified in DML and PyDML are dynamically compiled and optimized based on data and cluster characteristics using rule-based and cost-based optimization techniques. The optimizer automatically generates hybrid runtime execution plans ranging from in-memory single-node execution to distributed computations on Spark or Hadoop. This ensures both efficiency and scalability. Automatic optimization reduces or eliminates the need to hand-tune distributed runtime execution plans and system configurations.

https://amplab.cs.berkeley.edu/wp-content/uploads/2015/07/163-sparks.pdf


http://systemml.apache.org/index.html


Ensembles

● Bagging
● Boosting
● Blending
● Stacking
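
Of the ensemble types listed, bagging is the simplest to sketch: train each base model on a bootstrap sample (drawn with replacement) and combine predictions by majority vote. The threshold "stump" base learner and the toy data are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold t minimising errors of the rule: predict 1 if x > t."""
    best_t, best_err = None, None
    for t, _ in points:
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(points, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Majority vote over all base models."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]


# Hypothetical 1-D data: label is 1 when x >= 5.0.
points = [(x / 10, int(x >= 50)) for x in range(100)]
models = bagging_fit(points)
```

Boosting, blending and stacking differ in how the base models are trained and combined; they are not covered by this sketch.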

Dark knowledge

http://www.ttic.edu/dl/dark14.pdf

https://www.youtube.com/watch?v=EK61htlw8hY
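
The slide only names the idea. In the linked "Dark knowledge" material, a small model is trained on the softened output distribution of a large model or ensemble, obtained by raising the softmax temperature. A minimal sketch of that softening step, with hypothetical teacher logits:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values flatten the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 0.5]       # hypothetical outputs of a large model
hard = softmax(teacher_logits, 1.0)    # nearly one-hot
soft = softmax(teacher_logits, 5.0)    # soft targets for the small model
```

The soft targets keep the same top class but expose how the large model ranks the remaining classes, which is the extra signal the small model learns from.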

Test time prediction

● Different environment
● Different hardware
● Different requirements

Types of model transfer

1. Model serialization:
   - Bound to a single language
   - Bound to a single version
2. Metadata + data (Spark 2.0: https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html, TensorFlow Serving: https://tensorflow.github.io/serving/)
3. PMML (http://dmg.org/pmml/v4-2-1/GeneralStructure.html)
4. PFA (http://dmg.org/pfa/index.html)
5. Code generation (h2o.ai: http://h2o.ai)
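
The "metadata + data" option can be sketched as a language-neutral file: metadata describes the model type and format version, and data holds the learned parameters. The JSON layout below is invented for illustration; it is not the on-disk format of Spark or TensorFlow Serving.

```python
import json
import math


def export_model(weights, bias, feature_names):
    """Serialise a logistic regression as metadata plus parameters (hypothetical layout)."""
    return json.dumps({
        "metadata": {
            "type": "logistic_regression",
            "format_version": 1,
            "features": feature_names,
        },
        "data": {"weights": weights, "bias": bias},
    })


def load_and_predict(payload, features):
    """Rebuild the model from the payload and score one example."""
    model = json.loads(payload)
    meta = model["metadata"]
    if meta["format_version"] != 1:
        raise ValueError("unsupported format version")
    z = model["data"]["bias"] + sum(
        w * features[name]
        for w, name in zip(model["data"]["weights"], meta["features"])
    )
    return 1 / (1 + math.exp(-z))


payload = export_model([2.0, -1.0], 0.5, ["clicks", "age"])
score = load_and_predict(payload, {"clicks": 1.0, "age": 2.5})
```

Because the payload is plain JSON, the scoring side could be written in any language, unlike native serialization, which is bound to one language and library version.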

Papers & articles

http://tullo.ch/articles/decision-tree-evaluation/

https://blog.acolyer.org/2016/02/29/machine-learning-the-high-interest-credit-card-of-technical-debt/

https://blog.acolyer.org/2016/03/01/ad-click-prediction-a-view-from-the-trenches/

Automating Model Search for Large Scale Machine Learning (https://amplab.cs.berkeley.edu/wp-content/uploads/2015/07/163-sparks.pdf)



Thanks! Q&A
