PyMADlib - a Python wrapper for MADlib: in-database, parallel, machine learning library
DESCRIPTION
These are slides from my talk at DataDay Texas in Austin on 30 Mar 2013 (http://2013.datadaytexas.com/schedule)
Favorite and fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
© Copyright 2011 EMC Corporation. All rights reserved.
Srivatsan Ramanujam
Senior Data Scientist
Greenplum
Agenda
• Greenplum UAP overview
  – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
  – GPDB Architecture
• MADlib
  – Overview
  – Algorithms
  – Working Mechanism
  – Performance Comparison with Mahout
• PyMADlib
  – Overview
  – Demo in IPython Notebook
• Future Directions
  – GPHD and HAWQ
Greenplum Overview
Products
Greenplum Database - Architecture
MPP (Massively Parallel Processing) Shared-Nothing Architecture
[Figure: architecture diagram with the following labeled components]
• Master Servers: query planning & dispatch
• Segment Servers: query processing & data storage
• Network Interconnect
• External Sources: loading, streaming, etc.
• SQL, MapReduce
MADlib
MADlib: The Origin
UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
  1- dude, you got skills.
  2- dude, you got mad skills.
• First mention of MAD analytics was at VLDB’09
  – MAD Skills: New Analysis Practices for Big Data
  – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
  – http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
• MADlib project initiated in late 2010
  – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.
Current Modules
Data Modeling
• Supervised Learning
  – Naive Bayes Classification
  – Linear Regression
  – Logistic Regression
  – Multinomial Logistic Regression
  – Decision Tree
  – Random Forest
  – Support Vector Machines
  – Cox-Proportional Hazards Regression
  – Conditional Random Field
• Unsupervised Learning
  – Association Rules
  – k-Means Clustering
  – Low-rank Matrix Factorization
  – SVD Matrix Factorization
  – Parallel Latent Dirichlet Allocation
Descriptive Statistics
• Sketch-based Estimators
  – CountMin (Cormode-Muthukrishnan)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
• Profile
• Quantile
Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling
Inferential Statistics
• Hypothesis tests
MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net
How does it work? A Linear Regression Example
• Finding linear dependencies between variables
  – y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• Columns x1, x2: design matrix X
• Column y: vector of dependent variables y
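Reminder: Linear-Regression Model
• Model: $y_i = c^T x_i + \varepsilon_i$, with residuals $\varepsilon_i$
• If residuals i.i.d. Gaussians with standard deviation σ:
  – max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective (in c)
  $f(c) = \sum_i (y_i - c^T x_i)^2$
  yield the minimizer
  $\hat{c} = (X^T X)^{-1} X^T y$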
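The step from the quadratic objective to the minimizer can be written out as follows. This is the standard least-squares derivation rather than text from the slides. It assumes $X$ is the design matrix whose rows are the $x_i^T$, $y$ is the vector of dependent variables, and $X^T X$ is invertible.

```latex
\begin{aligned}
f(c) &= \sum_i (y_i - c^T x_i)^2 = \lVert y - Xc \rVert^2 \\
\nabla_c f(c) &= -2\, X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y
\end{aligned}
```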
Linear Regression: Streaming Algorithm
• How to compute $(X^T X)^{-1} X^T y$ with a single table scan?
[Figure: the solution is assembled from the two products $X^T X$ and $X^T y$]
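A minimal sketch of the single-scan idea, in plain Python rather than MADlib's actual in-database implementation: each row is folded into running sums for $X^T X$ and $X^T y$, and the small linear system is solved once at the end. The helper names `accumulate` and `solve` are hypothetical, and the intercept is modeled by prepending a constant 1 to every row, which is an assumption about how the design matrix is built.

```python
def accumulate(xtx, xty, x, y):
    """Fold one row (x, y) into the running sums X^T X and X^T y."""
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj


def solve(a, b):
    """Solve a * c = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# (y, x1, x2) rows from the example table in this deck
rows = [
    (10.14, 0.0, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]

k = 3  # intercept + x1 + x2
xtx = [[0.0] * k for _ in range(k)]
xty = [0.0] * k
for y, x1, x2 in rows:  # the single pass over the table
    accumulate(xtx, xty, (1.0, x1, x2), y)

coef = solve(xtx, xty)  # [c0, c1, c2]
print(coef)
```

The state kept between rows is only a k-by-k matrix and a k-vector, so its size depends on the number of features and not on the number of rows.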
Linear Regression: Parallel Computation
[Figure: Segment 1 computes $X_1^T y_1$ and Segment 2 computes $X_2^T y_2$; on the master, $X^T y = X_1^T y_1 + X_2^T y_2$]
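The per-segment computation on this slide can be sketched in plain Python. Each segment computes a partial $X_k^T y_k$ over its own rows, and the partials are summed element-wise to give $X^T y$. The function name `partial_xty` and the split of the six example rows into two "segments" are illustrative assumptions, not how GPDB actually distributes data.

```python
def partial_xty(rows):
    """Partial sum X_k^T y_k over the rows held by one segment."""
    acc = [0.0, 0.0, 0.0]
    for y, x1, x2 in rows:
        for i, xi in enumerate((1.0, x1, x2)):  # 1.0 models the intercept
            acc[i] += xi * y
    return acc


# (y, x1, x2) rows from the example table in this deck
rows = [
    (10.14, 0.0, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]

segment_1, segment_2 = rows[:3], rows[3:]  # hypothetical data distribution

# Each segment works independently on its own rows.
partials = [partial_xty(segment) for segment in (segment_1, segment_2)]

# The "master" step: element-wise sum of the per-segment partials.
xty = [sum(values) for values in zip(*partials)]
print(xty)
```

The same merge-by-addition applies to $X^T X$, which is why the work can be spread across segments without moving the rows themselves.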
Performance Comparison : Test Setup on AWB
• AWB
  – 1000-node cluster located in Las Vegas
  – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
  – 8000+ Map task capacity, 5000+ Reduce task capacity
  – GPHD 1.1, GPDB 4.2.3
• Mahout v0.7
• MADlib v0.5
  – With small LMF change to allow 4-byte integer values
• Test matrix
  – Data size (# rows/records, # columns/features)
  – Algorithms
  – Algorithm parameters (e.g. convergence threshold, # iterations)
  – GPDB segment / MR (Map-Reduce) task configurations
Performance & Scalability Results (summary)
• Whitepaper coming out shortly!
Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation
[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes]
Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Min]
K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Min]
PyMADlib : Python + MADlib = Awesome!
Motivation
• SQL is undeniably the most straightforward way to query data
• But not necessarily designed for data science
• SQL is great for many things, but it’s not nearly enough
MADlib is a godsend!
• So why do we need anything else?
  – UI is still all in SQL
  – Need to tap into rich visualization libraries
• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science
• In-database, explicitly parallel.
Then which interface is favored by and familiar to data scientists?
• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”
Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
  – It still requires non-trivial knowledge of SQL to use effectively
  – Mostly limited to explicitly parallel jobs
  – Primarily a SQL interface to the end user
• Need an interface that is:
  – Less SQL, more R/Python/SAS
  – Implicitly parallelized
  – More scalable
• SAS HPA = $$$$$
The challenge
• MADlib
  – Open source
  – Extremely powerful/scalable
  – Growing algorithm breadth
  – SQL
• Python/R
  – Open source
  – Memory limited
  – High algorithm breadth
  – Language/interface purpose-designed for data science
• SAS
  – High user loyalty
  – Non-HPA is memory limited, HPA requires investment
  – High algorithm breadth
  – Language/interface purpose-designed for data science
• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
Simple solution: Translate Python code into SQL
• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib
• Only strings of SQL and model output transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
[Figure: Python sends SQL to execute MADlib over ODBC/JDBC; model output is returned to Python]
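A minimal sketch of the "translate Python into SQL" idea, not PyMADlib's actual API: a Python function builds the SQL string that asks MADlib to fit a linear regression, so only that string and the resulting coefficients would need to cross the database connection. The function name `linregr_sql` is hypothetical, and the aggregate form `madlib.linregr(y, x)` is an assumption based on early MADlib releases; check the MADlib documentation for the version you run.

```python
def linregr_sql(table, dep_var, indep_vars, schema="madlib"):
    """Build SQL text for an in-database linear regression (hypothetical helper).

    Assumes MADlib exposes linregr(y, x) as an aggregate in `schema`.
    A constant 1 is added to the feature array to model the intercept.
    Names are interpolated as-is, so they must come from trusted code.
    """
    features = ", ".join(["1"] + list(indep_vars))
    return "SELECT ({schema}.linregr({y}, ARRAY[{x}])).* FROM {table};".format(
        schema=schema, y=dep_var, x=features, table=table
    )


sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
# prints: SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

In a real wrapper this string would be handed to a database driver, and the returned row of coefficients and statistics would be the only data pulled into Python.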
Demo
PyMADlib Tutorial – IPython Notebook Viewer Link
http://nbviewer.ipython.org/5275846
Where do I get it?
$ pip install pymadlib
I don’t have GPDB or MADlib – What do I do?
• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
  – Written permission may be requested from EMC/Greenplum for research use for multi-node installations
• MADlib is free and open-source
  – Downloadable for multiple platforms from https://github.com/madlib/madlib
• PyMADlib is also free and open-source
  – Downloadable from https://github.com/vatsan/pymadlib
Future Directions
Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop
• SQL Standards Compliant
  – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions
• ACID Compliant
HAWQ – Architecture
Performance: HAWQ [1] vs. Hive vs. Impala [2]
All experiments were run on a 60 node deployment with Analytics Workbench [3]
[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/
HAWQ: Deep Scalable Analytics
What’s inside the box?
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation

• Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib
Questions?
https://github.com/vatsan/pymadlib
Appendix
Datasets
The following datasets were used in comparing the performance of MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
  • About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
  • About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
  • About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
  • About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
  • About 400,000 users, 900 movies, and 4.5 million ratings