PyMADlib - a Python wrapper for MADlib: in-database, parallel, machine learning library

© Copyright 2011 EMC Corporation. All rights reserved.

Srivatsan Ramanujam, Senior Data Scientist

Greenplum


Agenda

• Greenplum UAP overview
  – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
  – GPDB Architecture

• MADlib
  – Overview
  – Algorithms
  – Working Mechanism
  – Performance Comparison with Mahout

• PyMADlib
  – Overview
  – Demo in IPython Notebook

• Future Directions
  – GPHD and HAWQ


Greenplum Overview


Products


Greenplum Database - Architecture

MPP (Massively Parallel Processing) Shared-Nothing Architecture
• Master servers: query planning & dispatch
• Segment servers: query processing & data storage
• Network interconnect linking master and segment servers
• External sources: loading, streaming, etc.
• Interfaces: SQL, MapReduce


MADlib


MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1 - dude, you got skills.
2 - dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
  – MAD Skills: New Analysis Practices for Big Data
  – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
  – http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
  – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.




Current Modules

Data Modeling

Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
• Cox-Proportional Hazards Regression
• Conditional Random Field

Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation

Descriptive Statistics

Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Profile

Quantile

Inferential Statistics

Hypothesis tests

Support

Array Operations

Conjugate Gradient

Sparse Vectors

Probability Functions

Random Sampling


MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net


How does it work? A Linear Regression Example
• Finding linear dependencies between variables
  – y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

Design matrix X (columns x1, x2); vector of dependent variables y (column y).


Reminder: Linear-Regression Model
• Model: $y = Xc + \varepsilon$
• If the residuals are i.i.d. Gaussian with standard deviation σ:
  – max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective (in c)
  $\lVert y - Xc \rVert_2^2$
  yield the minimizer
  $\hat{c} = (X^T X)^{-1} X^T y$
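For reference, the first-order condition can be written out step by step. This is the standard least-squares derivation, with X the design matrix, y the vector of dependent variables and c the coefficient vector; the name f for the objective is introduced here only for illustration, and the last line assumes that $X^T X$ is invertible.

```latex
\begin{aligned}
f(c) &= \lVert y - Xc \rVert_2^2 = (y - Xc)^T (y - Xc) \\
\nabla_c f(c) &= -2\, X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y
\end{aligned}
```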


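Linear Regression: Streaming Algorithm
• How to compute $\hat{c} = (X^T X)^{-1} X^T y$ with a single table scan?
  – $X^T X = \sum_i x_i x_i^T$ and $X^T y = \sum_i x_i y_i$ are sums over rows, so both can be accumulated in one pass and the small linear system solved once at the end.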
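The single-scan idea can be sketched in plain Python. This is only an illustration under stated assumptions, not MADlib's implementation: the names LinregrState, solve and fit are made up for this example, and the rows are the six rows returned by the query in the earlier linear regression example, with a constant 1 prepended for the intercept c0.

```python
# Sketch: ordinary least squares from a single pass over the rows.
# LinregrState, solve and fit are illustrative names, not MADlib's API.


class LinregrState:
    """Running sums X^T X and X^T y for the rows seen so far."""

    def __init__(self, k):
        self.k = k
        self.xtx = [[0.0] * k for _ in range(k)]
        self.xty = [0.0] * k

    def update(self, x, y):
        # Add this row's contribution: x x^T to X^T X and x * y to X^T y.
        for i in range(self.k):
            self.xty[i] += x[i] * y
            for j in range(self.k):
                self.xtx[i][j] += x[i] * x[j]


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def fit(rows):
    """rows yields (y, x1, x2, ...); a leading 1 is added for the intercept."""
    state = None
    for y, *xs in rows:
        x = [1.0] + list(xs)
        if state is None:
            state = LinregrState(len(x))
        state.update(x, y)
    return solve(state.xtx, state.xty)


# The six (y, x1, x2) rows from the example query.
rows = [
    (10.14, 0.00, 0.3), (11.93, 0.69, 0.6), (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
c0, c1, c2 = fit(iter(rows))  # iter() stresses that rows are consumed once
print(c0, c1, c2)
```

The state kept between rows has a fixed size (k × k plus k numbers for k coefficients), independent of the number of rows, which is what makes a single table scan sufficient.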


Linear Regression: Parallel Computation
• Segment 1 computes $X_1^T y_1$; Segment 2 computes $X_2^T y_2$
• Master combines the partial results: $X^T y = X_1^T y_1 + X_2^T y_2$
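The segment/master split can be sketched as a transition step that runs on each segment's rows and a merge step that adds the partial results. This is a minimal sketch under stated assumptions: the function names (transition, merge, scan, empty_state) and the two-way split of the example rows are hypothetical and chosen for illustration; they are not MADlib's API. The same merge applies to $X^T X$.

```python
# Sketch of per-segment aggregation followed by a merge on the master.
# All function names here are illustrative, not part of MADlib.


def empty_state(k):
    """Zero-initialised (X^T X, X^T y) pair for k coefficients."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, x, y):
    """Fold one row into a segment-local state."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state


def merge(a, b):
    """Combine two partial states by element-wise addition."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def scan(rows, k):
    """Run the transition over every row of one segment."""
    state = empty_state(k)
    for x, y in rows:
        state = transition(state, x, y)
    return state


# Rows as (x, y) with x = [1, x1, x2], split across two hypothetical segments.
data = [
    ([1.0, 0.00, 0.3], 10.14), ([1.0, 0.69, 0.6], 11.93),
    ([1.0, 1.10, 0.9], 13.57), ([1.0, 1.39, 1.2], 14.17),
    ([1.0, 1.61, 1.5], 15.25), ([1.0, 1.79, 1.8], 16.15),
]
segment_1, segment_2 = data[:3], data[3:]

merged = merge(scan(segment_1, 3), scan(segment_2, 3))  # parallel-style result
single = scan(data, 3)                                  # one scan over all rows
```

Because addition is associative, merged and single agree up to floating-point rounding, so the segments never need to exchange rows, only their small partial sums.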


Performance Comparison : Test Setup on AWB

• AWB (Analytics Workbench)
  – 1000-node cluster located in Las Vegas
  – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
  – 8000+ Map task capacity, 5000+ Reduce task capacity
  – GPHD 1.1, GPDB 4.2.3

• Mahout v0.7

• MADlib v0.5
  – With small LMF change to allow 4-byte integer values

• Test matrix
  – Data size (# rows/records, # columns/features)
  – Algorithms
  – Algorithm parameters (e.g. convergence threshold, # iterations)
  – GPDB segment / MR (Map-Reduce) task configurations


Performance & Scalability Results (summary)

• Whitepaper coming out shortly!


Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]


Logistic Regression

[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]


K-Means Clustering

[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]


K-Means Clustering

[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]


PyMADlib : Python + MADlib = Awesome!


Motivation

• SQL is undeniably the most straightforward way to query data

• But not necessarily designed for data science

• SQL is great for many things, but it’s not nearly enough


MADlib is a godsend!

• So why do we need anything else?
  – UI is still all in SQL
  – Need to tap into rich visualization libraries

• Empowers data scientists to run canned machine learning routines – focus less on coding, more on science

• In-database, explicitly parallel.


Then which interface is favored by and familiar to data scientists?

• Depends on who you ask

• Left survey is for “higher level languages,” and right survey is for “lower level languages”


Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?

• PL/X’s are wonderful, but:
  – It still requires non-trivial knowledge of SQL to use effectively
  – Mostly limited to explicitly parallel jobs
  – Primarily a SQL interface to the end user

• Need an interface that is:
  – Less SQL, more R/Python/SAS
  – Implicitly parallelized
  – More scalable

• SAS HPA = $$$$$


The challenge

• MADlib
  – Open source
  – Extremely powerful/scalable
  – Growing algorithm breadth
  – SQL

• Python/R
  – Open source
  – Memory limited
  – High algorithm breadth
  – Language/interface purpose-designed for data science

• SAS
  – High user loyalty
  – Non-HPA is memory limited, HPA requires investment
  – High algorithm breadth
  – Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R


Simple solution: Translate Python code into SQL

• All data stays in DB and all model estimation and heavy lifting done in DB by MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC

• Best of both worlds: the number-crunching power of MADlib along with the rich visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do the heavy lifting on your Greenplum/PostgreSQL database while you program in your favorite language: Python.

[Diagram: Python sends SQL to execute MADlib to the database over ODBC/JDBC; model output is returned to Python over ODBC/JDBC.]
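The translation step can be sketched as a function that turns a Python-side model specification into a SQL string. This is a hedged illustration only: build_linregr_sql and _ident are hypothetical helpers, not PyMADlib's actual API, and the `madlib.linregr(y, x)` aggregate call is an assumption about the installed MADlib version (check the MADlib user guide for the function available in yours). The table and column names come from the earlier linear regression example.

```python
# Sketch: build the SQL that would run a MADlib linear regression in-database.
# build_linregr_sql is a hypothetical helper, NOT PyMADlib's real API, and the
# madlib.linregr(y, x) call is an assumption about the MADlib version in use.
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Accept only plain SQL identifiers so names can be inlined safely."""
    if not _IDENT.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def build_linregr_sql(table, dep_var, indep_vars, intercept=True):
    """Return a SQL string; no data leaves the database at this point."""
    cols = [_ident(c) for c in indep_vars]
    terms = (["1"] if intercept else []) + cols
    return (
        f"SELECT (madlib.linregr({_ident(dep_var)}, "
        f"ARRAY[{', '.join(terms)}])).* FROM {_ident(table)};"
    )


sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
# SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

In a real session the string would be handed to a database connection, and only the small result row (coefficients and fit statistics) would travel back to Python, which matches the point that only SQL strings and model output cross the connection.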


Demo

PyMADlib Tutorial – IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846





Where do I get it ?

$ pip install pymadlib


I don’t have GPDB or MADlib – What do I do ?

• Greenplum Database Community Edition is freely available for single node installations on multiple platforms
  – Written permission may be requested from EMC/Greenplum for research use for multi-node installations

• MADlib is free and open-source
  – Downloadable for multiple platforms from https://github.com/madlib/madlib

• PyMADlib is also free and open-source
  – Downloadable from https://github.com/vatsan/pymadlib



Future Directions


Greenplum HD

• HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop

• SQL Standards Compliant
  – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions

• ACID Compliant


HAWQ – Architecture


Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/




HAWQ: Deep Scalable Analytics

What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation

• Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib


Questions?

@[email protected]







Appendix


Datasets

The following datasets were used in comparing the performance of MADlib with Mahout:

– KDD Cup 2009 Orange marketing churn data (16.5 MB)
  • About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
  • About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
  • About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
  • About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
  • About 400,000 users, 900 movies, and 4.5 million ratings
