apache mahout: driving the yellow elephant

Post on 10-May-2015

8.363 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Apache Mahout – Driving the Yellow ElephantGrant Ingersoll

TriHUG http://www.trihug.org

Lucid Imagination, Inc.

Anyone Here Use Machine Learning?

•Any users of:o Google?

Search?

Priority Inbox?

o Facebook?

o Twitter?

o LinkedIn?

Lucid Imagination, Inc.

Topics

•What is Machine Learning?

•ML Use Cases

•What is Mahout?

•A Word on Scaling

•What can I do with it right now?

•Mahout and Hadoop: An Example

Lucid Imagination, Inc.

Amazon.com

What is Machine Learning?

Google News

Lucid Imagination, Inc.

Really it’s…

•“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”o Intro. To Machine Learning by E. Alpaydin

•Subset of Artificial Intelligence

•Lots of related fields:o Information Retrieval

o Stats

o Biology

o Linear algebra

o Many more

Lucid Imagination, Inc.

Common Use Cases

•Recommend friends/dates/products

•Classify content into predefined groups

•Find similar content based on object properties

•Find associations/patterns in actions/behaviors

•Identify key topics in large collections of text

•Detect anomalies in machine output

•Ranking search results

•Others?

Lucid Imagination, Inc.

Apache Mahout

•An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software Licenseo http://mahout.apache.org

•Why Mahout?o Many Open Source ML libraries either:

Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License ;-) Or are research-oriented

http://dictionary.reference.com/browse/mahout

Lucid Imagination, Inc.

Who uses Mahout?

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Lucid Imagination, Inc.

What does scalable mean?

•Ted Dunning (Mahout committer):o As data grows linearly, either scale linearly in time or in machines

o 2X data requires 2X time or 2X machines (or less!)

•Goal: Be as fast and efficient as possible given the intrinsic design of the algorithmo Some algorithms won’t scale to massive machine clusters

o Others fit logically on a Map Reduce framework like Apache Hadoop

o Still others will need different distributed programming models

o Be pragmatic

Lucid Imagination, Inc.

What Can I do with Mahout Right Now?

Lucid Imagination, Inc.

Recommendations

•Extensive framework for collaborative filtering

•Recommenderso User based

o Item based

•Online and Offline supporto Offline can utilize Hadoop

•Many different Similarity measureso Cosine, LLR, Tanimoto, Pearson, others

It’s Valentine’s Day soon!

Clustering

• Document levelo Group documents based

on a notion of similarityo K-Means, Fuzzy K-Means,

Dirichlet, Canopy, Mean-Shift

o Distance Measures Manhattan, Euclidean,

other

• Topic Modeling o Cluster words across

documents to identify topics

o Latent Dirichlet Allocation

Categorization

• Place new items into predefined categories:o Sports, politics, entertainment

o Recommenders

• Implementationso Naïve Bayes

o Compl. Naïve Bayes

o Decision Forests

o Linear Regression

•See Chapter 17 of Mahout in Action for Shop It To Me use case:• http://awe.sm/5FyNe

Freq. Pattern Mining

• Identify frequently co-occurrent items

• Useful for:o Query Recommendations

Apple -> iPhone, orange, OS X

o Related product placement Basket Analysis

http://www.amazon.com

Lucid Imagination, Inc.

Evolutionary

•Map-Reduce ready fitness functions for genetic programming

•Integration with Watchmakero http://watchmaker.uncommons.org/index.php

•Problems solved:o Traveling salesman

o Class discovery

o Many others

•Caveat: Hasn’t received as much attention as others

Lucid Imagination, Inc.

Other

•Primitive Collections!

•Math libraryo Vectors, Matrices, etc.

•Noise Reduction via Singular Value Decomposition

•Export from Lucene/Solr and other formats

Lucid Imagination, Inc.

Mahout and Hadoop

•Most Mahout implementations are built on Map-Reduceo Many also have sequential implementations

o Linear Regression is blazingly fast without needing M/R

•Let’s look at how K-Means is implemented in Mahout

Lucid Imagination, Inc.

K-Means

•Clustering Algorithmo Nicely parallelizable!

http://en.wikipedia.org/wiki/K-means_clustering

Lucid Imagination, Inc.

K-Means in Map-Reduce

•Input:o Mahout Vectors representing the original content

o Either: A predefined set of initial centroids (Can be from Canopy)

--k – The number of clusters to produce

•Iterateo Do the centroid calculation (more in a moment)

•Clustering Step (optional)

•Outputo Centroids (as Mahout Vectors)

o Points for each Centroid (if Clustering Step was taken)

Lucid Imagination, Inc.

Map-Reduce Iteration

•Each Iteration calculates the Centroids using:o KMeansMapper

o KMeansCombiner

o KMeansReducer

•Clustering Stepo Calculate the points for each Centroid using:

o KMeansClusterMapper

Lucid Imagination, Inc.

KMeansMapper

•During Setup:o Load the initial Centroids (or the Centroids from

the last iteration)

•Map Phaseo For each input

Calculate it’s distance from each Centroid and output the closest one

•Distance Measures are pluggableo Manhattan, Euclidean, Squared Euclidean,

Cosine, others

Lucid Imagination, Inc.

KMeansReducer

•Setup:o Load up clusters

o Convergence information

o Partial sums from KMeansCombiner (more in a moment)

•Reduce Phaseo Sum all the vectors in the

cluster to produce a new Centroid

o Check for Convergence

•Output cluster

Lucid Imagination, Inc.

KMeansCombiner

•A Combiner is like a Map-side Reducer which helps save on IO

•Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper

Lucid Imagination, Inc.

KMeansClusterMapper

•Some applications only care about what the Centroids are, so this step is optional

•Setup:o Load up the clusters and the DistanceMeasure used

•Map Phaseo Calculate which Cluster the point belongs to

o Output <ClusterId, Vector>

Lucid Imagination, Inc.

Summary

•Machine learning is all over the web today

•Mahout is about scalable machine learning

•Mahout has functionality for many of today’s common machine learning tasks

•Many Mahout implementations use Hadoop

•KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce

Lucid Imagination, Inc.

Resources

•http://mahout.apache.org•http://cwiki.apache.org/MAHOUT•{user|dev}@mahout.apache.org

•http://svn.apache.org/repos/asf/mahout/trunk

•http://hadoop.apache.org

Lucid Imagination, Inc.

Resources

•“Mahout in Action” by Owen, Anil, Dunning and Friedmano http://awe.sm/5FyNe

•“Introducing Apache Mahout” o http://www.ibm.com/developerworks/java/library/j-mahout/

•“Programming Collective Intelligence” by Toby Segaran

•“Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

Lucid Imagination, Inc.

References•HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg

• Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg

•Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg

•Google News: http://news.google.com

•Amazon.com: http://www.amazon.com

• Facebook: http://www.facebook.com

• Couple: http://www.vlemx.com/

•Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/o http://www.theregister.co.uk/2006/08/15/beer_diapers/

•DMOZ: http://www.dmoz.org

• Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html

top related