Apache Mahout

5/24/2018 Apache Mahout


Guided By

Ms. Shikha Pachouly

Assistant Professor

Computer Engineering Department

6/21/2014



Machine Learning

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

Machine Learning Strategies

1) Supervised

2) Unsupervised


Common Use Cases

Recommend friends/dates/products

Classify content into predefined groups

Find similar content based on object properties

Find associations/patterns in actions/behaviors

Identify key topics in large collections of text

Detect anomalies in output

Rank search results


Apache Mahout Introduction

A machine learning library for scalable applications.

Includes core algorithms for recommendation, clustering, and classification that are implemented on top of the Hadoop MapReduce model.

Also includes core libraries that are highly optimized, so that non-distributed algorithms perform well too.


Mahout is distributed under the commercially friendly Apache Software License.

The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.

Currently Mahout supports mainly three use cases:

1) Recommendation mining

2) Clustering

3) Classification


Why Mahout

Many open source ML libraries (PyBrain, Shark, etc.) either

1) lack community

2) lack scalability

3) lack documentation and examples

Most Mahout implementations are MapReduce-enabled.


The main goal of Apache Mahout is to be useful to practitioners.

- This means implementations should be easy to use from within Java applications.

- It should be close to trivial to deploy the trained models.

- Scaling to include more and more diverse data should be simple.


Recommendations

Extensive framework for collaborative filtering

Recommenders

1) user based

2) item based

Many different similarity measures

e.g. Cosine, LLR, Tanimoto, Pearson
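
To make three of the similarity measures named above concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation: the function names are chosen for illustration, the inputs are assumed to be equal-length rating vectors (or sets of item IDs for Tanimoto), and LLR is left out to keep the example short.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: constant ratings carry no correlation signal
    return cov / math.sqrt(var_a * var_b)


def tanimoto_coefficient(items_a, items_b):
    """Tanimoto (Jaccard) coefficient between two sets of item IDs."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

Cosine and Pearson work on the rating values themselves, while Tanimoto only looks at which items two users have in common, which suits data without explicit ratings.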


Algorithms for Recommendation

User-Based Collaborative Filtering - single machine

Item-Based Collaborative Filtering - single machine / MapReduce

Matrix Factorization with Alternating Least Squares - single machine / MapReduce

Matrix Factorization with Alternating Least Squares on Implicit Feedback - single machine / MapReduce

Weighted Matrix Factorization, SVD++, Parallel SGD - single machine


User-Based Recommender
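
The idea behind a user-based recommender can be sketched in a few lines of plain Python. This is an illustrative sketch, not Mahout's code: the function names, the dictionary-of-dictionaries rating format, and the default neighbourhood size are all assumptions made for the example. It finds the users most similar to the target user (Pearson correlation over co-rated items) and scores each unseen item by the similarity-weighted average of the neighbours' ratings.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def recommend(ratings, user, neighbourhood_size=2, how_many=3):
    """Return up to how_many (item, estimated rating) pairs for user."""
    # Rank all other users by similarity and keep the nearest ones.
    neighbours = sorted(
        ((pearson(ratings[user], ratings[other]), other)
         for other in ratings if other != user),
        reverse=True,
    )[:neighbourhood_size]

    totals, weights = {}, {}
    for similarity, other in neighbours:
        if similarity <= 0:
            continue  # ignore uncorrelated or negatively correlated users
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            totals[item] = totals.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity

    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:how_many]]


# Hypothetical sample data: user -> {item: rating}
sample_ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1, "E": 4},
    "dave": {"A": 4, "B": 2, "C": 5, "E": 2},
}
recommendations = recommend(sample_ratings, "alice")
```

In this sample, carol's tastes run opposite to alice's, so she is left out of the neighbourhood, and the estimates for alice come from bob and dave only.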


Clustering


Algorithms for Clustering

K-Means Clustering

Fuzzy K-Means

Mean Shift Clustering

Dirichlet Process Clustering (For Topic Modelling)
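
As an illustration of the first algorithm in the list, the following is a minimal single-machine sketch of k-means in plain Python. It is not Mahout's MapReduce implementation; the function names, the random initialisation, and the iteration cap are assumptions made for the example. Each pass assigns every point to its nearest centroid and then moves each centroid to the mean of its assigned points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20, seed=0):
    """Cluster points (a list of tuples) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: pick k points as a start
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments
```

Fuzzy k-means follows the same loop but replaces the hard assignment with a degree of membership in every cluster.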


The clustering algorithms can also be run on Hadoop infrastructure directly from the command line.

Canopy Clustering:

bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

k-Means Clustering:

bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Fuzzy k-Means Clustering:

bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job


Classification

Algorithms implemented in Mahout for Classification

Logistic Regression - trained via SGD - single machine

Naive Bayes / Complementary Naive Bayes - MapReduce

Random Forest - MapReduce

Hidden Markov Models - single machine

Multilayer Perceptron - single machine


Running Naive Bayes from the Command Line

Three commands:

1) mahout seq2sparse

performs the TF-IDF transformation

2) mahout trainnb

trains the Naive Bayes model

3) mahout testnb

classifies the test data and evaluates the model
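
The three steps above (vectorize, train, test) can be mirrored conceptually in plain Python. This is only a sketch of the idea, not what the Mahout commands do internally: the function names, the IDF formula, and the smoothing parameter alpha are assumptions made for the example, and documents are assumed to be already tokenized into lists of words.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Learn inverse document frequencies from tokenized training documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Step 1 (vectorize): map a tokenized document to TF-IDF weights."""
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Step 2 (train): per-class log priors and smoothed log term weights."""
    vocab = {t for vec in vectors for t in vec}
    term_weight = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        term_weight[label].update(vec)  # sum TF-IDF weights per class
    model = {}
    for label, count in Counter(labels).items():
        total = sum(term_weight[label].values()) + alpha * len(vocab)
        log_likelihood = {
            t: math.log((term_weight[label][t] + alpha) / total) for t in vocab
        }
        model[label] = (math.log(count / len(labels)), log_likelihood)
    return model


def classify(model, vector):
    """Step 3 (test): return the class with the highest log score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(
            weight * log_likelihood[t]
            for t, weight in vector.items() if t in log_likelihood
        )
    return max(model, key=score)


# Hypothetical toy corpus, already tokenized.
train_docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

idf = fit_idf(train_docs)
model = train_naive_bayes([tfidf(d, idf) for d in train_docs], train_labels)
predicted = classify(model, tfidf(["cheap", "watches", "now"], idf))
```

Terms that never appeared in training are dropped at the vectorize step, so they do not influence the prediction.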


Installation of Mahout

Download the tar files of both the apache-mahout and apache-maven projects

Extract the tar files into a directory

Set the path variables for Maven

Set the present working directory to Mahout's core folder

Compile the project with 'mvn compile'

Build the project with 'mvn install'


Mahout vs. WEKA

Basis          Mahout    WEKA

Scalability    More      Less

Algorithms     Fewer     More

GUI            No        Yes

License        Apache    GPL


MAHOUT COMMERCIAL USERS

Adobe: Uses clustering algorithms to increase video consumption by better user targeting.

Amazon: For its personalization platform.

AOL: For shopping recommendations.

Twitter: Uses Mahout's LDA implementation for user interest modeling.

Yahoo! Mail: Uses Mahout's Frequent Pattern Set Mining.

Drupal: Uses Mahout to provide open source content recommendation solutions.

Evolv: Uses Mahout for its Workforce Predictive Analytics platform.

Foursquare: Uses Mahout for its recommendation engine.

Idealo: Uses Mahout's recommendation engine.


References

Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen, "Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier", 2013 IEEE International Conference on Big Data.

Rui Máximo Esteves, Chunming Rong, "Using Mahout for Clustering Wikipedia's Latest Articles", 2011 Third IEEE International Conference on Cloud Computing Technology and Science.

Kathleen Ericson and Shrideep Pallickara, "On the Performance of Distributed Data Clustering Algorithms in File and Streaming Processing Systems", 2011 Fourth IEEE International Conference on Utility and Cloud Computing.

https://mahout.apache.org/

Sean Owen, Robin Anil, "Mahout in Action", Manning Publications.


THANK YOU
