Apache Mahout

Guided By Ms. Shikha Pachouly  Assistant Professor Computer Engineering Department 6/21/2014

Upload: amol-jagtap

Post on 14-Oct-2015


  • 5/24/2018 Apache Mahout


    Machine Learning

    Machine learning is programming computers to optimize a performance criterion using example data or past experience.

    Machine Learning Strategies

    1) Supervised

    2) Unsupervised


    Common Use Cases

    Recommend friends/dates/products

    Classify content into predefined groups

    Find similar content based on object properties

    Find associations/patterns in actions/behaviors

    Identify key topics in large collections of text

    Detect anomalies in output

    Rank search results


    Apache Mahout Introduction

    Machine learning library for scalable applications

    Includes core algorithms for recommendation, clustering and classification that are implemented on top of the Hadoop MapReduce model.

    The core libraries are also highly optimized, so non-distributed algorithms perform well too.


    Mahout is distributed under a commercially friendly Apache Software License.

    The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.

    Currently Mahout mainly supports three use cases:

    1) Recommendation mining

    2) Clustering

    3) Classification


    Why Mahout

    Many open source ML libraries (PyBrain, Shark, etc.) either

    1) lack community

    2) lack scalability

    3) lack documentation and examples

    Most Mahout implementations are MapReduce enabled


    The main goal of Apache Mahout is to be useful to practitioners.

    - This means implementations should be easy to use from within Java applications.

    - It should be close to trivial to deploy the trained models.

    - Scaling to include more and more diverse data should be simple.


    Recommendations

    Extensive Framework for collaborative filtering

    Recommenders

    1) user based

    2) item based

    Many different similarity measures, e.g. Cosine, LLR, Tanimoto, Pearson
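
To make three of the similarity measures named above concrete, here is a minimal sketch in plain Python. It is independent of Mahout's Java API, and the function names are illustrative only. LLR (log-likelihood ratio) is left out to keep the sketch short.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient between two sets of item IDs."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Cosine and Pearson work on rating values, while Tanimoto only looks at which items two users have in common, so it suits data without explicit ratings.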


    Algorithms for Recommendation

    User-Based Collaborative Filtering - single machine

    Item-Based Collaborative Filtering - single machine / MapReduce

    Matrix Factorization with Alternating Least Squares - single machine / MapReduce

    Matrix Factorization with Alternating Least Squares on Implicit Feedback - single machine / MapReduce

    Weighted Matrix Factorization, SVD++, Parallel SGD - single machine


    User-Based Recommender
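
The idea behind a user-based recommender can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Mahout's implementation. The data set, the function names and the choice of cosine similarity over shared items are all assumptions made for the example.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine_on_shared(ratings_a, ratings_b):
    """Cosine similarity over the items both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, n_neighbors=2, top_n=3):
    """Rank unseen items by the similarity-weighted ratings of similar users."""
    # 1. Neighbourhood: the n most similar other users.
    sims = [
        (cosine_on_shared(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user
    ]
    neighbors = sorted(sims, reverse=True)[:n_neighbors]

    # 2. Estimate a rating for every item the user has not rated yet.
    totals, weights = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim

    estimates = {item: totals[item] / weights[item] for item in totals}
    # 3. Return the best-scoring items first.
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Calling `recommend("alice", ratings)` ranks item C ahead of item D, because bob, whose ratings match alice's most closely, rated C highly.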


    Clustering


    Algorithms for Clustering

    K-Means Clustering

    Fuzzy K-Means

    Mean Shift Clustering

    Dirichlet Process Clustering (For Topic Modelling)
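
As an illustration of the first algorithm in the list, the following is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not Mahout's MapReduce implementation. The sample points and the caller-supplied initial centroids are assumptions made to keep the example deterministic.

```python
# Hypothetical 2-D sample points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid

        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Fuzzy k-means follows the same loop but gives every point a degree of membership in each cluster instead of a single hard assignment.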


    Instead of coding against the clustering algorithms directly, we can run them on Hadoop infrastructure with commands.

    e.g. for Canopy Clustering the command is

    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

    k-Means Clustering

    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

    Fuzzy k-Means Clustering

    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job


    Classification

    Algorithms implemented in Mahout for Classification

    Logistic Regression - trained via SGD - single machine

    Naive Bayes / Complementary Naive Bayes - MapReduce

    Random Forest - MapReduce

    Hidden Markov Models - single machine

    Multilayer Perceptron - single machine
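
To show what "logistic regression trained via SGD" means in practice, here is a minimal plain-Python sketch. It is not Mahout's classifier. The function names, learning rate, epoch count and toy data are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical one-feature data set: small values are class 0, large values class 1.
samples = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(samples, labels, lr=0.1, epochs=200, seed=0):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x, y = samples[i], labels[i]
            # Gradient of the log loss for one example is (prediction - label) * x.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0
```

Because each update uses a single example, the model can be trained in one pass over a stream of data, which is why this algorithm fits the single-machine setting.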


    Running Naive Bayes from the Command Line

    Three commands

    1) mahout seq2sparse

    performs the TF/IDF transformation

    2) mahout trainnb

    trains the model using Naive Bayes

    3) mahout testnb

    performs classification and testing
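
The three steps above (vectorize, train, test) can be mirrored conceptually in plain Python. This sketch only illustrates the flow of data. It does not reproduce the behaviour or options of the Mahout commands, and every function name, the smoothing constant and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenised training documents and their labels.
train_docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "watches"],
    ["meeting", "schedule", "today"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]


def fit_idf(docs):
    """Learn an inverse document frequency for every term in the corpus."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / df) + 1.0 for term, df in doc_freq.items()}


def vectorize(doc, idf):
    """Step 1 (cf. seq2sparse): sparse TF-IDF vector; unknown terms are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Step 2 (cf. trainnb): log priors and smoothed log term weights per class."""
    vocab = {term for vec in vectors for term in vec}
    class_weights = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            class_weights[label][term] += weight
    label_counts = Counter(labels)
    model = {}
    for label, weights in class_weights.items():
        total = sum(weights.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(label_counts[label] / len(labels)),
            "log_likelihood": {
                t: math.log((weights.get(t, 0.0) + alpha) / total) for t in vocab
            },
        }
    return model


def classify(model, vector):
    """Pick the class with the highest log posterior score."""
    def score(params):
        return params["log_prior"] + sum(
            w * params["log_likelihood"][t]
            for t, w in vector.items()
            if t in params["log_likelihood"]
        )
    return max(model, key=lambda label: score(model[label]))


def evaluate_nb(model, vectors, labels):
    """Step 3 (cf. testnb): fraction of documents classified correctly."""
    correct = sum(classify(model, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)


idf = fit_idf(train_docs)
model = train_nb([vectorize(d, idf) for d in train_docs], train_labels)
```

A held-out document such as `["cheap", "buy", "now"]` is vectorized with the same `idf` table before it is passed to `classify`, so training and test data share one vocabulary.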


    Installation of Mahout

    Download the tar files of both the apache-mahout and apache-maven projects

    Extract the tar files into a directory

    Set the path variables for Maven

    Set the present working directory to Mahout's core folder

    Compile the project with 'mvn compile'

    Build the project with 'mvn install'


    Mahout vs. Weka

    Base \ Technologies    Mahout    WEKA
    Scalability            More      Less
    Algorithms             Fewer     More
    GUI                    No        Yes
    License                Apache    GPL


    MAHOUT COMMERCIAL USERS

    Adobe: Uses clustering algorithms to increase video consumption by better user targeting.

    Amazon: For its Personalization platform.

    AOL: For shopping recommendations.

    Twitter: Uses Mahout's LDA implementation for user interest modeling.

    Yahoo! Mail: Uses Mahout's Frequent Pattern Set Mining.

    Drupal: Uses Mahout to provide open source content recommendation solutions.

    Evolv: Uses Mahout for its Workforce Predictive Analytics platform.

    Foursquare: Uses Mahout for its recommendation engine.

    Idealo: Uses Mahout's recommendation engine.


