Data Science on Hadoop

© Hortonworks Inc. 2011 – 2015. All Rights Reserved. Data Science on Hadoop, Yifeng Jiang, March 10, 2015

Upload: yifeng-jiang

Post on 17-Jul-2015

TRANSCRIPT

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Data Science on Hadoop, Yifeng Jiang, March 10, 2015

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Yifeng Jiang: Solutions Engineer @ Hortonworks Japan, HBase book author, @uprush

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Business Intelligence (BI) vs. Data Science

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    CDR (call detail records), NPTB (next product to buy)

    360

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ROI

    Amazon: 35%

    Netflix: 75%

    CTR

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    /

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ...

    OCR

    NLP

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ETL

    Java Scala

    Python

    NLP

    SQL, Excel

    Hadoop, Pig, Hive

    Solr

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ETL

    Java Scala Python

    NLP

    Hadoop, Pig, Hive, Solr

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    NLP, R, MATLAB, SAS, SQL

    Hadoop, Pig/Hive, MapReduce, Java, Python, Perl, SQL, C++, NoSQL (HBase, Cassandra, MongoDB)

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    WALL-E 700

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    CTR (click-through rate) prediction

    Rank = bid * CTR
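The ranking rule on this slide can be sketched in a few lines. The ad names, bids, and CTR values below are made-up illustrations, and `rank_ads` is a hypothetical helper; in practice the CTR would come from a trained model rather than a constant.

```python
# Hypothetical ads: "bid" is the advertiser's bid, "ctr" the predicted click-through rate.
ads = [
    {"ad": "A", "bid": 2.0, "ctr": 0.010},
    {"ad": "B", "bid": 1.0, "ctr": 0.030},
    {"ad": "C", "bid": 3.0, "ctr": 0.005},
]


def rank_ads(ads):
    """Order ads by Rank = bid * CTR, highest first."""
    return sorted(ads, key=lambda a: a["bid"] * a["ctr"], reverse=True)


ranked = rank_ads(ads)
print([a["ad"] for a in ranked])
```

Here the lowest bid wins the top slot because its predicted CTR is three times higher than the next ad's, which is why the accuracy of the CTR estimate matters.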

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Collaborative Filtering

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Model

    (Train)

    Feature Matrix

    Feature Vector

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    :

    ID   Total$  Age  City  Target
    101  200     25   SF
    102  350     35   LA
    103  25      15   LA

    Feature Matrix Feature Engineering

    Raw Transforms

    Signal Processing

    OCR

    Geo-spatial

    Normalize

    Transform/aggregate

    Sample

    Dimensionality reduction

    Feature Selection

    NLP

    Mutual Information

    Raw data: TB, PB

    Feature matrix: MB, GB
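As a small illustration of the "Normalize" and "Transform" steps, the sketch below min-max scales the numeric columns of the three example rows and one-hot encodes the city. The helper names `min_max` and `one_hot` are illustrative assumptions, and plain Python is used only to keep the example self-contained.

```python
# The three example rows from the slide (Total$, Age, City).
rows = [
    {"id": 101, "total": 200.0, "age": 25, "city": "SF"},
    {"id": 102, "total": 350.0, "age": 35, "city": "LA"},
    {"id": 103, "total": 25.0, "age": 15, "city": "LA"},
]


def min_max(values):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    return categories, [[1.0 if v == c else 0.0 for c in categories] for v in values]


totals = min_max([r["total"] for r in rows])
ages = min_max([r["age"] for r in rows])
cities, city_cols = one_hot([r["city"] for r in rows])

# Columns: scaled Total$, scaled Age, then one indicator column per city.
feature_matrix = [[t, a] + c for t, a, c in zip(totals, ages, city_cols)]
```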

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    :

    Shopper ID  TX ID  Apple  Banana  Honey  Milk  Bread
    101         TX 1   4      5       1      1     0
    102         TX 2   0      2       0      1     1
    103         TX 3   0      0       0      0     2
    101         TX 4   1      1       0      0     0

           Apple  Banana  Honey  Milk  Bread
    Price  $2     $1      $5     $3    $4

    Shopper ID  Age  City  Size of household
    101         25   SF    4
    102         35   LA    3

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    :

    Shopper ID  # Tx  Total $  Age  City
    101         10    $200     25   SF
    102         15    $350     35   LA
    103         2     $25      15   LA
                25    $5       15   NYC
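A minimal sketch of how a per-shopper feature matrix like this one could be aggregated from the transaction, price, and shopper tables shown earlier. It uses only the four sample transactions, so the counts and totals it produces are smaller than the figures on this slide, which presumably cover a longer purchase history. The function name `build_feature_matrix` and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

prices = {"Apple": 2, "Banana": 1, "Honey": 5, "Milk": 3, "Bread": 4}

# (shopper ID, transaction ID, quantities per item), copied from the sample table.
transactions = [
    (101, "TX 1", {"Apple": 4, "Banana": 5, "Honey": 1, "Milk": 1, "Bread": 0}),
    (102, "TX 2", {"Apple": 0, "Banana": 2, "Honey": 0, "Milk": 1, "Bread": 1}),
    (103, "TX 3", {"Apple": 0, "Banana": 0, "Honey": 0, "Milk": 0, "Bread": 2}),
    (101, "TX 4", {"Apple": 1, "Banana": 1, "Honey": 0, "Milk": 0, "Bread": 0}),
]

demographics = {
    101: {"age": 25, "city": "SF"},
    102: {"age": 35, "city": "LA"},
}


def build_feature_matrix(transactions, prices, demographics):
    """Aggregate transactions into one row per shopper and join demographics."""
    tx_count = defaultdict(int)
    total = defaultdict(float)
    for shopper_id, _tx_id, basket in transactions:
        tx_count[shopper_id] += 1
        total[shopper_id] += sum(prices[item] * qty for item, qty in basket.items())

    rows = []
    for shopper_id in sorted(tx_count):
        demo = demographics.get(shopper_id, {})  # missing shoppers get None values
        rows.append({
            "shopper_id": shopper_id,
            "num_tx": tx_count[shopper_id],
            "total": total[shopper_id],
            "age": demo.get("age"),
            "city": demo.get("city"),
        })
    return rows


features = build_feature_matrix(transactions, prices, demographics)
```

On a cluster the same group-and-join would typically be written in Pig or Hive; the point here is only the shape of the transformation.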

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    :

    ID   Total$  Age  City
    101  $200    25   SF    2
    102  $350    35   LA    2
    103  $25     15   LA    1

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ?

    Example: 10M rows × 100 features × 8 bytes (double) = ~7.5 GB
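The size estimate can be checked with a few lines of arithmetic. The function name `feature_matrix_bytes` is an illustrative assumption; the calculation assumes a dense matrix of 8-byte doubles with no per-row overhead.

```python
def feature_matrix_bytes(rows, cols, bytes_per_value=8):
    """Size of a dense matrix, ignoring any container overhead."""
    return rows * cols * bytes_per_value


size_bytes = feature_matrix_bytes(10_000_000, 100)
size_gib = size_bytes / 2**30  # binary gigabytes

print(f"{size_bytes:,} bytes = {size_gib:.2f} GiB")
```

The result is 8 billion bytes, or roughly 7.5 GiB, small enough to fit in memory on a single large node.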

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    :

    Split the data into a training set (70%) and a test set (30%)
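A minimal sketch of a 70/30 split, assuming the two percentages on this slide refer to the training and test sets. The function name `train_test_split` and the fixed seed are illustrative choices; libraries such as scikit-learn ship their own equivalent.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
print(len(train), len(test))
```

Shuffling before cutting matters when the rows are ordered, for example by date or by customer ID.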

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Confusion matrix (rows: predicted class, columns: actual class):

                Yes               No
    Yes         True positives    False positives
    No          False negatives   True negatives

    Metrics derived from the confusion matrix:

    Precision = % of positive predictions that are correct
    Recall = % of positive instances that were predicted as positive
    F1 = a measure of a test's accuracy, combining precision and recall
    Accuracy = % of correct classifications
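The four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the counts in the usage example are made up, and `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```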

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ALS

    MySQL / HBase

    Hadoop

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    YARN Data Lake 2013 YARN Hadoop

    YARN Data Lake Data Lake Hadoop

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    ?

    6 9

    Schema change

    HDFS

    ?

    3

    Schema on read

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    OCR

    NLP

    Hadoop

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    Feature Engineering

    Raw Transforms

    Signal Processing

    OCR

    Geo-spatial

    Normalize

    Transform/aggregate

    Sample

    Dimensionality reduction

    Feature Selection

    NLP

    Mutual Information

    Frequent Itemset

    Anomaly Detection

    Clustering

    Collaborative Filtering

    Regression

    Classification

    Supervised Learning

    Unsupervised Learning

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    R, Python (scikit-learn), or SAS

    Mahout

    Spark MLlib

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    R, Python (scikit-learn), or SAS

    Mahout, Spark MLlib

    Grid search on Hadoop
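Grid search suits a cluster because every parameter combination can be trained and scored independently. The single-machine sketch below shows the loop that would be distributed. The toy data, the one-parameter ridge model, and the names `fit_ridge_1d`, `mse`, and `grid_search` are all illustrative assumptions, not part of the talk.

```python
from itertools import product

# Hypothetical toy data: y is roughly 2 * x.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
valid = [(5.0, 10.1), (6.0, 11.9)]


def fit_ridge_1d(data, lam):
    """Closed-form slope of y = w * x with an L2 penalty of strength lam."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grid_search(grid, train, valid):
    """Try every parameter combination and keep the lowest validation error."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = mse(fit_ridge_1d(train, **params), valid)  # independent per combination
        if best is None or score < best[1]:
            best = (params, score)
    return best


best_params, best_score = grid_search({"lam": [0.0, 0.1, 1.0, 10.0]}, train, valid)
print(best_params, best_score)
```

Each iteration of the loop in `grid_search` touches no shared state, so the combinations could be handed out as separate tasks and only the scores collected at the end.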

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    :20M PMML (e.g., Zementis, Pattern) Python, R, Java,

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    Distributed K-means: Spark MLlib & Mahout

    Collaborative Filtering with Alternating Least Squares (ALS): Mahout, Spark MLlib

    Collaborative Filtering: Mahout
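For reference, the K-means iteration that the distributed implementations parallelize can be sketched on a single machine as follows. This is plain Lloyd's algorithm with caller-supplied starting centroids, not the Spark or Mahout code; the points and the function name `kmeans` are illustrative assumptions.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)
```

The assignment step is independent per point and the update step is a per-cluster average, which is what makes the algorithm map naturally onto a distributed engine.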

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop and R

    R

    RStudio, RCloud

    Hadoop integration: RMR (map-reduce from R), RHDFS, RHive, RHBase, RODBC

    Rstudio, Rcloud Rhadoop RHive

    YARN

    R high-memory node

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop and Python

    Python

    Python UI: IPython

    Hadoop integration: PyDoop (HDFS and Hadoop MapReduce from Python)

    Pig: Python UDFs

    IPython Pandas, Scikit-learn Numpy, Scipy Matplotlib PyDoop

    Python, Scikit-learn, Pandas

    YARN

    Python high-memory node

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop and Spark

    Spark (with MLlib) on an edge node: Scala API, Java API, Python API

    Spark on YARN

    Spark ML-Lib Edge node

    YARN

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Hadoop

    Hadoop

    HadoopYARN

    Hadoop

  • Hortonworks Inc. 2011 2015. All Rights Reserved

    Thank You! Yifeng Jiang Solutions Engineer