Data Science on Hadoop
TRANSCRIPT
-
Hortonworks Inc. 2011-2015. All Rights Reserved
Hadoop
Yifeng Jiang
March 10, 2015
-
Yifeng Jiang
Solutions Engineer @ Hortonworks Japan
HBase Book Author
@uprush
-
? Hadoop
-
BI
Business Intelligence
Data Science
-
CDR (call detail records), NPTB (next product to buy)
360-degree customer view
-
ROI
Amazon: 35%
Netflix: 75%
CTR
-
OCR
NLP
-
ETL
Java, Scala
Python
NLP
SQL, Excel
Hadoop, Pig, Hive
Solr
-
ETL
Java, Scala, Python
NLP
Hadoop, Pig, Hive, Solr
-
NLP, R, MATLAB, SAS, SQL
Hadoop, Pig/Hive, Map-Reduce, Java, Python, Perl, SQL, C++, NoSQL (HBase, Cassandra, MongoDB)
-
WALL-E 700
-
CTR
Rank = bid * CTR
CTR, CTR, etc.
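The ranking rule on this slide, Rank = bid * CTR, can be illustrated with a minimal sketch. The ad names, bids, and predicted CTR values below are invented for illustration only; in practice the CTR would come from a trained model.

```python
# Rank ads by bid * predicted CTR (expected revenue per impression).
# All ad names, bids, and CTR values here are hypothetical sample data.
ads = [
    {"ad": "A", "bid": 2.00, "ctr": 0.010},
    {"ad": "B", "bid": 0.50, "ctr": 0.050},
    {"ad": "C", "bid": 1.00, "ctr": 0.015},
]


def rank_score(ad):
    """Return the slide's ranking score: bid * CTR."""
    return ad["bid"] * ad["ctr"]


# Highest score first.
ranked = sorted(ads, key=rank_score, reverse=True)
order = [ad["ad"] for ad in ranked]
print(order)
```

Note that ad B outranks ad A despite the lower bid, because its predicted CTR is five times higher.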
-
Collaborative Filtering
-
Model
(Train)
Feature Matrix
Feature Vector
-
ID   Total$  Age  City  Target
101  200     25   SF
102  350     35   LA
103  25      15   LA
Feature Matrix Feature Engineering
Raw Transforms
Signal Processing
OCR
Geo-spatial
Normalize
Transform/aggregate
Sample
Dimensionality reduction
Feature Selection
NLP
Mutual Information
TB, PB
MB, GB
-
Shopper ID  TX ID  Apple  Banana  Honey  Milk  Bread
101         TX 1   4      5       1      1     0
102         TX 2   0      2       0      1     1
103         TX 3   0      0       0      0     2
101         TX 4   1      1       0      0     0

       Apple  Banana  Honey  Milk  Bread
Price  $2     $1      $5     $3    $4

Shopper ID  Age  City  Size of household
101         25   SF    4
102         35   LA    3
-
Shopper ID  # Tx  Total $  Age  City
101         10    $200     25   SF
102         15    $350     35   LA
103         2     $25      15   LA
            25    $5       15   NYC
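As a rough sketch of how a per-shopper feature matrix like this one could be derived, the following aggregates the four sample transactions, the price list, and the shopper attributes shown on the previous slide. The function name and data layout are assumptions made for illustration. Because only four sample transactions are used, the resulting counts and totals are much smaller than the figures in the table above.

```python
from collections import defaultdict

# Sample data copied from the transaction, price, and shopper tables.
items = ["Apple", "Banana", "Honey", "Milk", "Bread"]
price = {"Apple": 2, "Banana": 1, "Honey": 5, "Milk": 3, "Bread": 4}
transactions = [
    (101, "TX 1", [4, 5, 1, 1, 0]),
    (102, "TX 2", [0, 2, 0, 1, 1]),
    (103, "TX 3", [0, 0, 0, 0, 2]),
    (101, "TX 4", [1, 1, 0, 0, 0]),
]
shoppers = {101: {"age": 25, "city": "SF"}, 102: {"age": 35, "city": "LA"}}


def build_feature_matrix(transactions, shoppers):
    """Aggregate raw transactions into one feature row per shopper.

    Hypothetical helper: returns (shopper_id, n_tx, total_spend, age, city).
    """
    n_tx = defaultdict(int)
    total = defaultdict(int)
    for shopper_id, _tx_id, quantities in transactions:
        n_tx[shopper_id] += 1
        total[shopper_id] += sum(q * price[i] for q, i in zip(quantities, items))
    rows = []
    for shopper_id in sorted(n_tx):
        # Shoppers missing from the attribute table get None for age and city.
        demo = shoppers.get(shopper_id, {})
        rows.append((shopper_id, n_tx[shopper_id], total[shopper_id],
                     demo.get("age"), demo.get("city")))
    return rows


feature_matrix = build_feature_matrix(transactions, shoppers)
for row in feature_matrix:
    print(row)
```

On a cluster, the same aggregation would typically be expressed as a Hive or Pig group-by over the raw transaction data, leaving a much smaller matrix for modeling.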
-
ID   Total$  Age  City
101  $200    25   SF    2
102  $350    35   LA    2
103  $25     15   LA    1
-
10M rows x 100 features x 8 bytes (double) = ~7.5 GB
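The size estimate on this slide can be checked with a few lines of arithmetic. This sketch assumes the figures mean 10 million rows with 100 double-precision features each, stored densely with no overhead.

```python
# Back-of-the-envelope size of a dense feature matrix.
# Assumed reading of the slide: 10M rows, 100 features, 8-byte doubles.
rows = 10_000_000
features = 100
bytes_per_value = 8  # one IEEE 754 double

total_bytes = rows * features * bytes_per_value
gib = total_bytes / 2**30  # convert bytes to GiB

print(f"{total_bytes:,} bytes = {gib:.2f} GiB")
# prints: 8,000,000,000 bytes = 7.45 GiB
```

At roughly 7.5 GB, a matrix of this size can fit in memory on a single large node.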
-
Training set (70%), test set (30%)
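The 70%/30% figures on this slide most likely describe a random split of the data into a training set and a held-out test set. A minimal sketch of such a split, using only the standard library, might look like this. The function name, the fixed seed, and the toy data are assumptions for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper; the seed is fixed only to make the sketch repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten rows of a feature matrix
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on the training portion only, and the test portion is used to estimate how well it generalizes.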
-
Confusion matrix
                Actual: Yes      Actual: No
Predicted: Yes  True positives   False positives
Predicted: No   False negatives  True negatives
Precision = % of positive predictions that are correct
Recall = % of positive instances that were predicted as positive
F1 = a measure of a test's accuracy, combining precision and recall
Accuracy = % of correct classifications
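The four measures defined on this slide follow directly from the confusion matrix counts. The sketch below uses invented counts and assumes the denominators are non-zero. The function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion matrix counts.

    Assumes tp + fp, tp + fn, and precision + recall are all non-zero.
    """
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    recall = tp / (tp + fn)     # share of positive instances that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of correct results
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for 100 test instances.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 0.8 and recall is about 0.667, so F1 (about 0.727) falls between the two, while accuracy is 0.7.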
-
ALS
MySQL / HBase
Hadoop
-
Hadoop
-
YARN Data Lake 2013 YARN Hadoop
YARN Data Lake Data Lake Hadoop
-
?
6 9
Schema change
HDFS
?
3
Schema on read
-
Hadoop
OCR
NLP
Hadoop
-
Hadoop
Feature Engineering
Raw Transforms
Signal Processing
OCR
Geo-spatial
Normalize
Transform/aggregate
Sample
Dimensionality reduction
Feature Selection
NLP
Mutual Information
Frequent Itemset
Anomaly Detection
Clustering
Collaborative Filter
Regression
Classification
Supervised Learning
Unsupervised Learning
-
Hadoop
R, Python scikit-learn, or SAS
Mahout
Spark MLlib
-
Hadoop
R, Python scikit-learn, or SAS
Mahout, Spark MLlib
Hadoop: grid search
-
Hadoop
20M
PMML (e.g., Zementis, Pattern)
Python, R, Java
-
Hadoop
Distributed K-means: Spark MLlib and Mahout
Collaborative Filtering: Alternating Least Squares (ALS): Mahout, Spark MLlib
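To make the K-means entry concrete, here is a minimal single-machine sketch of the algorithm (Lloyd's iteration). It is not the Spark MLlib or Mahout API; those libraries distribute the same assignment and update steps across a cluster. The sample points, the first-k initialization, and the fixed iteration count are assumptions chosen to keep the sketch short and deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain K-means: returns (centroids, assignment).

    Simplifications: the first k points seed the centroids, and the loop
    runs a fixed number of iterations instead of testing for convergence.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional sample with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.5, 8.5)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```

In a distributed version, the assignment step is embarrassingly parallel over data partitions and the update step only needs per-cluster sums and counts, which is why the algorithm maps well onto Spark and Map-Reduce.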
-
Hadoop and R
R
RStudio, RCloud
Hadoop: RMR (map-reduce), RHDFS, RHive, RHBase, RODBC
RStudio, RCloud, RHadoop, RHive
YARN
R: high-memory node
-
Hadoop and Python
Python
Python UI: IPython
Hadoop: PyDoop (Python, HDFS, Hadoop Map-reduce)
Pig: Python UDFs
IPython, Pandas, scikit-learn, NumPy, SciPy, Matplotlib, PyDoop
Python, scikit-learn, Pandas
YARN
Python: high-memory node
-
Hadoop and Spark
Edge node: Spark (MLlib), Scala API, Java API, Python API
Spark, YARN
Spark MLlib, edge node
YARN
-
Hadoop
Hadoop
Hadoop, YARN
Hadoop
-
Thank you!
Yifeng Jiang, Solutions Engineer