Big Data Analysis Using R, 2014.7.10 ([email protected]) · Agenda: morning session, introduction, overview of big data analysis...
TRANSCRIPT
-
Big Data Analysis Using R (Overview)
2014.7.10
v.3
mailto:[email protected]
-
Agenda
Introduction / overview of big data analysis
Data analysis with R (1):
classification, clustering, association rules
Data analysis with R (2): Clickstream Profiling
Data analysis with R (3): regression, time series
Q & A
2
-
3
-
!
!
(Thought Experiment)
http://www.openwith.net/?page_id=766
4
-
5
-
Prelude:
-
Berlin on Tolstoy
Isaiah Berlin analyzed Tolstoy in a popular 1953 essay, "The Hedgehog and the Fox", built on a fragment from Archilochus.
What was Tolstoy?
He had many talents, like a fox,
but believed we should be hedgehogs!
-
Fox compared against Hedgehog
"The hedgehog knows one big thing...";
"The fox knows many little things...".
8
-
(1)
Plato vs. Aristotle
Michelangelo vs. Da Vinci
Marx, Churchill, Hitler ???
9
-
(2)
Laozi vs. Sima Qian
Mao vs. Zhou, Deng, ...
10
-
The Fox and the Hedgehog in a project life
vs.
hedgehog risk
Systems Thinking: A Foxy Approach
OODA: A fox dressed like a hedgehog
11
-
DevOps
:
++ ~
Cross-functional team,
Widely-shared metrics,
Automating repetitive tasks,
Post-mortem, Regular release,
+ +
-
13
-
I.
II. BI (Business Intelligence)
III. Hadoop
IV. Data Science
14
-
(/, /, /, )
: Excel
DBMS/DW
RDBMS/SQL, NoSQL, ...
ETL, CDC,
15
-
BI
BI
16
BI glossary:
BSC: Balanced Scorecard.
VBM: Value-based Management.
ABC: Activity Based Costing.
OLAP: On-line Analytical Processing.
ERP, CRM, SCM: operational systems feeding BI.
ETL: Extraction-Transformation-Loading.
DW: Data Warehouse (the repository).
BI Portal.
-
Hadoop
Tidal Wave: 3VC
Supercomputer vs. High-throughput computing
Two approaches:
distributed computing (grid computing)
massively parallel processing (MPP)
Scale-Up vs. Scale-Out
BI (Business Intelligence): DW/OLAP
17
-
Hadoop
Google!
Nutch/Lucene 2006
(Flat linearity)
18
1990s: Excite, AltaVista, Yahoo, ...
2000s: Google; PageRank,
GFS/MapReduce
2003~4: the Google Papers
2005: Hadoop (D. Cutting & M. Cafarella)
2006: to Apache
-
Google Papers (2003~2010)
Google File System: a distributed file system
MapReduce: to compute their search indices
Percolator: handling individual updates
Pregel: scalable graph computing
Dremel: online visualizations
19
-
Big Picture
20
-
Framework
21
-
Hadoop Ecosystems
22
-
Major Influencers
Open Source Tipping Point
Google
Before Google vs. After Google
-
Data Science , ,
OR (Operations Research)
////...
(Statistical Inference), (Parametric),
, , , /Expert System ()
Data Science : HPC + Google shock
, Graph , topology,
AI (ANN, SVM, )
, Semantic web
Fusion of Python or R?, DS + Cloud BDaaS,
24
-
R
25
-
I. Introduction to R
II. Getting started with R
III. R data structures
IV. R graphics
V. Statistics with R
26
-
I. Introduction to R
27
-
R
Derived from S, plus user-contributed packages.
A domain-specific language (DSL) for data analysis, available on Windows, Unix, MacOS.
Functional programming style:
vectorized operations instead of explicit loops;
scripted and interpreted, with OOP features.
Generic, polymorphic functions dispatch on the object.
I-1 28
-
II. Getting Started with R
29
-
R
R (Workspace)
(Assignment)
Batch
30
-
R
R CRAN (Comprehensive R Archive Network)
http://www.cran.r-project.org/
31
-
RStudio
GUI R
RStudio
R Commander
RStudio
32
-
(comment)
Everything after # on a line is a comment.
Getting help: help.start()   # open HTML help
help(seq)           # help on seq
?seq                # shortcut for help(seq)
RSiteSearch("lm")   # search help pages and mailing lists
History: history()  # shows the last 25 commands by default
savehistory(file="myfile")   # save history (default file ".Rhistory")
loadhistory(file="myfile")   # recall history
33
-
Try: rnorm(10)
mean(abs(rnorm(100))); hist(rnorm(10))
R ships with built-in datasets:
data( )             # list datasets in loaded packages
help(datasetname)   # describe a dataset
Session options: options()   # view or set options
getwd()             # print the working directory
dir.create("d:/Rtraining"); setwd("d:/Rtraining")   # note: use / instead of \ on Windows
getwd()
34
-
source( )
Run a script within the session: source("myfile.R")   # script files end in .R or .r
fit <- lm(mpg~wt, data=mtcars)
The fitted model is stored
in the object fit.
-
sink( ) redirects output to a file: sink("myfile", append=FALSE, split=FALSE)
sink()   # return output to the terminal
The append option chooses between overwrite and append; the split option sends output to both the file and the screen.
# example: direct output to a file (overwrite)
sink("c:/projects/output.txt")
# example: append, and keep showing output on screen
sink("myfile.txt", append=TRUE, split=TRUE)
For graphs, open a device such as pdf("mygraph.pdf")
36
-
(Package)
A package = a bundle of R functions, data, and documentation.
Install packages once per machine: install.packages(package)
Pick a CRAN mirror (e.g. Korea).
Load into the session (once per session): library(package)
Library
= the directory where packages live on disk.
library()   # list installed packages
search()    # list loaded packages
help(package=)   # package documentation
37
-
Customization
R reads Rprofile.site at startup; on MS Windows it lives in the C:\Program Files\R\R-n.n.n\etc directory.
Users can also keep a personal .Rprofile.
Rprofile.site can define two special functions:
.First( )   runs at the start of each R session
.Last( )    runs at the end of each R session
38
-
Batch
Run R non-interactively. MS Windows (one line):
C:/Program Files/R/R-3.0.2/R.exe CMD BATCH C:/Rtraining/a.R
Linux: R CMD BATCH [options] my_script.R [outfile]
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> q()
39
-
III. R Data Structures
40
-
R
(Operators)
(Merge)
apply()
41
-
(Assignment)
R assigns with <- or = : x <- 2; y <- 3
> print(x+y)
[1] 5
> x = pi
> x
[1] 3.141593
> rm(x)
42
-
# create a variable, e.g.:
age <- 25
-
Import from: csv
mydata <- read.csv("mydata.csv", header=TRUE)   # a typical csv import
-
R
Variable types:
Continuous (interval, ratio)
Ordinal
Nominal (categorical)
identifiers, dates
Factor
45
-
Modes: numeric,
character,
logical: TRUE, FALSE
(FALSE coerces to 0; nonzero coerces to TRUE),
complex (imaginary numbers),
Raw (bytes)
R data structures: Vector, matrix,
Array, Data frame,
List, Class
46
-
(dataset)
Build data vectors with c():
> Rev_2012 = c(110,105,120,140)   # e.g. quarterly revenue
> Rev_2013 = c(105,115,140,135)
> Revenue = cbind(Rev_2012, Rev_2013)   # bind as columns
> Revenue
Rev_2012 Rev_2013
[1,] 110 105
[2,] 105 115
[3,] 120 140
[4,] 140 135
>
47
-
R vectors hold elements of one mode (numerical, character, logical)
1-dimensional
R has no true scalar: a scalar is a single-element vector
a (string) is a single-element vector of mode character
Matrices
2-dimensional
Arrays
3 or more dimensions
data frames
columns may differ in mode
Lists
48
-
Type tests: is.numeric(), is.character(),
is.vector(), is.matrix(),
is.data.frame()
Type coercion with as.~():
as.numeric(): FALSE -> 0; "1","2" -> 1, 2
as.logical(): 0 -> FALSE
as.character(): 1, 2 -> "1","2"; FALSE -> "FALSE"
as.factor(): to factor
as.vector(): to vector
as.matrix(): to matrix
as.data.frame(): to data frame
49
-
Vector
All elements share a single mode.
> x = c(1,3,5,7)
> x
[1] 1 3 5 7
> family = c("father", "mother", "sister", "brother")   # original Korean names lost in extraction
> family
[1] "father" "mother" "sister" "brother"
> c(T,T,F,T)
[1] TRUE TRUE FALSE TRUE
50
-
Vector indexing: select elements with [ ]
> a <- c(1, 5.3, 6, -2, 4)
> a
[1] 1.0 5.3 6.0 -2.0 4.0
> a[c(2,4)]   # 2nd and 4th elements
[1] 5.3 -2.0
Creating vectors: seq() generates a sequence; rep() repeats values.
Vectorized operations apply element by element:
e.g. Vector In, Vector Out or Vector In, Matrix Out
51
-
Recycling: when two vectors differ in length, the shorter is reused: c(1,2) + c(5,6,7) gives 6 8 8 (with a warning)
Filtering:
> z <- c(5, -2, 3)
> w <- z[z > 0]
> w
[1] 5 3
subset() offers the same filtering by condition.
NA vs. NULL: NA marks a missing value;
NULL is the undefined value (it is skipped, not counted)
52
-
Vectors and matrices as linear algebra
R stores a matrix as a long vector.
Operators: %*% (matrix product), + (elementwise)
cbind() and rbind() combine by columns/rows
library(MASS) provides ginv() for generalized inverses
t() gives the transpose
Example: solving for x and y in a linear system (see the sketch below)
53
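A minimal sketch of these operations on an illustrative 2x2 system (the numbers are mine, not from the slides):
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # coefficient matrix
b <- c(1, 2)                           # right-hand side
x <- solve(A, b)                       # solve A x = b
A %*% x                                # check: reproduces b
t(A)                                   # transpose
library(MASS)
ginv(A)                                # generalized (Moore-Penrose) inverse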
-
Matrices
= rows x columns, every cell of one mode (numeric or character)
filled column by column unless byrow=TRUE
(nrow=, ncol=)
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
  dimnames=list(rownames_vec, colnames_vec))
-
cells
-
Apply a function over the rows or columns of a matrix with apply():
apply(m, dimcode, f, fargs)  where m = matrix,
dimcode = 1: rows, 2: columns,
f = function to apply, fargs = optional arguments
> m <- matrix(1:10, nrow=5)
> apply(m, 1, mean)   # row means
[1] 3.5 4.5 5.5 6.5 7.5
> # column means
> apply(m, 2, mean)
[1] 3 8
> # apply to every cell
> apply(m, 1:2, function(x) x/2)
56
-
Matrix indexing: x[row, column]
Grow a matrix with rbind(), cbind():
> B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
> C = matrix(c(7, 4, 2), nrow=3, ncol=1)
> cbind(B, C)
57
-
A matrix is a vector plus dimensions: vector + dim = matrix
> z <- matrix(1:8, nrow=4)
> z
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
> length(z)
[1] 8
> class(z)
[1] "matrix"
> attributes(z)
$dim
[1] 4 2
58
-
Array
Like a matrix but with more than 2 dimensions. Example: a 4 x 3 x 3 array holding 1~36
> x <- array(1:36, dim=c(4,3,3))
> x[1,,]
[,1] [,2] [,3]
[1,] 1 13 25
[2,] 5 17 29
[3,] 9 21 33
> x[,,1]
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
59
-
List
An ordered collection of (possibly unrelated) objects.
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3)   # x contains copies of n, s, b
x[2]          # a one-element sub-list
x[c(2, 4)]
Use [[ ]] to get a component itself: x[[3]]
60
-
List components can be named and accessed with $:
> z <- list(a=1, b="x", c=c(2,3))   # illustrative values
> z$c
Components can also be reached by position: > z[[4]]
lapply(list(2:5,35:39), median)    # returns a list
sapply(list(2:5, 35:39), median)   # simplifies to a vector/matrix
61
-
Data Frame
= a special case of list
Columns can differ in mode (numeric, character, factor, ...).
d <- c(1,2,3,4); e <- c("red","white","red",NA); f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d, e, f)
-
A data frame is a list of equal-length vectors;
each column is a vector.
Merging: combine 2 data frames with merge()   # merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
-
Factor
Stores a nominal or categorical variable as integer codes [1...k] with a label vector.
factor() creates one; the ordered() option marks ordered levels.
x <- factor(c("yes","no","yes","yes"))   # illustrative
-
Apply a function per factor group with tapply():
> tapply(ages, party, mean)   # mean of the ages vector within each party, e.g.:
57 30 34
split(x, f) divides x by the factor f:
> g <- c("a","b","a","a","b","b","b")   # illustrative
> split(1:7, g)
65
-
(contingency table)
A 2-way contingency table, e.g.:
> trial <- matrix(c(34, 11, 9, 32), ncol=2)
> colnames(trial) <- c("sick", "healthy")
> rownames(trial) <- c("risk", "no_risk")
> trial.table <- as.table(trial)
> trial.table
        sick healthy
risk      34       9
no_risk   11      32
66
-
Inspecting a dataset:
ls()                 # list objects in the workspace
names(mydata)        # variable names in mydata
str(mydata)          # structure of mydata
levels(mydata$v1)    # levels of factor v1 in mydata
dim(object)          # dimensions of object
class(object)        # class of object (numeric, matrix, data frame, ...)
mydata               # print mydata
head(mydata, n=10)   # first 10 rows of mydata
tail(mydata, n=5)    # last 5 rows of mydata
67
-
(Operators)
Binary operators work on vectors, matrices and scalars.
Arithmetic Operators:
+        addition
-        subtraction
*        multiplication
/        division
^ or **  exponentiation
x %% y   modulus (x mod y): 5%%2 is 1
x %/% y  integer division: 5%/%2 is 2
68
-
<    less than
>    greater than
<=   less than or equal to
>=   greater than or equal to
==   exactly equal to
!=   not equal to
!x   Not x
x | y      x OR y
x & y      x AND y
isTRUE(x)  test if x is TRUE
69
-
substr(x, start=n1, stop=n2)
  Extract or replace substrings in a character vector.
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
  Search for pattern in x. If fixed=FALSE, pattern is a regular expression; if fixed=TRUE, it is a literal string. Returns the matching indices: grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  Find pattern in x and replace it with replacement. fixed works as in grep. sub("\\s", ".", "Hello There") returns "Hello.There"
strsplit(x, split)
  Split the elements of x at split: strsplit("abc", "") returns a 3-element vector "a","b","c"
paste(..., sep="")
  Concatenate strings, separated by sep
toupper(x) / tolower(x)
  Upper/lower case
70
-
seq(from, to, by) generates a sequence: indices <- seq(1,10,2)   # 1 3 5 7 9
-
Group expressions with { }.
if-else:  if (cond) expr
          if (cond) expr1 else expr2
for:      for (var in seq) expr
while:    while (cond) expr
switch:   switch(expr, ...)
ifelse:   ifelse(test, yes, no)
A short worked example follows.
72
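As a sketch (values are illustrative):
x <- 7
if (x %% 2 == 0) print("even") else print("odd")
for (i in 1:3) print(i^2)                  # 1 4 9
i <- 1
while (i < 4) i <- i + 1                   # loop until i reaches 4
ifelse(c(-1, 2, -3) > 0, "pos", "neg")     # vectorized if-else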
-
order( )
Sorting is ASCENDING by default; prefix the sort variable with a minus sign for descending order.
# sorting example using the mtcars dataset
attach(mtcars)
# sort by mpg
newdata <- mtcars[order(mpg),]
-
IV. R Graphics
74
-
R
R
plot()
Plots
(Dot) Plots
(Bar) Plots
(Line Charts)
(Pie Charts)
(Boxplots)
Scatter Plots
75
-
Getting a feel for R graphics:
demo(graphics); > demo(persp)
plot(c(1,2,3), c(1,2,4))
Nile   # built-in time series of Nile flow
mean(Nile)
sd(Nile)
hist(Nile)
76
-
plot()
plot( ) draws an object as a graph.
Generic: it works on densities, data frames, ...
Usage: plot(x, y, arguments)
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
Common plot() options:
77
-
Option
type =   type="p" (points), type="l" (lines), type="b" (both), type="o" (overplotted), type="h" (histogram-like vertical lines), type="s" (steps)
xlim =
ylim =
   ranges of the x and y axes, e.g. xlim = c(1,10) or xlim = range(x)
xlab =
ylab =
   labels for the x and y axes
main =   main title.
sub =    subtitle.
bg =     background color
bty =    box type around the plot
78
-
pch and lty
Option
pch =   plotting symbol
lty =   line type; 1: solid line, 2: dashed, 3: dotted, 4: dot-dash
col =   color, e.g.: red, green, blue
mar =   margins c(bottom, left, top, right); default is c(5,4,4,2) + 0.1
asp =   aspect ratio (= y/x)
79
-
Example: par(mfrow = c(2,2))   # mfrow enables multiple plots per page
plot(x, y, type="b", main="cosine", sub="type = b")
plot(x, y, type="o", las=1, bty="u", sub="type = o")
plot(x, y, type="h", bty="7", sub="type = h")
plot(x, y, type="s", bty="n", sub="type = s")
80
-
abline()
abline(a, b)    # line with intercept a, slope b
abline(h=y)     # horizontal line at y
abline(v=x)     # vertical line at x
abline(lm.obj)  # regression line from lm.obj
: data(cars)
attach(cars)
par(mfrow=c(2,2))
plot(speed, dist, pch=1); abline(v=15.4)
plot(speed, dist, pch=2); abline(h=43)
plot(speed, dist, pch=3); abline(-14,3)
plot(speed, dist, pch=8); abline(v=15.4); abline(h=43)
81
-
Dot plots
Draw a dot plot with dotchart(x, labels=),
where x is a vector and labels= names the points.
The groups= option splits the dots by a factor of x:
dotchart(mtcars$mpg, labels = row.names(mtcars), cex=.7,
  main="Gas Mileage for Car Models", xlab="Miles Per Gallon")
82
-
# Dotplot: sorted (by mpg), grouped (by cylinder)
x <- mtcars[order(mtcars$mpg),]   # sort by mpg, then dotchart with groups=factor(x$cyl)
-
(Bar) Plots
barplot(height)
height is a vector or a matrix.
If height is a vector,
its values give the bar heights.
If height is a matrix and beside=FALSE,
each bar is a column, with stacked sub-bars.
If height is a matrix and beside=TRUE,
the column values are juxtaposed as grouped bars.
The names.arg= option supplies bar labels;
horiz=TRUE draws the barplot horizontally.
To plot a summary statistic (mean, median, sd, ...) per group,
aggregate the data with aggregate( ) and feed the result to barplot( ) - see the sketch below.
84
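As referenced above, a minimal aggregate-then-barplot sketch on the built-in mtcars data (the column choice is illustrative):
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # mean mpg per cylinder count
barplot(means$mpg, names.arg = means$cyl,
        xlab = "Cylinders", ylab = "Mean MPG")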
-
# Simple Bar Plot
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", xlab="Number of Gears")
-
Stacked Bar Plot:
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, legend = rownames(counts))
-
Grouped Bar Plot: as above, with beside=TRUE
barplot(counts, legend = rownames(counts), beside = TRUE)
-
(Line Charts)
Add lines with lines(x, y, type=),
where x and y are vectors.
type= values:
Type  Description
p     points
l     lines
o     overplotted points and lines
b, c  points joined by lines ("c" drops the points)
s, S  stair steps
h     histogram-like vertical lines
n     no plotting
88
-
lines( ) can only add to a plot created with plot(x, y).
Tip: plot( ) plots the (x,y) points; with the type="n" option it draws only the axes and titles, leaving the plotting to later lines( ) calls.
-
90
-
plot( ) type= options
x
-
pie(x, labels=)
x is a non-negative numeric vector giving the slice areas;
labels= is a character vector naming the slices. # Simple Pie Chart
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
-
Pie with percentage labels
# Pie Chart with %
slices <- c(10, 12, 4, 16, 8)
lbls <- paste(c("US","UK","Australia","Germany","France"),
              round(slices/sum(slices)*100), "%")
pie(slices, labels = lbls)
-
Box Plot
A box-and-whisker plot shows min, Q1, median, Q3 and max.
Boxplots can be drawn by group.
boxplot(x, data= )
x is a formula and data= names the data frame;
the formula (e.g. y~group) draws one box per level of group; horizontal=TRUE rotates the plot.
94
-
# MPG by number of cylinders
attach(mtcars)
boxplot(mpg~cyl, data=mtcars, main="Car Mileage Data",
  xlab="Number of Cylinders", ylab="Miles Per Gallon")
detach(mtcars)
95
-
(Scatterplots)
Show the relationship between two numeric variables.
Draw one with plot(x, y), where x and y are numeric vectors:
plot(wt, mpg, main="Scatterplot Example",
  xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
96
-
V. Statistics with R
97
-
Statistics with R
numeric functions, probability distributions,
descriptive statistics, correlations, regression,
Crosstabs
98
-
abs(x)       absolute value
sqrt(x)      square root
ceiling(x)   ceiling(3.475) is 4
floor(x)     floor(3.475) is 3
trunc(x)     trunc(5.99) is 5
round(x, digits=n)   round(3.475, digits=2) is 3.48
cos(x), sin(x), tan(x), acos(x), cosh(x), acosh(x), ...
log(x)       natural logarithm
log10(x)     common logarithm
exp(x)       e^x
factorial(x) factorial(5) is 120
99
-
Random samples and simulation build on R's distribution functions.
Each distribution has four functions (prefix d/p/q/r + name):
d: density
p: cumulative probability
q: quantile
r: random number generation
100
-
dnorm(x)   normal density (default m=0, sd=1)
pnorm(q)   cumulative probability (area under the normal curve to the left of q)
qnorm(p)   normal quantile: the value at percentile p
rnorm(n, m=0, sd=1)   n random normal deviates
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
binomial (size = number of trials, prob = success probability)
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)
Poisson (mean = variance = lambda)  # P(0, 1, or 2 events) when lambda=4: dpois(0:2, 4)
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
(uniform distribution)  # 10 uniform random variates:
x <- runif(10)
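A few quick d/p/q/r calls, as a sketch:
dnorm(0)                 # standard normal density at 0 (~0.399)
pnorm(1.96)              # ~0.975: probability below 1.96
qnorm(0.975)             # ~1.96: the 97.5th percentile
rnorm(5)                 # five standard normal draws
dpois(0:2, lambda = 4)   # P(0), P(1), P(2) events when lambda = 4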
-
102
-
These functions take na.rm= to skip missing values and apply to vectors (or objects treated as vectors).
103
-
mean(x, trim=0, na.rm=FALSE)
mean of object x
# trimmed mean: drop NAs and the top/bottom 5%
mx <- mean(x, trim=.05, na.rm=TRUE)
-
(Descriptive Statistics)
= summary statistics
sapply( )   # apply a statistic to every column of mydata, e.g.:
sapply(mydata, mean, na.rm=TRUE)
Functions usable in sapply include:
mean, sd, var, min, max, median, range, and quantile.
(Plus graphical summaries: histogram, density plot, ...)
summary(mydata)   # min, max, quartiles, mean per column
fivenum(x)        # Tukey five numbers: min, lower-hinge, median, upper-hinge, max
105
-
Histograms: hist(x)
x is the numeric vector to plot;
the freq=FALSE option plots densities, and breaks= controls the number of bins.
A histogram counts the values falling in each bin.
# simple histogram
hist(mtcars$mpg)
# colored, with 12 bins
hist(mtcars$mpg, breaks=12, col="red")
106
-
Density plot
(Kernel density) plots: plot(density(x)), where x is a numeric vector.
# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d)
-
Comparing kernel densities across groups: the sm package's sm.density.compare(x, factor),
where x is a vector and factor is the grouping variable, superimposes the kernel density plots of two or
more groups.
# MPG by cylinder count (cars with 4, 6, or 8 cylinders)
library(sm)
attach(mtcars)
# build a value-labeled factor (cyl is numeric with values 4, 6, 8)
cyl.f <- factor(cyl, levels=c(4,6,8),
  labels=c("4 cylinder","6 cylinder","8 cylinder"))
sm.density.compare(mpg, cyl.f, xlab="Miles Per Gallon")
-
109
-
(contingency table)
table( ) builds frequency tables;
prop.table( ) converts counts to proportions;
margin.table( ) computes marginal totals.
A 2-way contingency table crosses 2 factors; a sketch follows.
110
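A minimal sketch of these functions on mtcars (the variable pairing is illustrative):
mytable <- table(mtcars$cyl, mtcars$gear)   # cylinders as rows, gears as columns
mytable
margin.table(mytable, 1)   # row totals (cylinder frequencies)
prop.table(mytable, 2)     # column percentages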
-
cor( ) computes correlations,
cov( ) computes covariances.
Usage: cor(x, use=, method=)
Options: x   matrix or data frame
use      handling of missing data. Options: all.obs (no missing data allowed), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)
method   Options: pearson, spearman, kendall.
111
-
# correlations/covariances among mtcars columns, with listwise deletion
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
Use cor.test( ) to test a single correlation coefficient; neither cor( ) nor cov( ) produces tests of significance.
The Hmisc package's rcorr( ) computes pearson & spearman correlations/covariances with significance levels, using pairwise deletion on a matrix.
# correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson")   # type is pearson or spearman
rcorr(as.matrix(mtcars))   # mtcars must be passed as a matrix, not a data frame
112
-
cor(X, Y) or rcorr(X, Y) correlate the columns of X with the columns of Y.
# Correlation matrix from mtcars
# rows: mpg, cyl, disp
# columns: hp, drat, wt
x <- mtcars[c("mpg","cyl","disp")]
y <- mtcars[c("hp","drat","wt")]
cor(x, y)
-
Regression relates an explanatory variable (x) to a response variable (y).
Simple linear model with one predictor: Yi = β0 + β1·xi + εi
The coefficients are estimated by the least squares method.
Example: weight as a function of height,
data(women)
women   # height and weight of 15 American women
fit <- lm(weight ~ height, data=women)
summary(fit)
-
115
-
I.
II.
III. Taxonomy
IV.
V.
VI. Underfitting and Overfitting
VII. Data Exploration
116
-
Data Mining
Predictive Analysis
Data Analysis
Data Science
OLAP
BI
Analytics
Text Mining
SNA (Social Network Analysis)
Modeling
Prediction
Machine Learning
Statistical/Mathematical Analysis
KDD (Knowledge Discovery)
Decision Support System
Simulation
() (Data Analysis), (Data Mining)
117
-
Data Preparation
Data Exploration
Modeling ( )
Evaluation
Deployment
118
-
CRISP-DM
Cross-Industry Standard Process for Data Mining
119
-
120
-
121
-
122
-
123
-
Dataset
124
-
Taxonomy
125
-
(Univariate)
Table
Barplot
Pie chart
Dot chart
Factor
Stem-and-leaf plot
Strip chart
, ,
Variation: Variance, standard deviation, IQR
Histogram
Mode, Symmetry, Skew
Boxplot
126
-
2 (bivariate)
2-way Table (summarized/unsummarized)
Marginal distribution
2-way Contingency table
Boxplot
Densityplot
Strip chart
Q-Q (quantile-quantile) plot
Scatterplot
2 (correlation)
127
-
(multivariate)
R data frame list
Boxplot xtabs()
split() stack()
Lattice
128
-
Pearson
Spearman Rank
Kendall Rank
129
-
Functions
cor( ) produces correlations;
cov( ) produces covariances.
# e.g. for mtcars: cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
130
Options:
x      Matrix or data frame
use    handling of missing data. Options are: all.obs (assumes no missing data), complete.obs (listwise deletion), pairwise.complete.obs (pairwise deletion)
method Correlation type. Options are: pearson, spearman, kendall.
-
Tests of significance are not included in cor( ) or cov( ).
The Hmisc package's rcorr( ) produces correlations/covariances and significance
levels for pearson and spearman correlations. # Correlations with significance levels
library(Hmisc)
rcorr(x, type="pearson")   # pearson/spearman
rcorr(as.matrix(mtcars))
cor(X, Y) or rcorr(X, Y) --> correlations between the columns of X and the columns of Y.
# Correlation matrix from mtcars with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[c("mpg","cyl","disp")]
y <- mtcars[c("hp","drat","wt")]
cor(x, y)
-
132
-
(Machine Learning)
How do machines learn?
Abstraction and Knowledge Representation
133
-
(Supervised ML) We know the labels and the number of classes
(Unsupervised ML) We do not know the labels and may not know the
number of classes
134
-
135
-
Classification
136
-
Underfitting Overfitting
Underfitting
137
-
Overfitting
The model fits noise in the training data.
138
-
Data Exploration
A preliminary look at the data (summary statistics),
plus visualization, to understand its characteristics.
Examples: Box Plot, Histogram, PCA, charting (Pareto, MV, ...)
139
-
R (1)
140
-
I. Classification
II. Clustering
III. Association Rules
141
-
I. Classification
Concepts
Algorithms: decision trees, KNN
R coding: KNN
142
-
(Classification)
Given a training set, build a model that predicts the class of a record
as a function of the values of other attributes.
Algorithms:
KNN (K-Nearest Neighbors)
Naive Bayes
Decision Tree
Regression
SVM (Support Vector Machine)
143
-
144
-
Classification in marketing:
(Target Marketing)
Market Segmentation --> targeted campaigns
Fraud Detection
credit / insurance claims
telecom, retail, ...
(Attrition/Churn)
model for loyalty
(Sky survey)
classify sky objects as star or galaxy
145
-
= A flow-chart-like tree structure
Leaf node
= class label or class label distribution
146
-
Decision trees use heuristic recursive partitioning.
Starting from the root node,
the data is split on the feature most predictive of the target class, branching the tree;
the nodes are grown divide-and-conquer.
Example: predicting a movie's success in 3 categories: mainstream
hit / critics choice / box-office bust
Features: patterns in the movie script
A scatter plot of the
films' proposed shooting budget vs. the number of A-list celebrities for starring roles, labeled with the categories of success
147
-
148
-
149
-
C5.0 decision tree algorithm
Which feature gives the best split?
Purity is measured by entropy: entropy = 0: completely homogeneous;
entropy = 1: maximum disorder (two classes)
Example: a segment that is red (60%) and white (40%) has
entropy = -0.6*log2(0.6) - 0.4*log2(0.4), about 0.971
The curve() function can draw entropy across all possible two-class splits, as sketched below.
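A sketch of that curve() idea (axis labels are mine):
# entropy of a two-class split as the proportion x of one class varies
curve(-x * log2(x) - (1 - x) * log2(1 - x),
      col = "red", xlab = "proportion x", ylab = "entropy", lwd = 3)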
150
-
Which split point is best? IG (Information Gain) = the entropy before the split minus the
entropy after the split; that is, the reduction in entropy achieved by splitting (on a feature).
The more a feature's split improves homogeneity,
the higher its IG.
IF IG=0: no reduction in entropy.
ELSE IF IG is maximal: the entropy after the split is 0, i.e. the split yields completely homogeneous groups!
Pruning the decision tree
A decision tree can continue to grow indefinitely (becoming overly specific), so apply pre-pruning or post-pruning.
151
-
Best Split Node Impurity
152
-
Tree Construction (top-down)
All the training examples start at the root.
Tree Pruning: remove branches that reflect noise in the data.
Tree Induction uses a Greedy Strategy:
split the records based on an attribute test that optimizes a certain criterion.
Issues: determine how to split the records
How to specify the attribute test conditions?
How to determine the best split?
Determine when to stop splitting
153
-
Node impurity measures:
Information Gain (entropy-based)
used by ID3
Gain Ratio (IG / SplitInfo)
used by C4.5
Gini index (binary splits)
used by CART
154
-
Entropy measures the impurity of S:
Entropy(S) = -p·log2(p) - q·log2(q)
S: a set of examples
p: proportion of positive examples
q: proportion of negative examples
Gain(T,X)
= Entropy(T) - Entropy(T,X)
155
-
Countering overfitting: Prepruning
Halt tree construction early when the goodness measure falls below a threshold.
Postpruning
Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees,
then use data other than the training data to pick the "best pruned tree".
156
-
157
-
Information Gain Best Predictor?
158
-
Decision Tree Root node
159
-
Tree Rule
160
-
Decision Tree Regression
161
-
Entropy vs.
162
-
R
C5.0
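A minimal C5.0 sketch on the built-in iris data (the dataset and split are illustrative; assumes the C50 package is installed):
library(C50)
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]
model <- C5.0(Species ~ ., data = train)   # grow the tree
pred  <- predict(model, test)
table(pred, test$Species)                  # confusion matrix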
163
-
KNN (K-Nearest Neighbors)
KNN classifies unlabeled examples based on their similarity
with examples in the training set.
Given an unlabeled example xu, find the k closest labeled examples
in the training data set and assign xu to the class that appears most frequently within the k-subset.
k-NN only requires:
k, the number of neighbors to consider,
a set of labeled examples (training data),
and a measure of closeness.
The "model" is simply the stored training dataset;
unlabeled examples from the test dataset are classified on the fly.
164
-
The KNN algorithm treats the features as coordinates in a (2-/3-/4-dimensional) feature space.
165
-
166
-
k balances between overfitting and underfitting:
a large k dampens noisy data, but risks ignoring small-but-important patterns;
a small k lets noisy data (e.g. an accidentally mislabeled item) sway the result.
167
-
Choosing k: 1-Nearest Neighbor
3-Nearest Neighbor
Rescaling features for knn: Min-max normalization
Z-score standardization
-
Voronoi Diagram
Decision surface formed by the training examples
169
-
Rescaling
Min-max
Z-score 170
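Minimal sketches of both rescalings (the helper names are mine):
normalize   <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max to [0, 1]
standardize <- function(x) (x - mean(x)) / sd(x)              # z-score
normalize(c(10, 20, 30))   # 0.0 0.5 1.0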
-
R example (KNN): diagnosing breast cancer
Goal: classify tumor samples as benign/malignant
Step 1: collect the data
Step 2: explore / prepare the data
Step 3: train the model
Step 4: evaluate performance
Step 5: improve the model
10 numeric features per cell nucleus
(Radius, Texture, Perimeter, Area, Smoothness, ...)
171
-
R coding
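A minimal sketch of the kNN step, assuming a hypothetical wbcd data frame whose first column is the diagnosis label and whose remaining columns are numeric features (uses the class package and the normalize() helper sketched earlier):
library(class)
wbcd_n  <- as.data.frame(lapply(wbcd[-1], normalize))   # rescale the features
train_x <- wbcd_n[1:469, ];  test_x <- wbcd_n[470:569, ]
train_y <- wbcd[1:469, 1];   test_y <- wbcd[470:569, 1]
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 21)
table(pred, test_y)   # how often predictions match the held-out labels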
172
-
II.
Clustering
: K-Means
R : Clustering
173
-
Cluster
objects in the same class are similar to one another; objects in different classes are dissimilar.
Clustering
groups data into classes without predefined labels,
producing groups (partitions).
Because the groups carry no labels, clustering is unsupervised ML;
actionable insight requires attaching a meaningful label to each group!
174
-
Clustering considerations:
Scalability
attribute types (numeric / categorical / mixed)
cluster shapes
High dimensionality, noisy data, interpretability
Uses: understanding structure,
grouping documents, customers, patterns;
and preprocessing: taxonomy building, summarization, ...
175
-
Example: infer a researcher's specialty by examining their research publications
176
-
K Means
The k-means algorithm for clustering:
initial assignment phase
Pick k initial cluster centers; from this initial guess the algorithm converges to a locally optimal clustering.
For 3 clusters, k=3.
Update phase
Each center shifts to the centroid of its examples, examples are reassigned to the nearest center, and the two steps repeat.
177
-
178
-
Output: (i) the cluster assignment of each example;
(ii) the coordinates of the cluster centroids.
Choosing k (the number of clusters) balances overfitting vs. underfitting:
use a priori knowledge (a priori beliefs),
business requirements, a random choice, or the rule of thumb sqrt(n/2);
for a large dataset, look for the elbow point.
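A minimal k-means sketch on the built-in iris measurements (k = 3 is illustrative):
set.seed(42)                      # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                        # (ii) centroid coordinates
table(km$cluster, iris$Species)   # (i) assignments, compared with the known species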
179
-
Clustering methods:
Partitioning Method
Build k partitions, then iteratively relocate objects between them.
Density-based Method
Grow a cluster while its density stays above a threshold:
for each data point within a given cluster, a neighborhood of a given radius has to contain at least a minimum number of points.
180
-
Hierarchical Method: Agglomerative approach
bottom-up group merging, until a termination condition holds.
Divisive approach
top-down cluster splitting, until a termination condition holds.
181
-
III. Association Rules
Apriori
R coding
182
-
Association rules capture items that co-occur in transactions.
Example: Bread -> Milk [sup = 5%, conf = 100%]
Each transaction t is a set of items, t ⊆ I.
I = {i1, i2, ..., im}: the set of all items.
Transaction database T = {t1, t2, ..., tn}.
Example market basket transactions: t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
tn: {biscuit, eggs, milk}
183
-
Rule measures:
1. Support
Pr(A ∪ B): the share of all transactions containing both A and B.
Support = (transactions containing A and B) / (total transactions)
(Support is symmetric: 'A=>B' and 'B=>A' have equal support.)
2. Confidence
Pr(B|A): given A, how often B also appears.
= (transactions with A and B) / (transactions with A)
(Confidence of 'A=>B' and 'B=>A' generally differ.)
3. Lift (improvement)
= P(A and B) / (P(A) x P(B))
lift > 1: positive association; lift = 1: independence; lift < 1: negative association.
184
-
A rule A => B says that transactions containing A tend to contain B as well.
185
Over n transactions:
support(X => Y) = count(X ∪ Y) / n
confidence(X => Y) = count(X ∪ Y) / count(X)
-
Goal: find all rules that satisfy the user-specified minimum
support (minsup) and minimum confidence (minconf).
Features: Completeness: find all rules.
No target item(s) on the right-hand-side.
Mining with data on hard disk (not in memory).
The classic solution:
the Apriori Algorithm
186
-
Apriori
works in two steps: (Step 1) find every itemset meeting minimum support
(the frequent itemsets, a.k.a. large itemsets).
(Step 2) Use frequent itemsets to generate rules.
Example: frequent itemset {Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset: Clothes -> Milk, Chicken [sup = 3/7, conf = 3/3]
187
-
Step 1: Mining the frequent itemsets
= itemsets whose support is >= minsup
(frequent itemsets are a.k.a. large itemsets).
Key idea:
the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset
188
(Itemset lattice over A, B, C, D: singletons A..D, pairs AB..CD, triples ABC..BCD)
-
Step 2: rule generation. Frequent itemsets are not yet association rules;
one more step is needed to generate the rules:
For each frequent itemset X,
for each proper nonempty subset A of X,
let B = X - A;
A -> B is an association rule if
confidence(A -> B) >= minconf, where support(A -> B) = support(A ∪ B) = support(X) and confidence(A -> B) = support(A ∪ B) / support(A)
189
-
R coding
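A minimal association-rules sketch (assumes the arules package and its bundled Groceries dataset; the thresholds are illustrative):
library(arules)
data(Groceries)
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))   # top 5 rules by lift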
190
-
R (2) CLICKSTREAM PROFILING
191
-
Clickstream
Clickstream Data Warehousing by Mark Sweiger
Schema
-
Identify who you are from where you go
Click path
Web log
Page Tagging
: Google Analytics
Internet Traffic
Google Analytics, Yandex, Kontagent
Crowd-sourcing
-
Quiz: Which one is more frustrated?
-
(Path Analysis)
Choice model of Browsing
Text
Markov
-
Path Analysis
-
Page
User Session
-
Probability of Viewing a Page
Transition Matrix
-
Predicting Purchase Conversion
-
Profiling:
Data Vectorization!!!
-
Clustering, SVM,
-
Page Contents = HTML Code + Regular Text
-
Tokenization & Lexical Parsing
Strip the HTML code,
remove stop words,
count term frequency (TF).
Result: Document Vector
-
Classifying Document Vector
-
Markov Chain
Example: auto insurance risk classification.
Drivers are rated low risk or high risk for each 12-month period.
Transition behavior:
a high risker has a 60% chance of staying high risk
and a 40% chance of becoming low risk;
a low risker has a 15% chance of becoming high risk
and an 85% chance of staying low risk.
Task:
Set up a probability tree, transition diagram, and transition matrix for our process.
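A minimal sketch of the resulting transition matrix and a couple of steps of the chain:
P <- matrix(c(0.60, 0.40,
              0.15, 0.85),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("high","low"), c("high","low")))
state <- c(1, 0)      # start as a high-risk driver
state %*% P           # distribution after one period
state %*% P %*% P     # after two periods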
-
The transition matrix advances the process over time (steps, sequences, trials, etc.);
a probability tree lays out the same transitions, including those that loop back to the
same state.
The branches from each node are
mutually exclusive.
Task:
Find the probability of being in any given state many steps into the process.
-
Random guessing: 7%
Text Classification: 25%
+ Domain Model 41%
+ Browsing Model 78%
Source: http://www.andrew.cmu.edu/user/alm3/
http://www.andrew.cmu.edu/user/alm3/
-
Profiling
()
213
-
Case: churn analysis for a mobile app
-
215
-
Churn Rate
Most of the Apps Lose Half of their Peak Users within 3 Months
-
churn analysis
-
Business Objective: Reduce Customer Churn
Solution #1 .
Solution #2 .
Action Plans
Your app type (e.g. Gaming App, Social App)
List down the activities that users perform on your app
core feature
average life-time
-
Churn Criteria
Cut-off date
= app ~
Example: for app A, users inactive since the cut-off date of 2014-05-31, with a 40-day window
data points: app activities
app
app
core feature
-
1
2
3
(Preprocessing)
-
Variable (Feature)
-
Google Analytics R
Image source: Google Analytics Core Reporting API Dev Guide
-
app
-
This is a classification problem;
use Logistic Regression.
Each visitor carries a unique key (Visitor ID);
the predicted label is
1: Visitor will churn vs. 0: Visitor would not churn
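A minimal logistic-regression sketch for this setup (the visits data frames, the churned column, and the feature names are all hypothetical):
fit  <- glm(churned ~ sessions + days_active + core_feature_uses,
            data = visits_train, family = binomial)   # fit on the train set
prob <- predict(fit, newdata = visits_test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)     # classify at a 0.5 threshold
table(pred, visits_test$churned)     # confusion matrix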
-
Process
Randomly split the data into Train and Test sets.
Fit the model on the Train data-set;
evaluate on the Test data-set,
predicting labels for the Test data.
-
(Accuracy)
Confusion Matrix
Accuracy
= (No of Correctly Predicted Labels) / Total No of Labels
= (620 + 1024)/ (620 + 4 + 7 + 1024)
~ 99.34 %
-
User Segmentation
-
User types
-
(Market Segmentation using k-means)
-
Market Matching
Segmentation
dissecting the marketplace into submarkets that require different marketing mixes
Targeting
Process of reviewing market segments and deciding which one(s) to pursue
Positioning
Establishing a differentiating image for a product or service in relation to its competition
230
-
SNS 10
231
Segmentation bases:
Geographic
Demographic, Psychographic
Behavioral, Geodemographic
-
R coding
232
-
R (3)
233
-
What is a model?
Non-Math/Stats Model:
a representation of some phenomenon.
Math/Stats Model:
describes the relationship between variables;
(Deterministic) models (no randomness)
(Probabilistic) models (with randomness)
234
-
Deterministic Models (no randomness)
hypothesize exact relationships,
leaving no room for prediction error.
Example: Body mass index (BMI)
BMI = Weight in Kilograms / (Height in Meters)^2
Probabilistic models (with randomness) hypothesize 2 components:
a deterministic component
plus random error.
Example: systolic blood pressure (SBP)
SBP = 6 x age(d) + random error
(another random-error example: birthweight)
235
-
Probabilistic models
Regression Models
Correlation models
Other models
236
-
(Regression) relates explanatory variables to a response variable.
Use an equation to set up the relationship:
a numerical dependent (response) variable,
1 or more numerical or categorical independent (explanatory) variables.
1. Hypothesize the deterministic component;
estimate the unknown parameters.
2. Specify the random error term;
estimate the standard deviation of the error.
3. Evaluate the fitted model.
4. Use the model for prediction & estimation.
237
-
Specifying the deterministic component:
1. Define the variables.
2. Hypothesize the nature of the relationship:
expected effects (i.e., coefficients' signs),
functional form (linear or non-linear),
interactions.
Sources for hypotheses:
1. domain theory (e.g., epidemiology)
2. mathematical theory
3. previous research
4. common sense
238
-
Example: Which model is more logical?
239
(Figure: four candidate fits of CD4+ counts vs. years since seroconversion)
-
Types of Regression Models
240
Regression:
Simple: 1 explanatory variable.
Multiple: 2 or more explanatory variables.
-
Linear Equation
241
-
R coding
(Simple Linear Regression)
fit the model: lm()
coefficients: coef()
fitted values: fitted()
residuals: residuals()
confidence intervals: confint()
prediction: predict()
(predict.glm(), predict.lm(), predict.nls() per model class)
model summary: summary()
F statistic,
ANOVA
242
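A minimal sketch of those calls on the built-in cars data:
fit <- lm(dist ~ speed, data = cars)   # stopping distance vs. speed
coef(fit)                              # intercept and slope
confint(fit)                           # confidence intervals
head(fitted(fit))                      # fitted values
head(residuals(fit))                   # residuals
predict(fit, newdata = data.frame(speed = 21))
summary(fit)                           # F statistic, R-squared, etc.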
-
(Multiple Linear Regression)
To include an n-th power term, wrap it in I().
Watch for outliers.
243
-
R represents time series with the class ts:
frequency=7: a weekly series
frequency=12: a monthly series
frequency=4: a quarterly series
-
Time Series Decomposition
into 4 components: Trend component: long term trend
Seasonal component: seasonal variation
Cyclical component: repeated but non-periodic fluctuations
Irregular component: the residuals
Example with AirPassengers: plot(AirPassengers)
apts <- ts(AirPassengers, frequency = 12)
f <- decompose(apts); plot(f)
-
Popular models
Autoregressive moving average (ARMA)
Autoregressive integrated moving average (ARIMA)
# build an ARIMA model (illustrative order)
fit <- arima(AirPassengers, order = c(1,0,0),
             seasonal = list(order = c(0,1,1), period = 12))
fore <- predict(fit, n.ahead = 24)   # forecast 24 months ahead
-
247
-
R and big data
Scaling R / parallel R
R with MapReduce
R visualization
248
-
R and data size
~ 1 million records: plain R; 1M ~ 1 billion: R with tuning; >= 1 billion records: MapReduce
(Example: hierarchical clustering of 10K records already implies ~50M pairwise distances.)
When data outgrows R's in-memory objects: sampling; H/W upgrade (64-bit, up to 8TB RAM); recoding hot spots for the interpreter in C/C++; R --> Parallel R
249
-
R with MapReduce: Hadoop Streaming
An R script can serve as the (map) step of a Hadoop job;
joins, grouping and sorting happen in the reduce phase. $ export HADOOP_HOME=/usr/lib/hadoop
$ ${HADOOP_HOME}/bin/hadoop fs -rmr output
$ ${HADOOP_HOME}/bin/hadoop fs -put test-data/stocks.txt stocks.txt
$ ${HADOOP_HOME}/bin/hadoop \
    jar ${HADOOP_HOME}/contrib/streaming/*.jar \
    -D mapreduce.job.reduces=0 \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -input stocks.txt \
    -output output \
    -mapper `pwd`/src/main/test/stock_day_avg.R \
    -file `pwd`/src/main/test/stock_day_avg.R 250
-
The R map and reduce scripts can be tested without Hadoop
using an ordinary shell pipeline:
$ cat test-data/stocks.txt | src/main/test/stock_day_avg.R | \
    sort --key 1,1 | src/main/test/stock_cma.R
If the pipeline works, the Hadoop job should as well.
251
-
R
Visualization
252
-
R visualization (web)
Shiny: an R web application framework
for turning R analyses into interactive applications.
http://shiny.rstudio.com/ (gallery: http://shiny.rstudio.com/gallery/ )
Free and professional (paid) editions.
253
-
Wrap-up!!
Data as a Strategic Value
Data Science
254
-
255