商業智慧實務 (Practices of Business Intelligence)
TRANSCRIPT
1032BI05 MI4
Wed, 9,10 (16:10-18:00) (B130)
商業智慧的資料探勘 (Data Mining for Business Intelligence)
Min-Yuh Day 戴敏育
專任助理教授 (Assistant Professor)
Dept. of Information Management, Tamkang University (淡江大學 資訊管理學系)
http://mail.tku.edu.tw/myday/
2015-03-25
課程大綱 (Syllabus)

週次 (Week)  日期 (Date)   內容 (Subject/Topics)
1   2015/02/25  商業智慧導論 (Introduction to Business Intelligence)
2   2015/03/04  管理決策支援系統與商業智慧 (Management Decision Support System and Business Intelligence)
3   2015/03/11  企業績效管理 (Business Performance Management)
4   2015/03/18  資料倉儲 (Data Warehousing)
5   2015/03/25  商業智慧的資料探勘 (Data Mining for Business Intelligence)
6   2015/04/01  教學行政觀摩日 (Off-campus study)
7   2015/04/08  商業智慧的資料探勘 (Data Mining for Business Intelligence)
8   2015/04/15  資料科學與巨量資料分析 (Data Science and Big Data Analytics)
9   2015/04/22  期中報告 (Midterm Project Presentation)
10  2015/04/29  期中考試週 (Midterm Exam)
11  2015/05/06  文字探勘與網路探勘 (Text and Web Mining)
12  2015/05/13  意見探勘與情感分析 (Opinion Mining and Sentiment Analysis)
13  2015/05/20  社會網路分析 (Social Network Analysis)
14  2015/05/27  期末報告 (Final Project Presentation)
15  2015/06/03  畢業考試週 (Final Exam)
Business Intelligence: Data Mining, Data Warehouses

[Pyramid: increasing potential to support business decisions, from bottom to top, with the typical user of each layer]
• Decision Making – End User
• Data Presentation: Visualization Techniques – Business Analyst
• Data Mining: Information Discovery – Data Analyst
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses – DBA
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Source: Han & Kamber (2006)
Data Mining at the Intersection of Many Disciplines
Sta
tistic
s
Management Science &
Information Systems
Artificia
l Inte
lligence
Databases
Pattern
Recognition
Machine
Learning
Mathematical
Modeling
DATA
MINING
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 5
A Taxonomy for Data Mining Tasks

Data Mining
• Prediction
  – Classification
  – Regression
• Clustering
  – Outlier analysis
• Association
  – Link analysis
  – Sequence analysis

Task               Learning Method   Popular Algorithms
Prediction         Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification     Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression         Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Clustering         Unsupervised      K-means, ANN/SOM
Outlier analysis   Unsupervised      K-means, Expectation Maximization (EM)
Association        Unsupervised      Apriori, OneR, ZeroR, Eclat
Link analysis      Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis  Unsupervised      Apriori Algorithm, FP-Growth technique

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
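The Apriori algorithm that recurs in the association rows above rests on one observation: an itemset can be frequent only if all of its subsets are frequent. A minimal pure-Python sketch of that candidate-pruning idea, limited to 1- and 2-itemsets (the `baskets` data and the 0.5 support threshold are made up for illustration):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Frequent 1- and 2-itemsets, using the Apriori pruning property:
    a pair can only be frequent if both of its items are frequent."""
    n = len(transactions)
    item_counts = {}
    for t in transactions:                      # pass 1: count single items
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
    frequent = {frozenset([i]): c / n
                for i, c in item_counts.items() if c / n >= min_support}
    candidates = {i for s in frequent for i in s}
    pair_counts = {}
    for t in transactions:                      # pass 2: count candidate pairs only
        for a, b in combinations(sorted(set(t) & candidates), 2):
            pair = frozenset([a, b])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    frequent.update({p: c / n
                     for p, c in pair_counts.items() if c / n >= min_support})
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "eggs"}]
freq = frequent_itemsets(baskets, min_support=0.5)
```

Here {milk, bread} and {milk, eggs} each appear in 2 of 4 baskets and survive the threshold, while {bread, eggs} (1 of 4) is pruned.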
Data Mining Software

• Commercial
  – SPSS - PASW (formerly Clementine)
  – SAS - Enterprise Miner
  – IBM - Intelligent Miner
  – StatSoft - Statistical Data Miner
  – … many more
• Free and/or Open Source
  – Weka
  – RapidMiner
  – …

[Bar chart of data mining tools used in the KDnuggets May 2009 poll, including: Thinkanalytics, Miner3D, Clario Analytics, Viscovery, Megaputer, Insightful Miner/S-Plus (now TIBCO), Bayesia, C4.5/C5.0/See5, Angoss, Orange, Salford CART/Mars/other, Statsoft Statistica, Oracle DM, Zementis, other free tools, Microsoft SQL Server, KNIME, other commercial tools, MATLAB, KXEN, Weka (now Pentaho), your own code, R, Microsoft Excel, SAS / SAS Enterprise Miner, RapidMiner, and SPSS PASW Modeler (formerly Clementine)]

Source: KDNuggets.com, May 2009
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Data Mining Process

• A manifestation of best practices
• A systematic way to conduct DM projects
• Different groups have different versions
• Most common standard processes:
  – CRISP-DM (Cross-Industry Standard Process for Data Mining)
  – SEMMA (Sample, Explore, Modify, Model, and Assess)
  – KDD (Knowledge Discovery in Databases)

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Data Mining Process: CRISP-DM

[Cyclic diagram over the Data Sources:
1 Business Understanding → 2 Data Understanding → 3 Data Preparation →
4 Model Building → 5 Testing and Evaluation → 6 Deployment]

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Data Mining Process: CRISP-DM

Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!) – accounts for ~85% of total project time
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment

• The process is highly repetitive and experimental (DM: art versus science?)

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Data Preparation – A Critical DM Task

[Pipeline: Real-world Data → Well-formed Data]
• Data Consolidation: collect data, select data, integrate data
• Data Cleaning: impute missing values, reduce noise in data, eliminate inconsistencies
• Data Transformation: normalize data, discretize/aggregate data, construct new attributes
• Data Reduction: reduce number of variables, reduce number of cases, balance skewed data

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
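Two of the cleaning/transformation steps listed above, mean imputation and normalization, can be sketched in a few lines of plain Python (a minimal sketch for a single numeric variable, with `None` marking a missing value; the ages are made up):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale a variable to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, None, 35, 40]          # one missing value
clean = impute_mean(ages)          # the gap becomes (25 + 35 + 40) / 3
scaled = min_max_normalize(clean)  # 25 maps to 0.0, 40 maps to 1.0
```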
Data Mining Process: SEMMA

• Sample: generate a representative sample of the data
• Explore: visualization and basic description of the data
• Modify: select variables, transform variable representations
• Model: use a variety of statistical and machine learning models
• Assess: evaluate the accuracy and usefulness of the models

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Evaluation (Accuracy of Classification Model)

Estimation Methodologies for Classification

• Simple split (or holdout or test sample estimation)
  – Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
  – For ANN, the data is split into three sub-sets: training (~60%), validation (~20%), testing (~20%)

[Diagram: Preprocessed Data → 2/3 Training Data → Model Development → Classifier;
1/3 Testing Data → Model Assessment (scoring) → Prediction Accuracy]

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
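The ~70/30 holdout split described above can be sketched in plain Python (the `seed` parameter is an assumption added here so the split is reproducible):

```python
import random

def simple_split(data, train_fraction=0.7, seed=42):
    """Randomly split data into mutually exclusive training and testing sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))
train, test = simple_split(records, train_fraction=0.7)
```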
Estimation Methodologies for Classification

• k-Fold Cross Validation (rotation estimation)
  – Split the data into k mutually exclusive subsets
  – Use each subset in turn for testing while using the rest of the subsets for training
  – Repeat the experiment k times
  – Aggregate the test results for a true estimate of prediction accuracy
• Other estimation methodologies
  – Leave-one-out, bootstrapping, jackknifing
  – Area under the ROC curve

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
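The k-fold procedure above can be sketched as an index generator; each of the k subsets serves exactly once as the test set:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

splits = list(k_fold_splits(10, k=5))
```

In practice the data would be shuffled first; libraries such as scikit-learn provide the same mechanism as `KFold`.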
Assessment Methods for Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 16
Accuracy / Precision
Validity / Reliability

Accuracy vs. Precision

[Four target diagrams A, B, C, D illustrating the 2x2 combinations of accuracy and precision, with the analogous validity/reliability labels]
• A: High Accuracy, High Precision (High Validity, High Reliability)
• B: High Accuracy, Low Precision (High Validity, Low Reliability)
• C: Low Accuracy, High Precision (Low Validity, High Reliability)
• D: Low Accuracy, Low Precision (Low Validity, Low Reliability)
Accuracy of Classification Models

• In classification problems, the primary source for accuracy estimation is the confusion matrix

                              True Class
                              Positive              Negative
Predicted   Positive          True Positive (TP)    False Positive (FP)
Class       Negative          False Negative (FN)   True Negative (TN)

True Positive Rate = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
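The rates above can be computed directly from the four confusion-matrix counts. A minimal sketch (the sample counts match the classifier-A example worked through on a later slide):

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard rates derived from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),                        # True Positive Rate (= recall)
        "tnr": tn / (tn + fp),                        # True Negative Rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
    }

m = classification_metrics(tp=63, fn=37, fp=28, tn=72)
```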
Estimation Methodologies for Classification – ROC Curve

[ROC chart: True Positive Rate (Sensitivity) on the y-axis vs. False Positive Rate (1 - Specificity) on the x-axis, comparing three classifiers A, B, and C]

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Sensitivity = True Positive Rate
Specificity = True Negative Rate

                                   True Class (actual value)
                                   Positive              Negative              total
Predicted Class       Positive     True Positive (TP)    False Positive (FP)   P'
(prediction outcome)  Negative     False Negative (FN)   True Negative (TN)    N'
                      total        P                     N

True Positive Rate (Sensitivity) = TP / (TP + FN)
True Negative Rate (Specificity) = TN / (TN + FP)
False Positive Rate (1 - Specificity) = FP / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Sensitivity
= True Positive Rate
= Recall
= Hit rate
= TP / (TP + FN)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Specificity
= True Negative Rate
= TN / N
= TN / (TN + FP)

False Positive Rate = FP / (FP + TN) = 1 - Specificity

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Precision
= Positive Predictive Value (PPV)
= TP / (TP + FP)

Recall
= True Positive Rate (TPR)
= Sensitivity
= Hit Rate
= TP / (TP + FN)

F1 score (F-score, F-measure) is the harmonic mean of precision and recall:
F1 = 2 * precision * recall / (precision + recall)
   = 2TP / (P + P')
   = 2TP / (2TP + FP + FN)

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Classifier A

              Actual
              Positive   Negative
Predicted +   63 (TP)    28 (FP)    P' = 91
Predicted -   37 (FN)    72 (TN)    N' = 109
total         P = 100    N = 100    200

TPR = 63 / 100 = 0.63
FPR = 28 / 100 = 0.28
PPV (Precision) = 63 / (63 + 28) = 63 / 91 = 0.69
F1 = 2 * (0.63 * 0.69) / (0.63 + 0.69) = (2 * 63) / (100 + 91) = 126 / 191 = 0.66
ACC = (63 + 72) / 200 = 135 / 200 = 0.675 ≈ 0.68

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Classifier B

              Actual
              Positive   Negative
Predicted +   77 (TP)    77 (FP)    P' = 154
Predicted -   23 (FN)    23 (TN)    N' = 46
total         P = 100    N = 100    200

TPR = 0.77, FPR = 0.77, PPV = 0.50, F1 = 0.61, ACC = 0.50
Classifier C'

              Actual
              Positive   Negative
Predicted +   76 (TP)    12 (FP)    P' = 88
Predicted -   24 (FN)    88 (TN)    N' = 112
total         P = 100    N = 100    200

TPR = 0.76, FPR = 0.12, PPV = 0.86, F1 = 0.81, ACC = 0.82

Classifier C (the predictions of C', inverted)

              Actual
              Positive   Negative
Predicted +   24 (TP)    88 (FP)    P' = 112
Predicted -   76 (FN)    12 (TN)    N' = 88
total         P = 100    N = 100    200

TPR = 0.24, FPR = 0.88, PPV = 0.21, F1 = 0.22, ACC = 0.18

Source: http://en.wikipedia.org/wiki/Receiver_operating_characteristic
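The F1 comparison across the four classifiers above can be verified from the raw counts alone, since F1 = 2TP / (2TP + FP + FN):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Confusion-matrix counts (TP, FP, FN) for the four classifiers on the slides.
classifiers = {
    "A":  (63, 28, 37),
    "B":  (77, 77, 23),
    "C'": (76, 12, 24),
    "C":  (24, 88, 76),
}
scores = {name: f1_score(*counts) for name, counts in classifiers.items()}
```

Computed from counts the values agree with the slides to two decimals (e.g. A: 126/191 ≈ 0.66), and C' comes out best, matching its position on the ROC chart.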
Cluster Analysis

• Used for automatic identification of natural groupings of things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns new instances
• There is no output variable
• Also known as segmentation

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Cluster Analysis
33
Clustering of a set of objects based on the k-means method.
(The mean of each cluster is marked by a “+”.)
Source: Han & Kamber (2006)
Cluster Analysis
• Clustering results may be used to
– Identify natural groupings of customers
– Identify rules for assigning new cases to classes for targeting/diagnostic purposes
– Provide characterization, definition, labeling of populations
– Decrease the size and complexity of problems for other data mining methods
– Identify outliers in a specific domain (e.g., rare-event detection)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 34
Example of Cluster Analysis

[Scatter plot of ten points on a 10 x 10 grid]

Point  P  P(x,y)
p01    a  (3, 4)
p02    b  (3, 6)
p03    c  (3, 8)
p04    d  (4, 5)
p05    e  (4, 7)
p06    f  (5, 1)
p07    g  (5, 5)
p08    h  (7, 3)
p09    i  (7, 5)
p10    j  (8, 5)
Cluster Analysis for Data Mining

• How many clusters?
  – There is no "truly optimal" way to calculate it
  – Heuristics are often used:
    1. Look at the sparseness of clusters
    2. Number of clusters k = (n/2)^(1/2) (n: number of data points)
    3. Use the Akaike information criterion (AIC)
    4. Use the Bayesian information criterion (BIC)
• Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
  – Euclidean versus Manhattan (rectilinear) distance

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
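Heuristic 2 above is a one-line rule of thumb, sketched here for concreteness:

```python
import math

def heuristic_num_clusters(n_points):
    """Rule-of-thumb number of clusters: k = (n / 2) ** 0.5."""
    return round(math.sqrt(n_points / 2))

k = heuristic_num_clusters(200)   # sqrt(200 / 2) = sqrt(100) = 10
```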
k-Means Clustering Algorithm

• k: pre-determined number of clusters
• Algorithm (Step 0: determine the value of k)
  Step 1: Randomly generate k points as initial cluster centers
  Step 2: Assign each point to the nearest cluster center
  Step 3: Re-compute the new cluster centers
  Repetition step: Repeat Steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
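The steps above can be sketched in plain Python; this is a teaching sketch, not a production implementation (`math.dist` supplies the Euclidean distance):

```python
import math

def k_means(points, centers, max_iter=100):
    """Plain k-means: assign each point to the nearest center, recompute
    the cluster means, and repeat until the assignments stabilize."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:                                   # Step 2: assignment
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new_centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
                       for cl, c in zip(clusters, centers)]
        if new_centers == centers:                         # assignments stable
            break
        centers = new_centers                              # Step 3: update means
    return centers, clusters

# Two obvious groups of three points each (synthetic data for illustration).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centers, final_clusters = k_means(data, centers=[(0, 0), (10, 10)])
```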
Cluster Analysis for Data Mining - k-Means Clustering Algorithm

[Diagram: Steps 1-3 of the k-means algorithm]

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• A popular one is the Minkowski distance:

  d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q)

  where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

  d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

Source: Han & Kamber (2006)
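The Minkowski distance can be written directly from the formula above; setting q = 1 or q = 2 recovers the Manhattan and Euclidean special cases:

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points (q >= 1)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

manhattan = minkowski((1, 2), (3, 5), q=1)   # |3-1| + |5-2| = 5
euclidean = minkowski((1, 2), (3, 5), q=2)   # (4 + 9) ** 0.5
```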
Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:

  d(i, j) = (|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)^(1/2)

  – Properties
    • d(i, j) >= 0
    • d(i, i) = 0
    • d(i, j) = d(j, i)
    • d(i, j) <= d(i, k) + d(k, j)
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures

Source: Han & Kamber (2006)
Euclidean distance vs. Manhattan distance

• Distance between the two points x1 = (1, 2) and x2 = (3, 5)

[Plot: the straight-line (Euclidean) path of length 3.61 and the rectilinear (Manhattan) path of legs 2 and 3 between the two points]

Euclidean distance:
= ((3-1)^2 + (5-2)^2)^(1/2)
= (2^2 + 3^2)^(1/2)
= (4 + 9)^(1/2)
= (13)^(1/2)
= 3.61

Manhattan distance:
= (3-1) + (5-2)
= 2 + 3
= 5
The K-Means Clustering Method

• Example (K = 2)

[Sequence of five scatter plots: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again, until the assignments stop changing]

Source: Han & Kamber (2006)
K-Means Clustering Step by Step

Point  P  P(x,y)
p01    a  (3, 4)
p02    b  (3, 6)
p03    c  (3, 8)
p04    d  (4, 5)
p05    e  (4, 7)
p06    f  (5, 1)
p07    g  (5, 5)
p08    h  (7, 3)
p09    i  (7, 5)
p10    j  (8, 5)

Step 1: K = 2; arbitrarily choose K objects as the initial cluster centers:
Initial m1 = (3, 4)
Initial m2 = (8, 5)
K-Means Clustering

Step 2: Compute seed points as the centroids of the clusters of the current partition
Step 3: Assign each object to the most similar center

m1 = (3, 4), m2 = (8, 5)

Point  P  P(x,y)   m1 distance  m2 distance  Cluster
p01    a  (3, 4)   0.00         5.10         Cluster1
p02    b  (3, 6)   2.00         5.10         Cluster1
p03    c  (3, 8)   4.00         5.83         Cluster1
p04    d  (4, 5)   1.41         4.00         Cluster1
p05    e  (4, 7)   3.16         4.47         Cluster1
p06    f  (5, 1)   3.61         5.00         Cluster1
p07    g  (5, 5)   2.24         3.00         Cluster1
p08    h  (7, 3)   4.12         2.24         Cluster2
p09    i  (7, 5)   4.12         1.00         Cluster2
p10    j  (8, 5)   5.10         0.00         Cluster2
K-Means Clustering (distance computation, shown for point b)

Euclidean distance from b(3, 6) to m2(8, 5):
= ((8-3)^2 + (5-6)^2)^(1/2)
= (5^2 + (-1)^2)^(1/2)
= (25 + 1)^(1/2)
= (26)^(1/2)
= 5.10

Euclidean distance from b(3, 6) to m1(3, 4):
= ((3-3)^2 + (4-6)^2)^(1/2)
= (0^2 + (-2)^2)^(1/2)
= (0 + 4)^(1/2)
= (4)^(1/2)
= 2.00
K-Means Clustering

Step 4: Update the cluster means; repeat Steps 2 and 3, stopping when there are no more new assignments

m1 = (3.86, 5.14), m2 = (7.33, 4.33)

Point  P  P(x,y)   m1 distance  m2 distance  Cluster
p01    a  (3, 4)   1.43         4.34         Cluster1
p02    b  (3, 6)   1.22         4.64         Cluster1
p03    c  (3, 8)   2.99         5.68         Cluster1
p04    d  (4, 5)   0.20         3.40         Cluster1
p05    e  (4, 7)   1.87         4.27         Cluster1
p06    f  (5, 1)   4.29         4.06         Cluster2
p07    g  (5, 5)   1.15         2.42         Cluster1
p08    h  (7, 3)   3.80         1.37         Cluster2
p09    i  (7, 5)   3.14         0.75         Cluster2
p10    j  (8, 5)   4.14         0.95         Cluster2
K-Means Clustering

Step 4: Update the cluster means; repeat Steps 2 and 3, stopping when there are no more new assignments

m1 = (3.67, 5.83), m2 = (6.75, 3.50)

Point  P  P(x,y)   m1 distance  m2 distance  Cluster
p01    a  (3, 4)   1.95         3.78         Cluster1
p02    b  (3, 6)   0.69         4.51         Cluster1
p03    c  (3, 8)   2.27         5.86         Cluster1
p04    d  (4, 5)   0.89         3.13         Cluster1
p05    e  (4, 7)   1.22         4.45         Cluster1
p06    f  (5, 1)   5.01         3.05         Cluster2
p07    g  (5, 5)   1.57         2.30         Cluster1
p08    h  (7, 3)   4.37         0.56         Cluster2
p09    i  (7, 5)   3.43         1.52         Cluster2
p10    j  (8, 5)   4.41         1.95         Cluster2
K-Means Clustering (K = 2, two clusters): final result

The assignments no longer change, so the algorithm stops with:
m1 = (3.67, 5.83) for Cluster1 = {a, b, c, d, e, g}
m2 = (6.75, 3.50) for Cluster2 = {f, h, i, j}
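The ten-point walkthrough above can be reproduced in a few lines of plain Python; starting from the same initial centers m1 = (3, 4) and m2 = (8, 5), the loop converges to the same final means:

```python
import math

# The ten points and the initial centers from the walkthrough above.
points = [(3, 4), (3, 6), (3, 8), (4, 5), (4, 7),
          (5, 1), (5, 5), (7, 3), (7, 5), (8, 5)]
centers = [(3, 4), (8, 5)]

while True:
    # Steps 2-3: assign each point to the nearest center (Euclidean distance).
    clusters = [[], []]
    for p in points:
        nearest = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
        clusters[nearest].append(p)
    # Step 4: recompute the cluster means.
    new_centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) for cl in clusters]
    if new_centers == centers:        # assignments stable -> stop
        break
    centers = new_centers
```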
個案分析與實作一 (SAS EM 分群分析): Case Study 1 (Cluster Analysis – K-Means using SAS EM)
Banking Segmentation (行銷客戶分群 / marketing customer segmentation)

Case scenario
• The marketing department of ABC Bank wants to segment its customers by their usage behavior, in order to understand how current customers interact with the bank and to design suitable marketing contact models for each segment.
• The bank drew a sample of 100,000 active customers (those with transactions in the past three months) and counted each customer's transactions through four channels:
  – Traditional over-the-counter (teller) transactions (TBM)
  – Automated teller machine transactions (ATM)
  – Banking specialist services (POS)
  – Phone customer service (CSC)

Data fields
• Dataset name: profile.sas7bdat

Analysis objective
• Segment customers based on their activity in the transaction channels TBM, ATM, POS, and CSC.

Hands-on practice highlights:
• Handling extreme values (outliers)
• Selecting clustering variables
• Creating derived variables
• Tuning clustering parameters and interpreting the clustering results

Source: SAS Enterprise Miner Course Notes, 2014, SAS
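Before clustering channel counts such as TBM/ATM/POS/CSC, the variables are typically standardized so that no single channel dominates the distance measure. A minimal pure-Python sketch of z-score standardization, using synthetic counts (not the real profile.sas7bdat data):

```python
def z_scores(values):
    """Standardize a variable to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical ATM transaction counts for four customers (made up for
# illustration; the real dataset has 100,000 rows and four channel columns).
atm_counts = [2, 8, 4, 6]
atm_std = z_scores(atm_counts)
```

In SAS EM the Cluster node can perform equivalent standardization internally; the sketch only illustrates why it matters for distance-based segmentation.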
SAS Enterprise Miner (SAS EM) Case Study

• SAS EM 資料匯入4步驟 (four steps to import data into SAS EM):
  – Step 1. 新增專案 (New Project)
  – Step 2. 新增資料館 (New / Library)
  – Step 3. 建立資料來源 (Create Data Source)
  – Step 4. 建立流程圖 (Create Diagram)
• SAS EM SEMMA modeling process
Download EM_Data.zip (SAS EM Datasets)
http://mail.tku.edu.tw/myday/teaching/1022/DM/Data/EM_Data.zip
Unzip EM_Data.zip to C:\DATA\EM_Data
[Screenshots: accessing SAS Enterprise Miner through VMware Horizon View Client (softcloud.tku.edu.tw); switching to the English UI with SAS Locale Setup Manager 3.1; SAS Enterprise Guide 5.1 (SAS EG); SAS Enterprise Miner 12.1 (SAS EM), launched from C:\Program Files\SASHome\SASEnterpriseMinerWorkstationConfiguration\12.1\em.exe]

Source: http://www.sasresource.com/faq11.html
SAS Enterprise Guide 5.1 (SAS EG)

• Open a SAS .sas7bdat file and export it to an Excel .xlsx file
• Import an Excel .xlsx file and export it to a SAS .sas7bdat file

[Screenshot walkthrough: New Project; open the SAS data file Profile.sas7bdat; export it to Excel (Profile_Excel.xlsx); import the Excel file into SAS EG; export the Excel file back to a SAS data set (Profile_SAS.sas7bdat)]
SAS Enterprise Miner 12.1 (SAS EM)

SAS EM 資料匯入4步驟 (four steps to import data into SAS EM):
• Step 1. 新增專案 (New Project) – project EM_Project1
• Step 2. 新增資料館 (New / Library)
• Step 3. 建立資料來源 (Create Data Source) – data source EM_LIB.PROFILE
• Step 4. 建立流程圖 (Create Diagram)

[Screenshot walkthrough of the four steps, followed by the SAS EM SEMMA modeling process]
案例情境模型流程 (Case Scenario Model Flow)

[Screenshot walkthrough of the SAS EM SEMMA flow built on EM_Lib.Profile:
• 篩選 (Filter) – 間隔變數 (interval variables)
• 篩選 (Filter) – 匯入的資料 [勘查] (imported data [Explore])
• 篩選 (Filter) – 匯出的資料 [勘查] (exported data [Explore])
• 勘查 (Explore) – 群集 (Cluster), and the cluster results (群集結果)
• 評估 (Assess) – 區段特徵描繪 (Segment Profile), and the segment profile results]
References

• Efraim Turban, Ramesh Sharda, and Dursun Delen, Decision Support and Business Intelligence Systems, Ninth Edition, 2011, Pearson.
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, 2006, Elsevier.
• Jim Georges, Jeff Thompson, and Chip Wells, Applied Analytics Using SAS Enterprise Miner, SAS, 2010.
• SAS Enterprise Miner Course Notes, 2014, SAS.
• SAS Enterprise Miner Training Course, 2014, SAS.
• SAS Enterprise Guide Training Course, 2014, SAS.