Analytic Visualization of "Big" Data
DESCRIPTION
Presented at the Korean Society of Health Informatics and Statistics, November 29, 2013
TRANSCRIPT
2013.11.29 Health Info & Stat
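Korean Society of Health Informatics and Statistics, Fall Conference 2013
Analytic Visualization of “Big” Data (Analytic Data Visualization)
Myung-Hoe Huh (許明會), Department of Statistics, Korea University [email protected]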
Data Visualization
- Descriptive vs Analytic ...
- Small vs Big ...
science
technology
art
Contents
- Scatterplot
- Biplot
- Regression Biplot
- Kernel PCA
- SVM Biplot
Scatterplot
- “Lego” for analytic data visualization
- Reflecting the third variable
quakes: longitude(=x), latitude(=y), depth(=z)
Scatterplot
- For large $n$, over-plotting can seriously obscure the structure of the data.
Skin Segmentation Data: red vs. green
Scatterplot
- For large $n$, the alpha channel (point transparency) can be utilized.
Skin Segmentation Data: red vs. green
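Why transparency helps with over-plotting follows from the compositing arithmetic: if k points of opacity alpha are drawn on the same pixel with standard "over" blending, the resulting opacity is 1 - (1 - alpha)^k. The sketch below only illustrates that arithmetic; the function name and the alpha value 0.01 are illustrative assumptions, not settings from the talk.

```python
def stacked_opacity(alpha: float, k: int) -> float:
    """Opacity of a pixel after k points of opacity `alpha` are drawn on it
    with standard "over" compositing."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return 1.0 - (1.0 - alpha) ** k


# With opaque points (alpha = 1) a single point already saturates the pixel,
# so dense and sparse regions look alike. With a small alpha the darkness
# keeps growing with the local point count, so density becomes visible.
for k in (1, 10, 100):
    print(k, round(stacked_opacity(0.01, k), 3))
```

In plotting libraries this corresponds to setting a small per-point alpha so that darkness encodes local density.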
Scatterplot
- lowess: A nonparametric regression for bivariate data
cars data: distance vs. speed
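The idea behind lowess can be sketched as a locally weighted linear fit with tricube weights. This is a simplified version under stated assumptions: it omits the robustness iterations of the full lowess procedure, the function name `lowess_fit` and the `frac` default are hypothetical, and the data are a synthetic stand-in rather than the cars data.

```python
import numpy as np


def lowess_fit(x, y, frac=2 / 3):
    """Locally weighted linear regression (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))        # neighbours used in each local fit
    A = np.column_stack([np.ones(n), x])      # design matrix for a local line
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[k - 1]                 # bandwidth: k-th smallest distance
        u = np.clip(d / h, 0.0, 1.0) if h > 0 else np.zeros(n)
        w = (1.0 - u**3) ** 3                 # tricube weights
        sw = np.sqrt(w)
        # weighted least squares via ordinary least squares on scaled rows
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted


# Synthetic stand-in for a speed/distance relationship
rng = np.random.default_rng(0)
speed = np.sort(rng.uniform(4, 25, size=50))
dist = 0.15 * speed**2 + rng.normal(0, 5, size=50)
smooth = lowess_fit(speed, dist)
```

Overlaying `smooth` on the scatterplot gives the nonparametric trend curve.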
Scatterplot
- 3D Rotation for three variables
Skin Segmentation Data: red, green, blue
- ggobi: 3D Rotation for four or more variables
Biplot of Observations and Variables, Gabriel (1971)
- The biplot is a graph that shows observations and variables.
Protein data (row: 25 nations, column: 9 protein sources)
Biplot of Observations and Variables, Gabriel (1971)
- Idea: Linear projection
Protein data: variable cereal
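The linear-projection idea can be sketched with a singular value decomposition: observations are placed at their first two principal component scores, variables are drawn as direction vectors, and the inner product of an observation point with a variable vector approximates the centered data value. This is one common biplot scaling, chosen here as an assumption; the function name and the synthetic matrix are illustrative, not the protein data.

```python
import numpy as np


def biplot_coordinates(X):
    """Rank-2 biplot: rows of G are observation points, rows of H are
    variable direction vectors, and G @ H.T approximates the centered X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    G = U[:, :2] * d[:2]      # principal component scores of observations
    H = Vt[:2].T              # loadings: one direction vector per variable
    return G, H


# Synthetic data matrix with 25 rows and 9 columns (not the protein data)
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 9))
G, H = biplot_coordinates(X)

# Projecting observation i on the direction of variable j approximates
# the centered value of that variable for that observation.
approx = G @ H.T
```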
Regression Biplot, Huh and Lee (2013)
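- Regression biplot is a graph for observations of $x_1, \dots, x_p$, arranged by predicted $y$.
- Assume that the model fit is determined by a function of a linear combination of $x_1, \dots, x_p$. For instance,
  $E(y) = b_0 + b_1 x_1 + \dots + b_p x_p$, or $\log \{ \pi / (1 - \pi) \} = b_0 + b_1 x_1 + \dots + b_p x_p$.
- Set the vertical dimension by the direction of the regression coefficients $b = (b_1, \dots, b_p)'$, or $b / \| b \|$.
- Set the horizontal dimension by the direction of the principal axis of $\mathbf{x}_1^{\perp}, \dots, \mathbf{x}_n^{\perp}$,
  where $\mathbf{x}_i^{\perp}$ denotes the orthogonal component generated from the projection of observation $\mathbf{x}_i$ on $b$.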
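The construction of the two axes can be sketched for the linear regression case as follows. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, the predictors are centered before fitting, and the data are synthetic.

```python
import numpy as np


def regression_biplot_axes(X, y):
    """Vertical axis: direction of the regression slopes.
    Horizontal axis: first principal axis of what remains after the
    component of each observation along that direction is removed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)   # least-squares slopes
    u1 = b / np.linalg.norm(b)                    # vertical direction b / ||b||
    X_perp = Xc - np.outer(Xc @ u1, u1)           # orthogonal components
    _, _, Vt = np.linalg.svd(X_perp, full_matrices=False)
    u2 = Vt[0]                                    # principal axis of X_perp
    return b, u1, u2, Xc @ u1, Xc @ u2


rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))
y = 1.0 + X @ np.array([0.7, 1.3, -0.2]) + rng.normal(0, 0.3, size=21)
b, u1, u2, vertical, horizontal = regression_biplot_axes(X, y)
```

Because the vertical coordinate is the centered linear predictor divided by the norm of `b`, observations appear ordered by their predicted response along that axis.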
Regression Biplot, Huh and Lee (2013)
Example 1. Stack Loss Data ($y$: loss of ammonia)
Regression Biplot, Huh and Lee (2013)
Example 2. Magazine Data ($y$: subscription (0, 1))
Kernel PCA, Schölkopf et al. (1998)
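- For observations $x_1, \dots, x_n$ ($p \times 1$), consider the nonlinear mapping $\Phi(x_1), \dots, \Phi(x_n)$ to a Hilbert space, in which $\langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$.
- Denoting $\Phi = (\Phi(x_1), \dots, \Phi(x_n))'$, Kernel PCA is obtained from eigen-decomposing $K = \Phi \Phi' = \{ k(x_i, x_j) \}$.
- Kernel PCA yields a plot of observations by projecting $\Phi(x_1), \dots, \Phi(x_n)$ on
  $v = \dfrac{\Phi' a}{\sqrt{a' \Phi \Phi' a}}$,
  where $a$ is an eigenvector of $K$.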
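Kernel PCA can be sketched numerically as follows. Two points are assumptions rather than statements from the slides: the kernel matrix is centered in feature space before the eigen-decomposition (standard practice in kernel PCA), and the RBF kernel is parameterised as exp(-sigma * ||a - b||^2). Function names and the synthetic data are illustrative.

```python
import numpy as np


def rbf_kernel(A, B, sigma=0.1):
    # Assumed parameterisation: k(a, b) = exp(-sigma * ||a - b||^2)
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma * np.maximum(sq, 0.0))


def kernel_pca_scores(K, n_components=2):
    """Scores of the n observations on the leading kernel principal axes."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # kernel of centered features
    vals, vecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    top = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[top], vecs[:, top]
    # Projecting on v = Phi'a / sqrt(a' K a) gives sqrt(eigenvalue) * a
    scores = vecs * np.sqrt(np.maximum(vals, 0.0))
    return scores, vals


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
scores, vals = kernel_pca_scores(rbf_kernel(X, X, sigma=0.1))
```

With a linear kernel the scores reduce to ordinary principal component scores, up to the sign of each axis, which is a convenient sanity check.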
Kernel PCA Diagram (or Kernel Biplot), Huh (2013)
- Aim: Representation of variables in Kernel PC plot of observations.
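- Proposed Procedure:
  1) For each $x_i = (x_{i1}, \dots, x_{ip})'$, map $x_i + c\, e_j$ on the plane, $j = 1, \dots, p$, where $c$ is a constant and $e_j = (0, \dots, 1, \dots, 0)'$ has its 1 in the $j$-th position. With $\Phi = (\Phi(x_1), \dots, \Phi(x_n))'$ and $a$ an eigenvector of $K = \Phi \Phi'$ defining an axis of the plane, the projection is given by
     $\dfrac{a' \Phi\, \Phi(x_i + c\, e_j)}{\sqrt{a' \Phi \Phi' a}} = \dfrac{\sum_{l=1}^{n} a_l\, k(x_l, x_i + c\, e_j)}{\sqrt{a' K a}}$.
  2) For each $j$, link the projection points of $x_i$ and $x_i + c\, e_j$ by an arrow.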
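The two steps of the procedure can be sketched as below: each observation is shifted by c along one variable at a time, the shifted point is projected on the kernel PC plane, and the pair (original, shifted) defines an arrow. As assumptions, the kernel is centered in feature space and the RBF kernel is parameterised as exp(-sigma * ||a - b||^2); all function names, the value of c, and the synthetic data are hypothetical.

```python
import numpy as np


def rbf(A, B, sigma):
    # Assumed parameterisation: k(a, b) = exp(-sigma * ||a - b||^2)
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma * np.maximum(sq, 0.0))


def kpca_fit(X, sigma):
    n = X.shape[0]
    K = rbf(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)
    top = np.argsort(vals)[::-1][:2]              # two leading axes
    return K, vals[top], vecs[:, top]


def kpca_project(X, K, vals, vecs, X_new, sigma):
    """Coordinates of (possibly new) points on the two kernel principal axes."""
    k_new = rbf(X_new, X, sigma)                  # shape (m, n)
    k_cent = (k_new - k_new.mean(axis=1, keepdims=True)
              - K.mean(axis=0)[None, :] + K.mean())
    return k_cent @ vecs / np.sqrt(vals)


def arrow_diagram(X, sigma=0.1, c=0.5):
    """Return arrow tails (n, 2) and arrow heads (p, n, 2)."""
    X = np.asarray(X, dtype=float)
    K, vals, vecs = kpca_fit(X, sigma)
    tails = kpca_project(X, K, vals, vecs, X, sigma)
    heads = []
    for j in range(X.shape[1]):
        X_shift = X.copy()
        X_shift[:, j] += c                        # x_i + c * e_j for all i
        heads.append(kpca_project(X, K, vals, vecs, X_shift, sigma))
    return tails, np.array(heads)


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
tails, heads = arrow_diagram(X, sigma=0.1, c=0.5)
```

Drawing a segment from `tails[i]` to `heads[j, i]` for every observation i gives the arrow diagram of variable j.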
Example 1. Arrow diagrams for kernel PCA of the iris data with RBF kernel
Example 1. Arrow diagrams for kernel PCA of the iris data with RBF kernel
Example 2. Arrow diagrams for kernel PCA of the spam data
SVM-Guided Biplot as an extension of Regression Biplot
- Idea: Combine Linear/Logistic Regression Biplot and Kernel PCA.
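- Classification/Regression Part:
  Classified as $-1$ or $1$ for $x_1, \dots, x_n$.
  SVM classifier $f(x) = w' \Phi(x) + b$,
  where $w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i)$, $\alpha_i \geq 0$.
  Vertical dimension is set to the direction of $w$.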
SVM-Guided Biplot: Classification
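- Kernel PCA Part: remove the component along the SVM direction $w$,
  $\Phi^{\perp}(x_i) = \left( I - \dfrac{w w'}{w' w} \right) \Phi(x_i)$, $i = 1, \dots, n$.
  ∴ $\Phi^{\perp}(x_i)' \Phi^{\perp}(x_j) = \Phi(x_i)' \Phi(x_j) - \dfrac{\Phi(x_i)' w \; w' \Phi(x_j)}{w' w}$, $i, j = 1, \dots, n$. Hence $K \to K^{\perp}$, the matrix of these inner products.
  Horizontal dimension is determined by eigen-decomposing $K^{\perp}$.
- Perturbation Scheme for Arrow Diagrams.
  Define $x_i + c\, e_j$, $p \times 1$, where $c\, e_j$ represents a perturbation whose magnitude is controlled by $c$. Then, project $\Phi(x_i + c\, e_j)$ on the first (vertical) and the second (horizontal) dimension.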
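The two dimensions can be sketched in terms of the kernel matrix alone. Writing the SVM direction as w = sum_i c_i Phi(x_i), the inner products needed are Phi(x_i)'w = (Kc)_i and w'w = c'Kc, so removing the w-component turns K into K - (Kc)(Kc)' / (c'Kc). In the sketch below the coefficients `c` are random stand-ins for the dual coefficients of a fitted SVM, no SVM is actually trained, feature-space centering is omitted, and all names are hypothetical.

```python
import numpy as np


def svm_guided_axes(K, c):
    """K: (n, n) kernel matrix; c: (n,) coefficients with w = sum_i c_i Phi(x_i).
    Returns vertical and horizontal coordinates of the n observations and
    the kernel matrix of the components orthogonal to w."""
    Kc = K @ c                                   # Phi(x_i)'w for each i
    ww = c @ Kc                                  # w'w
    vertical = Kc / np.sqrt(ww)                  # projection on w / ||w||
    K_perp = K - np.outer(Kc, Kc) / ww           # inner products after removing w
    vals, vecs = np.linalg.eigh(K_perp)          # ascending eigenvalues
    a = vecs[:, -1]                              # leading eigenvector
    horizontal = np.sqrt(max(vals[-1], 0.0)) * a
    return vertical, horizontal, K_perp


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-0.1 * sq)                            # assumed RBF form exp(-sigma * d^2)
c = rng.normal(size=20)                          # stand-in for alpha_i * y_i
vertical, horizontal, K_perp = svm_guided_axes(K, c)
```

The vertical coordinate equals the SVM decision value without its intercept, rescaled by the norm of w, so the class boundary appears as a horizontal line in the plot.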
Example 1. Iris Data: Versicolor vs. Virginica [sigma=0.1, C=1]
Importance of Variables (in the case of large $p$)
- It is necessary to select a small number of variables in determining the first and second dimensions.
- Measures of Importance (definition): length of arrows 1) in the vertical direction, 2) in the horizontal direction.
- Plot arrow diagrams for important variables only.
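Given arrow tails and heads from an arrow diagram, the two importance measures can be sketched as below. Averaging the absolute arrow lengths over observations is an assumption made here for illustration; the slides state only that importance is measured by arrow length in each direction. The array layout (column 0 horizontal, column 1 vertical) and the function name are likewise hypothetical.

```python
import numpy as np


def arrow_importance(tails, heads):
    """tails: (n, 2) points; heads: (p, n, 2) arrow heads.
    Column 0 is the horizontal coordinate, column 1 the vertical one.
    Returns the mean absolute arrow length per variable in each direction."""
    delta = np.asarray(heads, dtype=float) - np.asarray(tails, dtype=float)[None, :, :]
    horizontal = np.abs(delta[:, :, 0]).mean(axis=1)
    vertical = np.abs(delta[:, :, 1]).mean(axis=1)
    return vertical, horizontal


# Toy example: variable 0 moves points sideways, variable 1 moves them down
tails = np.zeros((3, 2))
heads = np.array([
    [[1.0, 0.0]] * 3,
    [[0.0, -2.0]] * 3,
])
vertical, horizontal = arrow_importance(tails, heads)
top_vertical = np.argsort(vertical)[::-1]    # variables ranked by vertical importance
```

Arrow diagrams would then be drawn only for the highest-ranked variables in each direction.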
Example 2. Spam Data [sigma=0.1, C=10]
SVM-Guided Biplot: Regression
- The same method can be applied to SVM regression.
- Example 3. Aerobic Fitness data for oxygen uptake ($= y$) with RBF kernel (sigma=0.1, C=10, epsilon=0.1)
Concluding Remarks
- The biplot method can be extended to suit linear regression or classification (logistic regression).
- The biplot method can be extended to allow nonlinear mapping of observations and variables by fully utilizing the kernel trick.
http://blog.naver.com/huh4200
Goldfish bowl (on the iPad)
References
Gabriel, K.R. (1971). “The biplot graphic display of matrices with application to principal component analysis”. Biometrika, 58. 453-467.
Huh, M.H. (2013). “Arrow diagrams for kernel principal component analysis”. Communications for Statistical Applications and Methods, 20. 175-184.
Huh, M.H. (2013). “SVM-guided biplot of observations and variables”. Communications for Statistical Applications and Methods. (to appear)
Huh, M.H. and Lee, Y.G. (2013). “Biplots of multivariate data guided by linear and/or logistic regression”. Communications for Statistical Applications and Methods, 20. 129-136.
Schölkopf, B., Smola, A. and Müller, K.R. (1998). “Nonlinear component analysis as a kernel eigenvalue problem”. Neural Computation, 10. 1299-1319.