"빅" 데이터의 분석적 시각화



DESCRIPTION

Presented at the Korean Society of Health Informatics and Statistics on 29 November 2013

TRANSCRIPT

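Page 1

2013.11.29 Health Info & Stat

Korean Society of Health Informatics and Statistics, Fall Conference 2013

Analytic Visualization of “Big” Data (Analytic Data Visualization)

Myung-Hoe Huh (許明會), Department of Statistics, Korea University [email protected]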

Page 2

Data Visualization

- Descriptive vs Analytic ...

- Small vs Big ...

science

technology

art

Page 3

Contents

- Scatterplot

- Biplot

- Regression Biplot

- Kernel PCA

- SVM Biplot

Page 4

Scatterplot

- “Lego” for analytic data visualization

- Reflecting the third variable

quakes: longitude(=x), latitude(=y), depth(=z)

Page 5

Scatterplot

- For large $n$, over-plotting can seriously obscure the structure of the data.

Skin Segmentation Data: red vs. green

Page 6

Scatterplot

- For large $n$, the alpha channel (point transparency) can be utilized.

Skin Segmentation Data: red vs. green
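To see why transparency helps with over-plotting, here is a minimal sketch. It assumes the usual "over" compositing rule, under which k overlapping points of opacity alpha combine to opacity 1 - (1 - alpha)^k. The function names and the alpha value 0.05 are illustrative and do not come from the slides.

```python
def stacked_opacity(alpha: float, k: int) -> float:
    """Combined opacity of k overlapping points, each drawn with opacity alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return 1.0 - (1.0 - alpha) ** k


def points_to_reach(alpha: float, target: float) -> int:
    """Smallest number of overlapping points whose combined opacity reaches target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    k = 0
    while stacked_opacity(alpha, k) < target:
        k += 1
    return k


# With a small alpha, darkness keeps growing with local density
# instead of saturating after the first point.
for k in (1, 10, 50, 100):
    print(k, round(stacked_opacity(0.05, k), 3))
```

With alpha = 1 a single point already saturates the pixel, which is the over-plotting problem. A small alpha turns the scatterplot into a rough density display.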

Page 7

Scatterplot

- lowess: A nonparametric regression for bivariate data

cars data: distance vs. speed
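The idea behind lowess can be sketched as a weighted straight-line fit around each point, using tricube weights over the nearest fraction of the data. This is a simplified illustration under stated assumptions: it omits the robustness iterations of the full lowess procedure, the name `lowess_fit` and the default `frac` are hypothetical, and the data are synthetic rather than the cars data.

```python
import numpy as np


def lowess_fit(x, y, frac=2 / 3):
    """Locally weighted linear fit evaluated at each x[i] (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(2, int(np.ceil(frac * n))))   # points used in each local fit
    design = np.column_stack([np.ones(n), x])    # local model: a + b * x
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]                 # bandwidth: distance to the k-th nearest point
        if h == 0.0:
            weights = np.ones(n) * (dist == 0.0)  # all neighbours tie with x[i]
        else:
            weights = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3  # tricube weights
        root_w = np.sqrt(weights)
        # Weighted least squares via ordinary least squares on rescaled rows.
        coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted


# Synthetic example: a noisy curve standing in for a real data set.
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0.0, 10.0, size=50))
y_demo = np.sin(x_demo) + 0.2 * rng.normal(size=50)
smooth = lowess_fit(x_demo, y_demo, frac=0.3)
```

A smaller `frac` follows the data more closely, while a larger one gives a smoother curve.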

Page 8

Scatterplot

- 3D Rotation for three variables

Skin Segmentation Data: red, green, blue

- ggobi: 3D Rotation for four or more variables

Page 9

Biplot of Observations and Variables, Gabriel (1971)

- The biplot is a graph that shows observations and variables.

Protein data (row: 25 nations, column: 9 protein sources)

Page 10

Biplot of Observations and Variables, Gabriel (1971)

- Idea: Linear projection

Protein data: variable cereal
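The linear-projection idea can be illustrated with a principal-component biplot computed from the singular value decomposition. This is a sketch under assumptions: the columns are standardized, the row markers are principal component scores, and the function name `biplot_coordinates` is hypothetical. The inner product of a row marker and a column marker then approximates the standardized data value, which is what reading an observation's projection onto a variable arrow amounts to.

```python
import numpy as np


def biplot_coordinates(X, scale=True):
    """Return row markers G (n x 2), column markers H (p x 2) and the
    centred (optionally scaled) data Z, so that Z is approximated by G @ H.T."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)
    if scale:
        Z = Z / Z.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    G = U[:, :2] * s[:2]    # observations: scores on the first two principal axes
    H = Vt[:2].T            # variables: directions (loadings) of the arrows
    return G, H, Z


# Synthetic stand-in for a nations-by-protein-sources table.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(25, 9))
G, H, Z = biplot_coordinates(X_demo)
approx_first_variable = G @ H[0]   # projections of all observations onto arrow 1
```

The approximation is exact when the centred data have rank two and otherwise is the best rank-two fit in the least-squares sense.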

Page 11

Regression Biplot, Huh and Lee (2013)

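- Regression biplot is a graph for observations of $x_1, \dots, x_p$, arranged by predicted $y$.

- Assume that the model fit is determined by a function of a linear combination of $x_1, \dots, x_p$. For instance,

$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p$, or $\log\{\hat{\pi}/(1 - \hat{\pi})\} = b_0 + b_1 x_1 + \cdots + b_p x_p$.

- Set the vertical dimension by the direction of regression coefficients

$\mathbf{b} = (b_1, \dots, b_p)'$, or $\mathbf{v}_1 = \mathbf{b}/\|\mathbf{b}\|$.

- Set the horizontal dimension by the direction of the principal axis of

$\mathbf{x}_1^{\perp}, \dots, \mathbf{x}_n^{\perp}$,

where $\mathbf{x}_i^{\perp}$ denotes the orthogonal component generated from the projection of observation $\mathbf{x}_i$ on $\mathbf{b}$.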
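The construction of the two axes can be sketched for the linear-regression case as follows. This is an illustration under assumptions, not the authors' implementation: the predictors are centred but not rescaled, the coefficients come from ordinary least squares, and the name `regression_biplot_axes` is hypothetical. The logistic case would replace the least-squares step with a logistic fit.

```python
import numpy as np


def regression_biplot_axes(X, y):
    """Vertical axis: unit vector along the regression coefficients.
    Horizontal axis: first principal axis of the part of X orthogonal to it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    # Least-squares slopes; centring X and y absorbs the intercept.
    b, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    v1 = b / np.linalg.norm(b)
    # Remove from each observation its projection on v1.
    X_perp = Xc - np.outer(Xc @ v1, v1)
    # Leading right singular vector = principal axis of the orthogonal component.
    _, _, Vt = np.linalg.svd(X_perp, full_matrices=False)
    v2 = Vt[0]
    coords = np.column_stack([Xc @ v2, Xc @ v1])   # columns: (horizontal, vertical)
    return v1, v2, coords


# Synthetic example with three predictors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
v1, v2, coords = regression_biplot_axes(X_demo, y_demo)
```

Because the vertical coordinate is proportional to the centred fitted value, observations end up ordered by their predicted response along the vertical axis.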

Page 12

Regression Biplot, Huh and Lee (2013)

Example 1. Stack Loss Data ($y$: loss of ammonia)

Page 13

Regression Biplot, Huh and Lee (2013)

Example 2. Magazine Data ($y$: subscription (0, 1))

Page 14

Kernel PCA, Schölkopf et al. (1998)

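- For observations $\mathbf{x}_1, \dots, \mathbf{x}_n$ ($p \times 1$), consider the nonlinear mapping $\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n)$ to a Hilbert space, in which $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j)$.

- Denoting $K = \{k(\mathbf{x}_i, \mathbf{x}_j)\}$, Kernel PCA is obtained from eigen-decomposing

$\tilde{K} = (I - \mathbf{1}\mathbf{1}'/n)\, K\, (I - \mathbf{1}\mathbf{1}'/n)$.

- Kernel PCA yields a plot of observations by projecting the centred $\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n)$ on

$\sum_i a_{i1}\, \phi(\mathbf{x}_i)$ and $\sum_i a_{i2}\, \phi(\mathbf{x}_i)$,

where $\mathbf{a}_l = (a_{1l}, \dots, a_{nl})'$, $l = 1, 2$, is an eigenvector of $\tilde{K}$.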
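A compact sketch of kernel PCA with an RBF kernel is given below. It assumes the parametrisation k(x, z) = exp(-sigma * ||x - z||^2), double centring of the kernel matrix with I - 11'/n, and eigenvectors rescaled so that each feature-space axis has unit length. The function names and the sigma value are illustrative choices.

```python
import numpy as np


def rbf_kernel(A, B, sigma=0.1):
    """k(a, b) = exp(-sigma * ||a - b||^2) for all rows a of A and b of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma * np.maximum(sq, 0.0))


def kernel_pca(X, sigma=0.1, n_components=2):
    """Return the scores of the n observations, the scaled eigenvectors and K."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix I - 11'/n
    K_centred = J @ K @ J
    eigval, eigvec = np.linalg.eigh(K_centred)   # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    A = eigvec[:, order] / np.sqrt(eigval[order])  # unit-length axes in feature space
    scores = K_centred @ A                       # projections of the observations
    return scores, A, K


# Synthetic example with four variables.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(15, 4))
scores, A, K = kernel_pca(X_demo, sigma=0.1)
```

The feature map is never formed explicitly; every quantity is computed from kernel evaluations alone.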

Page 15

Kernel PCA Diagram (or Kernel Biplot), Huh (2013)

- Aim: Representation of variables in Kernel PC plot of observations.

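- Proposed Procedure:

1) For each $i = 1, \dots, n$, map $\mathbf{x}_i + c\,\mathbf{e}_j$ on the plane, $j = 1, \dots, p$, where $c$ is a constant and $\mathbf{e}_j = (0, \dots, 1, \dots, 0)'$. For a point $\mathbf{z}$ with $\mathbf{k}_z = (k(\mathbf{x}_1, \mathbf{z}), \dots, k(\mathbf{x}_n, \mathbf{z}))'$, kernel matrix $K = \{k(\mathbf{x}_i, \mathbf{x}_j)\}$ and $\mathbf{a}_l$ the $l$-th eigenvector of the centred kernel matrix, the projection on axis $l$ is given by

$\mathbf{a}_l'\mathbf{k}_z - \frac{1}{n}\, \mathbf{a}_l'\mathbf{1}\mathbf{1}'\mathbf{k}_z - \frac{1}{n}\, \mathbf{a}_l' K \mathbf{1} + \frac{1}{n^2}\, \mathbf{a}_l'\mathbf{1}\mathbf{1}' K \mathbf{1}$.

2) For each $j$, link the projection points of $\mathbf{x}_i$ and $\mathbf{x}_i + c\,\mathbf{e}_j$ by an arrow.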
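The arrow construction can be sketched as below: fit kernel PCA, project each observation and its copy shifted by c along variable j, and join the two projected points. The sketch assumes an RBF kernel exp(-sigma * ||x - z||^2) and the compact form a'(I - 11'/n)(k_z - K1/n) of the projection. The names `fit_kpca`, `project` and `arrow_diagram` are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma * np.maximum(sq, 0.0))


def fit_kpca(X, sigma, n_components=2):
    """Kernel matrix K and eigenvectors A of the centred kernel matrix (scaled)."""
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    J = np.eye(n) - np.ones((n, n)) / n
    eigval, eigvec = np.linalg.eigh(J @ K @ J)
    order = np.argsort(eigval)[::-1][:n_components]
    return K, eigvec[:, order] / np.sqrt(eigval[order])


def project(Z, X, K, A, sigma):
    """Scores of the rows of Z: a_l' (I - 11'/n) (k_z - K1/n) for each axis l."""
    Kz = rbf_kernel(Z, X, sigma)                  # row r holds k(x_1, z_r), ..., k(x_n, z_r)
    shifted = Kz - K.mean(axis=0)[None, :]        # k_z - K1/n (K is symmetric)
    shifted = shifted - shifted.mean(axis=1, keepdims=True)   # apply I - 11'/n
    return shifted @ A


def arrow_diagram(X, j, c, sigma):
    """Start and end points of the arrows for variable j with step size c."""
    X = np.asarray(X, dtype=float)
    K, A = fit_kpca(X, sigma)
    start = project(X, X, K, A, sigma)
    perturbed = X.copy()
    perturbed[:, j] += c                          # x_i + c * e_j for every observation
    end = project(perturbed, X, K, A, sigma)
    return start, end


rng = np.random.default_rng(3)
X_demo = rng.normal(size=(10, 3))
start, end = arrow_diagram(X_demo, j=1, c=0.5, sigma=0.1)
```

Each row of `end - start` is one arrow. Unlike the linear biplot, arrows for the same variable may point in different directions at different observations.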

Page 16

Example 1. Arrow diagrams for kernel PCA of the iris data with RBF kernel

Page 17

Example 1. Arrow diagrams for kernel PCA of the iris data with RBF kernel

Page 18

Example 2. Arrow diagrams for kernel PCA of the spam data

Page 19

SVM-Guided Biplot as an extension of Regression Biplot

- Idea: Combine Linear/Logistic Regression Biplot and Kernel PCA.

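- Classification/Regression Part:

$y_i$ is classified as $-1$ or $1$ for $i = 1, \dots, n$.

SVM classifier $f(\mathbf{x}) = \mathbf{w}'\phi(\mathbf{x}) + b$,

where $\mathbf{w} = \sum_i \alpha_i y_i\, \phi(\mathbf{x}_i)$, $\alpha_i \geq 0$.

Vertical dimension is set to $\mathbf{v}_1 = \mathbf{w}/\|\mathbf{w}\|$ ($\|\mathbf{w}\|^2 = \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$).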

Page 20

SVM-Guided Biplot: Classification

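- Kernel PCA Part:

With $\mathbf{v}_1$ the unit vector of the vertical dimension, $\phi^{\perp}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - (\mathbf{v}_1'\phi(\mathbf{x}_i))\,\mathbf{v}_1$, $i = 1, \dots, n$.

∴ $\phi^{\perp}(\mathbf{x}_i)'\phi^{\perp}(\mathbf{x}_j) = \phi(\mathbf{x}_i)'\phi(\mathbf{x}_j) - (\mathbf{v}_1'\phi(\mathbf{x}_i))(\mathbf{v}_1'\phi(\mathbf{x}_j))$, $i, j = 1, \dots, n$. Hence $K \to K^{\perp} = K - \mathbf{g}\mathbf{g}'$, where $\mathbf{g} = (g_1, \dots, g_n)'$ with $g_i = \mathbf{v}_1'\phi(\mathbf{x}_i)$.

Horizontal dimension is determined by eigen-decomposing $K^{\perp}$.

- Perturbation Scheme for Arrow Diagrams.

Define $\mathbf{x}_i + c\,\mathbf{e}_j$, with $\mathbf{e}_j$ of size $p \times 1$, where $\mathbf{e}_j$ represents a perturbation of which the magnitude is controlled by $c$. Then, project $\phi(\mathbf{x}_i + c\,\mathbf{e}_j)$ on the first (vertical) and the second (horizontal) dimension.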
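The two dimensions can be sketched in kernel terms as follows, taking the dual coefficients of an already fitted SVM as given. This is an illustration under assumptions: the input `a` (with entries alpha_i * y_i) is hypothetical and no SVM is trained here, the bias term and any centring of the kernel matrix are omitted for brevity, and the name `svm_guided_axes` is not from the source.

```python
import numpy as np


def svm_guided_axes(K, a):
    """K: n x n kernel matrix; a: dual coefficients a_i = alpha_i * y_i.
    Returns vertical scores g, horizontal scores and the deflated kernel matrix."""
    K = np.asarray(K, dtype=float)
    a = np.asarray(a, dtype=float)
    w_norm = np.sqrt(a @ K @ a)                 # ||w|| for w = sum_i a_i phi(x_i)
    g = K @ a / w_norm                          # g_i = v1' phi(x_i), v1 = w / ||w||
    K_perp = K - np.outer(g, g)                 # inner products after removing v1
    eigval, eigvec = np.linalg.eigh(K_perp)     # eigenvalues in ascending order
    # Scores on the leading axis of the orthogonal component: sqrt(lambda) * u.
    horizontal = np.sqrt(max(eigval[-1], 0.0)) * eigvec[:, -1]
    return g, horizontal, K_perp


# Synthetic kernel matrix and made-up dual coefficients for illustration only.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(8, 3))
K_demo = np.exp(-0.1 * ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1))
a_demo = rng.normal(size=8)
g, horizontal, K_perp = svm_guided_axes(K_demo, a_demo)
```

A quick check of the deflation is that `K_perp @ a` vanishes, since the direction of w has been removed from every feature vector.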

Page 21

Example 1. Iris Data: Versicolor vs. Virginica [sigma=0.1, C=1]

Page 22

Importance of Variables (in the case of large $p$)

- It is necessary to select a small number of variables in determining the first and second dimensions.

- Measures of Importance (definition): Length of arrows 1) in vertical direction, 2) in horizontal direction.

- Plot arrow diagrams for important variables only.
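One way to turn arrow lengths into a ranking is sketched below. The aggregation over observations is an assumption (mean absolute length, separately for the vertical and the horizontal direction), since the slide does not state how the lengths are combined. The array layout and the function names are hypothetical.

```python
import numpy as np


def arrow_importance(start, end):
    """start: (n, 2) projected observations as (horizontal, vertical) pairs.
    end: (p, n, 2) projected points after perturbing each of the p variables.
    Returns mean absolute arrow length per variable in each direction."""
    arrows = np.asarray(end, dtype=float) - np.asarray(start, dtype=float)[None, :, :]
    horizontal = np.abs(arrows[:, :, 0]).mean(axis=1)
    vertical = np.abs(arrows[:, :, 1]).mean(axis=1)
    return vertical, horizontal


def top_variables(importance, k):
    """Indices of the k variables with the largest importance, largest first."""
    return np.argsort(importance)[::-1][:k]


# Toy input: variable 0 moves points horizontally, variable 1 vertically.
start_demo = np.zeros((3, 2))
end_demo = np.stack([
    np.tile([1.0, 0.0], (3, 1)),
    np.tile([0.0, 2.0], (3, 1)),
])
vertical, horizontal = arrow_importance(start_demo, end_demo)
selected = top_variables(vertical, k=1)
```

Only the variables returned by `top_variables` would then get their own arrow diagram.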

Page 23

Example 2. Spam Data [sigma=0.1, C=10]

Page 24

SVM-Guided Biplot: Regression

- The same method can be applied to SVM regression.

- Example 3. Aerobic Fitness data for oxygen uptake (= $y$) with RBF kernel (sigma=0.1, C=10, epsilon=0.1)

Page 25

Concluding Remarks

- Biplot method can be extended to be suited for linear regression or classification (logistic regression).

- Biplot method can be extended to allow nonlinear mapping of observations and variables, by fully utilizing the kernel trick.

http://blog.naver.com/huh4200

Goldfish bowl (on the iPad)

Page 26

References

Gabriel, K.R. (1971). “The biplot display of matrices with the application to principal component analysis”. Biometrika, 58. 453-467.

Huh, M.H. (2013). “Arrow diagrams for kernel principal component analysis”. Communications for Statistical Applications and Methods, 20. 175-184.

Huh, M.H. (2013). “SVM-guided biplot of observations and variables”. Communications for Statistical Applications and Methods. (to appear)

Huh, M.H. and Lee, Y.G. (2013). “Biplots of multivariate data guided by linear and/or logistic regression”. Communications for Statistical Applications and Methods, 20. 129-136.

Schölkopf, B., Smola, A. and Müller, K.R. (1998). “Nonlinear component analysis as a kernel eigenvalue problem”. Neural Computation, 10. 1299-1319.