Feature Extraction
Lecturer: 虞台文



Content:
- Principal Component Analysis (PCA)
- PCA Calculation for the Fewer-Sample Case
- Factor Analysis
- Fisher's Linear Discriminant Analysis
- Multiple Discriminant Analysis

Principal Component Analysis (PCA)

PCA is a linear procedure that finds the direction in input space along which most of the energy of the input lies. It is used for feature extraction and dimension reduction, and is also called the (discrete) Karhunen-Loève transform, or the Hotelling transform.

The Basic Concept. Assume the data x (a random vector) has zero mean. PCA finds a unit vector w that captures the largest amount of variance of the data, i.e., the variance of the projection $y = w^T x$.

The Method. Form the covariance matrix $C = E[xx^T]$. Remark: C is symmetric and positive semidefinite. Maximize $J(w) = w^T C w$ subject to $w^T w = 1$. Using the method of Lagrange multipliers, define $L(w, \lambda) = w^T C w - \lambda (w^T w - 1)$. An extreme point $w^*$ satisfies $\nabla_w L = 0$; setting $2 C w^* - 2 \lambda w^* = 0$ gives $C w^* = \lambda w^*$.

Discussion. At the extreme points, w is an eigenvector of C and $\lambda$ is its corresponding eigenvalue, with $w^T C w = \lambda$. Let $w_1, w_2, \ldots, w_d$ be the eigenvectors of C whose corresponding eigenvalues are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. They are called the principal components of C, and their significance can be ordered according to their eigenvalues. Because C is symmetric and positive semidefinite, its eigenvectors are orthogonal; hence they form a basis of the feature space. For dimensionality reduction, choose only a few of them.

Applications: image processing, signal processing, compression, feature extraction, pattern recognition.

Example. Projecting the data onto the most significant axis facilitates classification, and it also achieves dimensionality reduction.

Issues. PCA is effective for identifying the multivariate signal distribution, so it is good for signal reconstruction. However, it may be inappropriate for pattern classification: the most significant component obtained using PCA is not necessarily the most significant component for classification.

Whitening. Whitening is a process that transforms the random vector $x = (x_1, x_2, \ldots, x_n)^T$ (assumed zero mean) into $z = (z_1, z_2, \ldots, z_n)^T$ with zero mean and unit covariance, $E[zz^T] = I$. Such a z is said to be white or sphered. This implies that all of its elements are uncorrelated; however, it does not imply that its elements are independent.

Whitening Transform. Let V be a whitening transform, so that $z = Vx$ and $V C_x V^T = I$. Decompose $C_x$ as $C_x = E D E^T$, where D is a diagonal matrix (of eigenvalues) and E is an orthonormal matrix (of eigenvectors), and set $V = D^{-1/2} E^T$. Then $V C_x V^T = D^{-1/2} E^T E D E^T E D^{-1/2} = I$.

If V is a whitening transform and U is any orthonormal matrix, then UV, i.e., a rotation, is also a whitening transform. Proof: $(UV) C_x (UV)^T = U (V C_x V^T) U^T = U U^T = I$.

Why Whitening? With PCA, we usually choose several major eigenvectors as the basis for representation. This basis is efficient for reconstruction, but may be inappropriate for other applications, e.g., classification. By whitening, we can rotate the basis to obtain more interesting features.
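The following NumPy sketch walks through the procedure above on synthetic data: form the covariance of zero-mean data, take its eigenvectors as principal components, and build the whitening transform $V = D^{-1/2} E^T$. It is a minimal illustration, not code from the lecture; the toy data and all variable names are invented for this example.

```python
import numpy as np

# Toy zero-mean data: 500 samples of dimension 3 (purely illustrative).
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.1]])
X = rng.normal(size=(500, 3)) @ A
X -= X.mean(axis=0)                      # PCA assumes zero-mean data

# Covariance matrix C = E[x x^T]; symmetric and positive semidefinite.
C = X.T @ X / X.shape[0]

# Principal components: eigenvectors of C, ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
lam, W = eigvals[order], eigvecs[:, order]

# Dimensionality reduction: keep only the k most significant components.
k = 2
Y = X @ W[:, :k]                         # projections y = w_i^T x

# Whitening transform V = D^{-1/2} E^T, so that cov(z) = I for z = V x.
V = np.diag(1.0 / np.sqrt(lam)) @ W.T
Z = X @ V.T
print(np.round(np.cov(Z, rowvar=False), 2))   # approximately the identity

# Rotating a whitening transform by an orthonormal U keeps it whitening.
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Z2 = X @ (U @ V).T
print(np.round(np.cov(Z2, rowvar=False), 2))  # still approximately the identity
```

Both printed covariance matrices come out close to the identity, which is the defining property of a white (sphered) vector.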
PCA Calculation for the Fewer-Sample Case

Complexity of the PCA Calculation. Let C be of size $n \times n$. The time complexity of the direct computation is $O(n^3)$. Is there a more efficient method when the number of samples N is much smaller than the dimension n?

PCA for a Covariance Matrix from Fewer Samples. Consider N zero-mean samples $x_1, \ldots, x_N$ with $N \ll n$, collected as the columns of the $n \times N$ matrix $X = [x_1\ x_2\ \cdots\ x_N]$, so that $C = \frac{1}{N} X X^T$. Define the $N \times N$ matrix $T = \frac{1}{N} X^T X$, and let $u_1, \ldots, u_N$ be the orthonormal eigenvectors of T with corresponding eigenvalues $\mu_i$, i.e., $T u_i = \mu_i u_i$.

Eigenvectors of C. Since $C (X u_i) = \frac{1}{N} X X^T X u_i = X (T u_i) = \mu_i (X u_i)$, each $X u_i$ is an eigenvector of C. Define $p_i = \frac{1}{\sqrt{N \mu_i}} X u_i$; then the $p_i$ are orthonormal eigenvectors of C with eigenvalues $\mu_i$. (A numerical sketch of this construction appears after the Factor Analysis section below.)

Factor Analysis

What is a Factor? If several variables correlate highly, they might measure aspects of a common underlying dimension. These dimensions are called factors. Factors are classification axes along which the measures can be plotted. The greater the loading of variables on a factor, the more that factor explains the intercorrelations between those variables.

Graph Representation. [Figure: variables plotted against two factor axes, Quantitative Skill (F1) and Verbal Skill (F2), with loadings ranging from -1 to +1.]

What is Factor Analysis? A method for investigating whether a number of variables of interest $Y_1, Y_2, \ldots, Y_n$ are linearly related to a smaller number of unobservable factors $F_1, F_2, \ldots, F_m$. It is used for data reduction and summarization: a statistical approach to analyzing the interrelationships among a large number of variables and to explaining these variables in terms of their common underlying dimensions (factors).

Example. The observable data are students' grades. What factors influence them? Quantitative skill? Verbal skill? These factors are unobservable.

The Model. $y = Bf + \epsilon$, where y is the observation vector, B is the factor-loading matrix, f is the factor vector, and $\epsilon$ is Gaussian noise. The covariance of the observations can be estimated from data and, from the model, decomposed as $C_y = B B^T + Q$, where Q is the diagonal matrix of noise variances. The diagonal entries of $B B^T$ are the communalities (the explained part of each variable's variance); the diagonal entries of Q are the specific variances (the unexplained part).

Goal. Our goal is to minimize the discrepancy $\| C_y - (B B^T + Q) \|$.

Uniqueness. Is the solution unique? No; there are an infinite number of solutions, since if $B^*$ is a solution and T is an orthonormal transformation (rotation), then $B^* T$ is also a solution.

Example. Two different loading matrices can reproduce the same $C_y$. Which one is better? In the left one, each factor has nonzero loadings on all variables; in the right one, each factor controls different variables. The right one is easier to interpret, which motivates factor rotation.

The Method. Determine the first set of loadings using the principal-component method, then rotate them.

Factor Rotation. Given the factor-loading matrix B and a rotation matrix T, the rotated factor-loading matrix is $BT$. Rotation criteria include Varimax, Quartimax, Equimax, Orthomax, and Oblimin.

Varimax. Let $b_{jk}$ be the entries of the rotated loading matrix $B = B_0 T$, with p variables and m factors. The varimax criterion maximizes the variance of the squared loadings within each factor,
$V = \sum_{k=1}^{m} \left[ \frac{1}{p} \sum_{j=1}^{p} b_{jk}^4 - \left( \frac{1}{p} \sum_{j=1}^{p} b_{jk}^2 \right)^2 \right]$,
subject to T being orthonormal; the constraint can be handled by constructing a Lagrangian. V is large when, within each column, each factor loads strongly on a few variables and weakly on the rest.

Iterative Procedure. Initially, obtain $B_0$ by whatever method, e.g., PCA, and set $T_0 = I$ as the initial rotation matrix. Then iterate: evaluate the criterion for the current rotation, find an updated rotation T that increases it, and stop once the improvement falls below a threshold; otherwise repeat.
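Here is a small NumPy sketch of the fewer-sample computation described above: it diagonalizes the $N \times N$ matrix $T = \frac{1}{N} X^T X$ instead of the $n \times n$ covariance and maps the eigenvectors back via $p_i = X u_i / \sqrt{N \mu_i}$. It assumes the standard Gram-matrix construction as reconstructed above; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4096, 20                        # dimension n much larger than sample count N
X = rng.normal(size=(n, N))            # columns are the samples x_1, ..., x_N
X -= X.mean(axis=1, keepdims=True)     # remove the sample mean

# Direct PCA would diagonalize C = (1/N) X X^T, an n x n matrix: O(n^3).
# Instead diagonalize the small N x N matrix T = (1/N) X^T X.
T = X.T @ X / N
mu, U = np.linalg.eigh(T)              # T u_i = mu_i u_i
keep = np.argsort(mu)[::-1][:N - 1]    # drop the zero eigenvalue from mean removal
mu, U = mu[keep], U[:, keep]

# Map back: C (X u_i) = mu_i (X u_i); normalize to obtain orthonormal p_i.
P = X @ U / np.sqrt(N * mu)            # columns are eigenvectors p_i of C

# Checks (done without ever forming the n x n matrix C).
print(np.allclose(P.T @ P, np.eye(N - 1)))      # the p_i are orthonormal
print(np.allclose(X @ (X.T @ P) / N, P * mu))   # C p_i = mu_i p_i
```

For the varimax rotation, the sketch below implements the commonly used SVD-based iteration for the criterion stated above (initial loadings $B_0$, rotation T updated until the objective stops improving). This is a generic implementation of varimax, not necessarily the exact update rule derived on the slides.

```python
import numpy as np

def varimax(B0, max_iter=100, tol=1e-6):
    """Rotate a p x m loading matrix B0 by an orthonormal T that maximizes
    the varimax criterion; returns the rotated loadings B0 @ T and T."""
    p, m = B0.shape
    T = np.eye(m)                        # T_0 = I
    score = 0.0
    for _ in range(max_iter):
        L = B0 @ T                       # current rotated loadings
        # SVD of the varimax gradient direction gives the next rotation.
        G = B0.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        if s.sum() < score * (1 + tol):  # stop when the objective stops improving
            break
        score = s.sum()
    return B0 @ T, T
```

Applied to loadings obtained by the principal-component method, the rotation tends to drive each column toward a few large loadings and many near-zero ones, the "each factor controls different variables" pattern preferred in the example above.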
Fisher's Linear Discriminant Analysis

Main Concept. PCA seeks directions that are efficient for representation; discriminant analysis seeks directions that are efficient for discrimination. The efficiency of classification depends on the projection direction.

Criterion (Two-Category Case). Project the samples onto a direction w with $\|w\| = 1$. Let $m_1$ and $m_2$ be the class means; the projected means are $\tilde{m}_i = w^T m_i$.

Between-Class Scatter. The separation of the projected means is $(\tilde{m}_1 - \tilde{m}_2)^2 = w^T S_B w$, where $S_B = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter matrix. The larger the better.

Within-Class Scatter. The spread of the projected samples within the classes is $\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$, where $S_W = \sum_{i=1}^{2} \sum_{x \in D_i} (x - m_i)(x - m_i)^T$ is the within-class scatter matrix. The smaller the better.

Goal. Define the criterion as the generalized Rayleigh quotient
$J(w) = \frac{w^T S_B w}{w^T S_W w}$.
The length of w is immaterial.

Generalized Eigenvector. To maximize J(w), w is the generalized eigenvector associated with the largest generalized eigenvalue, that is, $S_B w = \lambda S_W w$, or $S_W^{-1} S_B w = \lambda w$. Proof: setting $\nabla_w J = 0$ gives $S_B w = J(w)\, S_W w$. In the two-category case, $S_B w$ always points in the direction of $m_1 - m_2$, so the solution is $w \propto S_W^{-1}(m_1 - m_2)$.

Example. [Figure: two-class data with means $m_1$ and $m_2$ and candidate projection directions w.]

Multiple Discriminant Analysis

Generalization of Fisher's Linear Discriminant. For the c-class problem, we seek a (c - 1)-dimensional projection for efficient discrimination.

Scatter Matrices in the Feature Space. With class means $m_i$, class sample sets $D_i$ of size $n_i$, and overall mean m:
- Within-class scatter matrix: $S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^T$
- Between-class scatter matrix: $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$
- Total scatter matrix: $S_T = \sum_{x} (x - m)(x - m)^T = S_W + S_B$

The (c - 1)-Dimensional Projection. The projection space is described by a $d \times (c - 1)$ matrix W, with projected samples $\tilde{x} = W^T x$ and projected class means $\tilde{m}_i = W^T m_i$.

Scatter Matrices in the Projection Space. The scatter matrices of the projected data are $\tilde{S}_W = W^T S_W W$, $\tilde{S}_B = W^T S_B W$, and $\tilde{S}_T = W^T S_T W$.

Criterion. Maximize
$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$.
The columns of the optimal W are the generalized eigenvectors of $S_B w = \lambda S_W w$ associated with the largest generalized eigenvalues.
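To make the scatter-matrix formulas concrete, the sketch below computes the two-class Fisher direction $w \propto S_W^{-1}(m_1 - m_2)$ and a multi-class (c - 1)-dimensional projection from the leading generalized eigenvectors of $S_B v = \lambda S_W v$. The synthetic data and helper names are invented for this illustration; only the formulas come from the material above.

```python
import numpy as np
from scipy.linalg import eigh                  # generalized symmetric eigensolver

def scatter_matrices(X, y):
    """Within-class and between-class scatter for samples X (rows) with labels y."""
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # sum of (x - m_i)(x - m_i)^T
        Sb += len(Xc) * np.outer(mc - m, mc - m)
    return Sw, Sb

# Three Gaussian classes in 3-D (illustrative data).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 1.0, (50, 3)),
               rng.normal([3, 1, 0], 1.0, (50, 3)),
               rng.normal([0, 4, 1], 1.0, (50, 3))])
y = np.repeat([0, 1, 2], 50)

# Two-category Fisher discriminant: w proportional to Sw^{-1} (m1 - m2).
two = y < 2
Sw2, _ = scatter_matrices(X[two], y[two])
w = np.linalg.solve(Sw2, X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
w /= np.linalg.norm(w)                         # the length of w is immaterial

# Multiple discriminant analysis: c = 3 classes -> (c - 1) = 2 dimensions.
Sw, Sb = scatter_matrices(X, y)
evals, evecs = eigh(Sb, Sw)                    # solves Sb v = lambda Sw v
W = evecs[:, np.argsort(evals)[::-1][:2]]      # top c - 1 generalized eigenvectors
X_proj = X @ W                                 # discriminant features
```

The generalized eigenproblem is solved directly here; an equivalent route is to diagonalize $S_W^{-1} S_B$, which matches the eigenvalue equation stated in the criterion above.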