Multivariate Statistical Methods
Principal Components Analysis (PCA)
By Jen-pei Liu, Ph.D.
Division of Biometry, Department of Agronomy, National Taiwan University
and Division of Biostatistics and Bioinformatics, National Health Research Institutes
2/11/2022
Copyright by Jen-pei Liu, Ph.D.
Principal Components Analysis
- Introduction
- Procedures
- Properties
- Examples
- Summary
Introduction
- Described by K. Pearson (1901)
- Computing methods by Hotelling (1933)
- Objective
  - To transform the original variables $X_1, \dots, X_p$ into index variables $Z_1, \dots, Z_p$
  - $Z_1, \dots, Z_p$ are linear combinations of $X_1, \dots, X_p$
  - $Z_1, \dots, Z_p$ are uncorrelated and ordered by importance
  - To describe the variation in the data
Introduction
- Lack of correlation: the index variables measure different dimensions (domains) of the data
- Lack of correlation: only the variances of the index variables need to be considered, not their covariances
- Ordering: $\mathrm{Var}(Z_1) \ge \mathrm{Var}(Z_2) \ge \dots \ge \mathrm{Var}(Z_p)$
- The index variables $Z_1, \dots, Z_p$ are called the principal components
Introduction
- Most of the variation in the full data set can often be adequately described by a few of the Z index variables
- Reduction of dimension from a two-digit number of variables to just 2 to 4 principal components
- This works best when the original variables are highly correlated
Introduction
Correlations of Female Sparrows

                                     X1      X2      X3      X4      X5
Total length (X1)                 1.000
Alar length (X2)                  0.735   1.000
Length of beak and head (X3)      0.662   0.674   1.000
Length of humerus (X4)            0.645   0.769   0.763   1.000
Length of keel of sternum (X5)    0.605   0.529   0.626   0.607   1.000
Introduction
Variances and coefficients of the principal components

Component  Variance      X1      X2      X3      X4      X5
    1        3.616    0.452   0.462   0.451   0.471   0.398
    2        0.532   -0.051   0.300   0.325   0.185  -0.877
    3        0.386    0.691   0.341  -0.455  -0.411  -0.179
    4        0.302   -0.420   0.548  -0.606   0.388   0.069
    5        0.165    0.374  -0.530  -0.343   0.652  -0.192
Introduction
- $Z_1 = 0.452X_1 + 0.462X_2 + 0.451X_3 + 0.471X_4 + 0.398X_5$
- The variance of $Z_1$ is 3.62, which accounts for 72.3% (3.62/5.00) of the total variation
- All coefficients of $Z_1$ are smaller than 1, and the sum of their squares is equal to 1
- $Z_1$ is in effect a weighted sum (roughly an average) of $X_1$, $X_2$, $X_3$, $X_4$, and $X_5$, with nearly equal weights
- $Z_1$ can be interpreted as an index of the size of the sparrow
Procedures
Data structure

Case    X1     X2     ...    Xp
1       x11    x12    ...    x1p
2       x21    x22    ...    x2p
...     ...    ...    ...    ...
n       xn1    xn2    ...    xnp
Procedures
- The first component
  - The first component is a linear combination of $X_1, X_2, \dots, X_p$:
    $Z_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$
  - $\mathrm{Var}(Z_1)$ is as large as possible subject to the condition that $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$
Procedures
- The second component
  - The second component is also a linear combination of $X_1, X_2, \dots, X_p$:
    $Z_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$
  - $\mathrm{Var}(Z_2)$ is as large as possible subject to the conditions that $a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$ and that $Z_1$ and $Z_2$ are not correlated
  - $\mathrm{Var}(Z_2)$ is the second largest variance
Procedures
- The third component
  - The third component is also a linear combination of $X_1, X_2, \dots, X_p$:
    $Z_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p$
  - $\mathrm{Var}(Z_3)$ is as large as possible subject to the conditions that $a_{31}^2 + a_{32}^2 + \dots + a_{3p}^2 = 1$ and that $Z_1$, $Z_2$, and $Z_3$ are not correlated
  - $\mathrm{Var}(Z_3)$ is the third largest variance
Procedures
- Continue until all p principal components are computed
- The components are obtained from the covariance matrix of the p variables
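The formulas that accompanied this slide did not survive in the transcript. As a hedged sketch in standard notation (the symbols $C$, $c_{jk}$, $\bar{x}_j$, and the vector form $a_i$ are assumptions chosen to agree with the later Properties slide, not a recovery of the original), the link between the constrained maximization and the covariance matrix can be written as:

```latex
\begin{align*}
C &= (c_{jk})_{p \times p}, \qquad
c_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \\
\mathrm{Var}(Z_i) &= a_i' C a_i, \qquad a_i = (a_{i1}, \dots, a_{ip})' \\
\max_{a_i' a_i = 1} a_i' C a_i
&\;\Longrightarrow\; C a_i = \lambda_i a_i, \qquad \mathrm{Var}(Z_i) = \lambda_i
\end{align*}
```

Under these assumptions the coefficient vectors are eigenvectors of $C$ and the component variances are the corresponding eigenvalues, which is what the steps on the following slides compute.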
Procedures
- Different variables may have different units and magnitudes
- PCA can be influenced by these units and magnitudes
- Standardize each variable to have zero mean and unit variance
- The covariance matrix of the standardized variables is the correlation matrix of the original variables
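The last point can be checked numerically. The following is a minimal sketch, assuming NumPy as the tool; the data matrix is made up purely for illustration and does not come from the slides.

```python
import numpy as np

# Hypothetical data: 6 cases, 3 variables on very different scales.
X = np.array([
    [1.0, 200.0, 0.10],
    [2.0, 180.0, 0.40],
    [3.0, 260.0, 0.35],
    [4.0, 240.0, 0.80],
    [5.0, 300.0, 0.70],
    [6.0, 310.0, 0.95],
])

# Standardize each column to zero mean and unit (sample) variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardized data versus correlation of the raw data.
cov_of_standardized = np.cov(Z, rowvar=False)
corr_of_original = np.corrcoef(X, rowvar=False)
same = np.allclose(cov_of_standardized, corr_of_original)
print(same)
```

The comparison holds regardless of the units chosen for each column, which is why standardizing removes the influence of scale on the components.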
Procedures
- Steps of PCA
  - Standardize the variables $X_1, X_2, \dots, X_p$ to have zero means and unit variances, unless the importance of the variables is reflected in their variances
  - Calculate the covariance matrix (this is the correlation matrix if the variables have been standardized)
Procedures
- Steps of PCA (continued)
  - Find the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$ and their corresponding eigenvectors $a_1, a_2, \dots, a_p$
  - The coefficients of the ith principal component $Z_i$ are the elements of $a_i$, and $\lambda_i$ is the variance of $Z_i$
  - Discard any components that account for only a small proportion of the variation in the data
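The steps above can be sketched in a few lines of code. This is a minimal illustration assuming NumPy, not the software used for the lecture; the function name `pca` and the random demonstration data are hypothetical.

```python
import numpy as np

def pca(X, standardize=True):
    """Return (eigenvalues, coefficient matrix A, scores) for an n x p data matrix.

    Row i of A holds the coefficients of the i-th principal component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each variable
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)          # unit variance -> correlation matrix
    C = np.cov(Xc, rowvar=False)                 # covariance (or correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
    eigvals = eigvals[order]
    A = eigvecs[:, order].T
    scores = Xc @ A.T                            # values of Z_1, ..., Z_p for each case
    return eigvals, A, scores

# Made-up correlated data, for demonstration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
lam, A, Z = pca(X_demo)
explained = lam / lam.sum()                      # proportion of variation per component
print(np.round(explained, 3))
```

Components whose entry in `explained` is small are the candidates for discarding in the final step.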
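Properties
- $E(Z) = A\mu$
- $V(Z) = A\Sigma A' = \mathrm{diag}\{\lambda_i,\ i = 1, \dots, p\}$
- $\mathrm{Cov}(Z_i, X_j) = a_{ij}\lambda_i$
- $\mathrm{Corr}(Z_i, X_j) = a_{ij}\sqrt{\lambda_i}/\sqrt{c_{jj}}$, where $c_{jj}$ is the variance of $X_j$
- $\mathrm{Corr}(Z_i, X_j) = a_{ij}\sqrt{\lambda_i}$ if the correlation matrix is used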
Examples
- Determination of the number of principal components
  - Depends upon the needs of the practitioner
  - The proportion of the total variation explained by the selected principal components should be high, e.g., at least 80%
  - If the correlation matrix is used, select the principal components with variance greater than 1, because each of them accounts for more variation than a single original variable (whose variance is 1)
  - Use a scree plot
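As an illustration of the two numerical rules, the sketch below applies them to the component variances of the female sparrow example from the Introduction. NumPy and the variable names are assumptions made for this example; the 80% target is the threshold mentioned on the slide.

```python
import numpy as np

# Component variances (eigenvalues) from the female sparrow example.
eigenvalues = np.array([3.616, 0.532, 0.386, 0.302, 0.165])

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Rule 1: smallest number of components whose cumulative proportion reaches the target.
target = 0.80
k_cumulative = int(np.searchsorted(cumulative, target) + 1)

# Rule 2: with a correlation matrix, keep components whose variance exceeds 1.
k_variance_rule = int(np.sum(eigenvalues > 1.0))

print(np.round(cumulative, 3), k_cumulative, k_variance_rule)
```

The two rules need not agree: here the 80% rule keeps two components while the variance-greater-than-1 rule keeps only the first, which is one reason the choice also depends on the needs of the practitioner.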
Examples
- Evaluation of a statistics course
  - 16 students rated 11 items (variables)
  - Evaluation scale: 1 (poor or not at all) to 5 (excellent, strongly, or difficult)
  - The first two principal components explain 76.0% of the total variation, and the last four principal components explain only 2.2%
Examples
Test scores of 10 students in 4 subjects

Student   Chinese (X1)   English (X2)   Math (X3)   Social (X4)
   1           85             76            60           85
   2           90             95            80           72
   3           60             45            38           80
   4           70             65            60           76
   5           68             56            70           70
   6           77             80            65           68
   7           50             30            40           80
   8           80             70            60           66
   9           85             75            65           84
  10           55             60            40           50

Source: Shen (1998)
Examples
Correlation matrix

          X1        X2        X3        X4
X1         1    0.8846    0.8375    0.2784
X2                   1    0.8059   -0.1101
X3                             1    0.1118
X4                                       1
Examples
Eigenvalues and eigenvectors

                                            Eigenvector
Eigenvalue    Prop.    Cum. Prop.      X1        X2        X3        X4
  2.70159    0.6754      0.6754     0.5897    0.5682    0.5657    0.0969
  1.06380    0.2660      0.9414     0.1254   -0.2651   -0.0281    0.9556
  0.19870    0.0497      0.9910     0.3592    0.4378   -0.8227    0.0501
  0.03591    0.0090      1.0000    -0.7124    0.6444    0.0485    0.2737
Examples
- Because the first two principal components account for 94.14% of the total variation, we can use just these two principal components
- The first principal component can be interpreted as an index for the sum of Chinese, English, and math
- The second principal component can be thought of as social science
Examples
- The above results can also be obtained by inspecting the correlation matrix
- The correlations among Chinese, English, and math exceed 0.8
- The correlations of Chinese, English, and math with social science are below 0.3
Examples
Correlations between the first principal component and the original variables
- $\mathrm{Corr}(Z_1, X_1) = a_{11}\sqrt{\lambda_1} = 0.5897\sqrt{2.70159} = 0.9692$
- $\mathrm{Corr}(Z_1, X_2) = a_{12}\sqrt{\lambda_1} = 0.5682\sqrt{2.70159} = 0.9339$
- $\mathrm{Corr}(Z_1, X_3) = a_{13}\sqrt{\lambda_1} = 0.5657\sqrt{2.70159} = 0.9298$
- $\mathrm{Corr}(Z_1, X_4) = a_{14}\sqrt{\lambda_1} = 0.0969\sqrt{2.70159} = 0.1592$
Examples
Correlations between the second principal component and the original variables
- $\mathrm{Corr}(Z_2, X_1) = a_{21}\sqrt{\lambda_2} = 0.1254\sqrt{1.0638} = 0.1294$
- $\mathrm{Corr}(Z_2, X_2) = a_{22}\sqrt{\lambda_2} = -0.2651\sqrt{1.0638} = -0.2734$
- $\mathrm{Corr}(Z_2, X_3) = a_{23}\sqrt{\lambda_2} = -0.0281\sqrt{1.0638} = -0.0290$
- $\mathrm{Corr}(Z_2, X_4) = a_{24}\sqrt{\lambda_2} = 0.9556\sqrt{1.0638} = 0.9856$
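These correlations can be reproduced from the correlation matrix of the test-score example. The sketch below assumes NumPy and starts from the correlation matrix as printed on the slide (four decimals), so small rounding differences are expected. The sign convention applied to the eigenvectors is a choice made here for comparability, since eigenvector signs are arbitrary.

```python
import numpy as np

# Correlation matrix of the test scores (Chinese, English, math, social science).
R = np.array([
    [1.0000,  0.8846, 0.8375,  0.2784],
    [0.8846,  1.0000, 0.8059, -0.1101],
    [0.8375,  0.8059, 1.0000,  0.1118],
    [0.2784, -0.1101, 0.1118,  1.0000],
])

eigvals, eigvecs = np.linalg.eigh(R)     # ascending eigenvalues for a symmetric matrix
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                     # eigenvalues, largest first
A = eigvecs[:, order].T                  # row i = coefficients of Z_i

# Flip each eigenvector so that its largest-magnitude coefficient is positive.
for i in range(A.shape[0]):
    if A[i, np.argmax(np.abs(A[i]))] < 0:
        A[i] = -A[i]

# Corr(Z_i, X_j) = a_ij * sqrt(lambda_i) when the correlation matrix is used.
loadings = A * np.sqrt(lam)[:, None]
print(np.round(lam, 4))
print(np.round(loadings[:2], 4))
```

The first row of `loadings` is large for the first three subjects and small for social science, while the second row is dominated by social science, matching the interpretation of the two components.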
Examples
Principal component scores of the 10 students

Student   1st Component   2nd Component
   1          0.91883         1.12685
   2          2.58868        -0.41488
   3         -1.85920         0.84509
   4          0.03527         0.23932
   5          0.01741        -0.21745
   6          0.92643        -0.65337
   7         -2.67248         0.96553
   8          0.52758        -0.65459
   9          1.32646         0.92471
  10         -1.80897        -0.16121
Examples
Female sparrows
- The first principal component:
  $Z_1 = 0.452X_1 + 0.462X_2 + 0.451X_3 + 0.471X_4 + 0.398X_5$
  - An index of bird size
- The second principal component:
  $Z_2 = -0.051X_1 + 0.300X_2 + 0.325X_3 + 0.185X_4 - 0.877X_5$
  - An index of bird shape
Examples
- The value of the first principal component for the first bird (using its standardized measurements):
  $Z_1 = 0.452(-0.542) + 0.462(0.725) + 0.451(0.177) + 0.471(0.055) + 0.398(-0.33) = 0.064$
- The value of the second principal component for the first bird:
  $Z_2 = -0.051(-0.542) + 0.300(0.725) + 0.325(0.177) + 0.185(0.055) + (-0.877)(-0.33) = 0.602$
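The same arithmetic can be written as two dot products. This is a small sketch assuming NumPy; the variable names are chosen here for illustration, and the numbers are those shown on the slide.

```python
import numpy as np

# Coefficients of the first two principal components (female sparrows).
a1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])
a2 = np.array([-0.051, 0.300, 0.325, 0.185, -0.877])

# Measurements X1..X5 of the first bird, as used on the slide.
x_bird1 = np.array([-0.542, 0.725, 0.177, 0.055, -0.33])

z1 = float(a1 @ x_bird1)   # score on the first component
z2 = float(a2 @ x_bird1)   # score on the second component
print(round(z1, 3), round(z2, 3))
```

Applying the same two dot products to every bird gives the full set of component scores.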
Examples
Means and standard deviations of the principal components for survivors and nonsurvivors

                     Mean                  Standard Deviation
Component   Survivor   Nonsurvivor      Survivor   Nonsurvivor
    1        -0.100       0.075           1.506       2.176
    2         0.004      -0.003           0.684       0.776
    3        -0.140       0.105           0.522       0.677
    4         0.073      -0.055           0.563       0.543
    5         0.023      -0.017           0.411       0.408
Examples
Employment in European countries: correlation matrix

         AGR      MIN      MAN      PS       CON      SER      FIN      SPC      TC
AGR    1.000
MIN    0.316    1.000
MAN   -0.254   -0.672    1.000
PS    -0.382   -0.387    0.388    1.000
CON   -0.349   -0.129   -0.034    0.165    1.000
SER   -0.605   -0.407   -0.033    0.155    0.473    1.000
FIN   -0.176   -0.248   -0.274    0.094   -0.018    0.379    1.000
SPC   -0.811   -0.316    0.050    0.238    0.072    0.388    0.166    1.000
TC    -0.487    0.045    0.243    0.105   -0.055   -0.085   -0.391    0.475    1.000
Examples
- 9 eigenvalues: 3.112 (34.6%), 1.809 (20.1%), 1.496 (16.6%), 1.063 (11.8%), 0.710 (7.9%), 0.311 (3.5%), 0.293 (3.3%), 0.204 (2.4%), and 0 (0.0%)
- The employment percentages of each country sum to 1 (100%)
- The columns of the correlation matrix are therefore linearly dependent
- As a result, the last eigenvalue is 0
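The zero eigenvalue is a general consequence of the sum constraint, which can be illustrated with simulated data. The sketch below assumes NumPy; the 30 rows of shares are randomly generated stand-ins, not the European employment data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 30 "countries", 9 sector shares that sum to 100% in each row.
shares = rng.dirichlet(np.ones(9), size=30) * 100.0

R = np.corrcoef(shares, rowvar=False)            # 9 x 9 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, largest first
smallest = eigvals[-1]
print(np.round(eigvals, 4))
```

Because every row adds up to the same total, one variable is an exact linear function of the other eight, so the smallest eigenvalue is zero up to floating-point error while the eigenvalues still sum to 9.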
Examples
- Selecting the principal components with eigenvalues greater than 1 gives the first 4 principal components, which explain about 83% of the total variation in the data (34.6% + 20.1% + 16.6% + 11.8%)
- The first two principal components alone account for only 55% of the total variation
Examples
- The first principal component:
  $Z_1 = 0.51(\mathrm{AGR}) + 0.37(\mathrm{MIN}) - 0.25(\mathrm{MAN}) - 0.31(\mathrm{PS}) - 0.22(\mathrm{CON}) - 0.38(\mathrm{SER}) - 0.13(\mathrm{FIN}) - 0.42(\mathrm{SPS}) - 0.21(\mathrm{TC})$
  - A contrast between AGR (agriculture, forestry, and fishing) and MIN (mining and quarrying) versus the other sectors
Examples
- The second principal component:
  $Z_2 = -0.02(\mathrm{AGR}) + 0.00(\mathrm{MIN}) + 0.43(\mathrm{MAN}) + 0.11(\mathrm{PS}) - 0.24(\mathrm{CON}) - 0.41(\mathrm{SER}) - 0.55(\mathrm{FIN}) + 0.05(\mathrm{SPS}) + 0.52(\mathrm{TC})$
  - A contrast between MAN (manufacturing) and TC (transport and communication) versus CON (construction), SER (service industry), and FIN (finance)
Summary
- Each principal component is a linear combination of the original variables
- PCA tries to reduce a large number of variables to a few index variables
- The index variables are uncorrelated and ordered by the magnitude of their variation
- Illustrated with real examples