Applied Multivariate Quantitative Methods: Principal Components Analysis (PCA)


By Jen-pei Liu, Ph.D., Division of Biometry, Department of Agronomy, National Taiwan University, and Wei-Chi Chie, MD, Ph.D., Department of Public Health, National Taiwan University
12/31/2021. Copyright by Jen-pei Liu, Ph.D. and Wei-Chi Chie, MD, Ph.D.

Principal Components Analysis
  • Introduction
  • Procedures
  • Properties
  • Examples
  • Summary

Introduction
  • Described by K. Pearson (1901)
  • Computing methods by Hotelling (1933)
  • Objective: to transform the original variables X1, …, Xp into index variables Z1, …, Zp
      • Z1, …, Zp are linear combinations of X1, …, Xp
      • Z1, …, Zp are uncorrelated and are ordered by importance
      • Together they describe the variation in the data

Introduction
  • Lack of correlation: the index variables measure different dimensions (domains) of the data
  • Lack of correlation: only the variances of the index variables need to be considered; their covariances do not
  • Ordering: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
  • The Z index variables are called the principal components

Introduction
  • Most of the variation in the full data set can be adequately described by the first few Z index variables
  • Reduction of dimension from a two-digit number of variables to just 2 to 4 principal components
  • This works best when the original variables are highly correlated


Introduction: Correlations of Female Sparrows

                                    X1      X2      X3      X4      X5
Total length (X1)                 1.000
Alar length (X2)                  0.735   1.000
Length of beak and head (X3)      0.662   0.674   1.000
Length of humerus (X4)            0.645   0.769   0.763   1.000
Length of keel of sternum (X5)    0.605   0.529   0.626   0.607   1.000

Introduction: Component Variances and Coefficients

                           Coefficients for components
Component   Variance      X1       X2       X3       X4       X5
1           3.616        0.452    0.462    0.451    0.471    0.398
2           0.532       -0.051    0.300    0.325    0.185   -0.877
3           0.386        0.691    0.341   -0.455   -0.411   -0.179
4           0.302       -0.420    0.548   -0.606    0.388    0.069
5           0.165        0.374   -0.530   -0.343    0.652   -0.192

Introduction
  • Z1 = 0.452 X1 + 0.462 X2 + 0.451 X3 + 0.471 X4 + 0.398 X5
  • The variance of Z1 is 3.62
  • Z1 accounts for 72.3% (3.616/5.00) of the total variation
  • All coefficients of Z1 are smaller than 1, and the sum of squares of these coefficients is equal to 1
  • Because the coefficients are nearly equal, Z1 is in effect an average (or sum) of X1, X2, X3, X4, and X5
  • Z1 can be interpreted as an index of the size of the sparrow
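The arithmetic on this slide can be checked directly. The following is a minimal sketch in Python with NumPy, using only the component variances and Z1 coefficients quoted above; the variable names are illustrative and not part of the original slides.

```python
import numpy as np

# Component variances (eigenvalues) and Z1 coefficients as quoted on the slides.
variances = np.array([3.616, 0.532, 0.386, 0.302, 0.165])
a1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])

# With five standardized variables the total variation is 5, up to rounding.
total_variation = variances.sum()
share_z1 = variances[0] / total_variation

# The coefficient vector is scaled so that its squared entries sum to 1.
sum_of_squares = float(a1 @ a1)

print(f"total variation = {total_variation:.3f}")
print(f"share of Z1     = {share_z1:.1%}")
print(f"sum of squares  = {sum_of_squares:.3f}")
```

Small departures from exactly 5 and exactly 1 come from the three-decimal rounding of the published values.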

Procedures: Data Structure

Case    X1     X2     …    Xp
1       x11    x12    …    x1p
2       x21    x22    …    x2p
⋮
n       xn1    xn2    …    xnp

Procedures: The First Component
  • The first component is a linear combination of X1, X2, …, Xp:
    Z1 = a11 X1 + a12 X2 + … + a1p Xp
  • Var(Z1) is as large as possible subject to the condition that
    a11² + a12² + … + a1p² = 1


Procedures: The Second Component
  • The second component is also a linear combination of X1, X2, …, and Xp:
    Z2 = a21 X1 + a22 X2 + … + a2p Xp
  • Var(Z2) is as large as possible subject to the conditions that
    a21² + a22² + … + a2p² = 1 and that Z1 and Z2 are not correlated
  • Var(Z2) is therefore the second largest variance

Procedures: The Third Component
  • The third component is also a linear combination of X1, X2, …, and Xp:
    Z3 = a31 X1 + a32 X2 + … + a3p Xp
  • Var(Z3) is as large as possible subject to the conditions that
    a31² + a32² + … + a3p² = 1 and that Z1, Z2, and Z3 are not correlated
  • Var(Z3) is therefore the third largest variance

Procedures
  • Continue until all p principal components are computed
  • The computation uses the covariance matrix C of the p variables, whose diagonal element cjj is the variance of Xj and whose off-diagonal element cjk is the covariance of Xj and Xk
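The constrained maximization that defines the first component can be illustrated numerically. The sketch below assumes a small hypothetical covariance matrix C (not taken from the slides) and uses the standard result that the unit-length vector maximizing a'Ca is the eigenvector of C with the largest eigenvalue. It compares that eigenvector with many random unit vectors.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix, used only for illustration.
C = np.array([
    [4.0, 2.0, 0.5],
    [2.0, 3.0, 1.0],
    [0.5, 1.0, 2.0],
])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)
lambda_max = eigenvalues[-1]
a1 = eigenvectors[:, -1]          # unit-length coefficients of Z1

# Var(a'X) = a'Ca. Evaluate it for 1000 random unit-length vectors.
rng = np.random.default_rng(0)
trial = rng.normal(size=(1000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_variances = np.einsum("ij,jk,ik->i", trial, C, trial)

print(f"Var(Z1) from the top eigenvector : {a1 @ C @ a1:.4f}")
print(f"best variance among random trials: {trial_variances.max():.4f}")
```

No random unit vector exceeds the variance attained by the top eigenvector, which is the sense in which Var(Z1) is "as large as possible".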


Procedures
  • Different variables might have different units and magnitudes
  • PCA might be influenced by these magnitudes and units
  • Standardization: transform each variable to have zero mean and unit variance
  • The covariance matrix of the standardized variables is the correlation matrix
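The last point can be verified numerically. This is a minimal sketch on hypothetical random data (the data and the names are assumptions for illustration only): after standardizing each column, the covariance matrix of the result equals the correlation matrix of the original variables.

```python
import numpy as np

# Hypothetical data: 50 cases, 3 variables on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
X += np.array([5.0, 50.0, 500.0])
X[:, 1] += 4.0 * X[:, 0]          # induce some correlation between columns

# Standardize each column to zero mean and unit sample variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(Z, rowvar=False)        # sample covariance (ddof=1)
correlation_of_original = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, correlation_of_original))
```

The same `ddof=1` convention is used for the standard deviation and for `np.cov`, so the two matrices agree up to floating-point error.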

Procedures: Steps of PCA
  • Standardize the variables X1, X2, …, Xp to have zero means and unit variances, unless the importance of the variables is reflected in their variances
  • Calculate the covariance matrix (which is the correlation matrix when the variables are standardized)

Procedures: Steps of PCA (continued)
  • Find the eigenvalues λ1, λ2, …, λp and their corresponding eigenvectors a1, a2, …, ap
  • The coefficients of the ith principal component Zi are the elements of ai, and λi is the variance of Zi
  • Discard any components that account for only a small proportion of the variation in the data
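The steps listed for PCA can be sketched as a short function. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `pca`, the synthetic data, and the 80% cut-off used to decide how many components to keep are all hypothetical choices.

```python
import numpy as np

def pca(X, standardize=True):
    """Return eigenvalues, coefficient matrix A (row i = a_i), and scores."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if standardize:
        # Step 1: zero mean and unit variance for every variable.
        centered = centered / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix (the correlation matrix if standardized).
    C = np.cov(centered, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted from largest to smallest.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    A = eigenvectors[:, order].T
    # Step 4: Z_i = a_i1 X_1 + ... + a_ip X_p for every case.
    scores = centered @ A.T
    return eigenvalues, A, scores

# Hypothetical data: three variables share one latent factor, one is noise.
rng = np.random.default_rng(2)
latent = rng.normal(size=(100, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.0]]) + 0.5 * rng.normal(size=(100, 4))

eigenvalues, A, scores = pca(X)
explained = eigenvalues / eigenvalues.sum()
# Step 5: keep the leading components that reach 80% of the variation.
keep = int(np.searchsorted(np.cumsum(explained), 0.80) + 1)
print(np.round(explained, 3), keep)
```

The scores returned by the function are uncorrelated and their sample variances equal the eigenvalues, which matches the properties claimed for Z1, …, Zp.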


Properties
  • E(Z) = Aμ
  • V(Z) = AΣA' = diag{λi, i = 1, …, p}
  • Cov(Zi, Xj) = aij λi
  • Corr(Zi, Xj) = aij √λi / √cjj
  • Corr(Zi, Xj) = aij √λi, if the correlation matrix is used
Here A is the matrix whose ith row is the eigenvector ai, μ and Σ are the mean vector and covariance matrix of X, and cjj is the variance of Xj.

Examples: Determination of the Number of Principal Components
  • Depends upon the needs of practitioners
  • The proportion of the total variation explained by the selected principal components should be high, e.g., at least 80%
  • If the correlation matrix is used, select the principal components with variance greater than 1, because each of them accounts for more variation than a single original (standardized) variable, whose variance is 1
  • Use a scree plot
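Two of these rules are easy to express in code. The sketch below applies them to the five component variances quoted earlier for the sparrow data; the function names are hypothetical, and the eigenvalues are assumed to be sorted from largest to smallest.

```python
import numpy as np

def n_components_by_proportion(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose cumulative share of the
    total variation reaches the threshold (threshold assumed below 1)."""
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.argmax(cumulative >= threshold) + 1)

def n_components_by_unit_variance(eigenvalues):
    """Number of components with variance greater than 1; meaningful only
    when the correlation matrix was analysed."""
    return int(np.sum(np.asarray(eigenvalues) > 1.0))

# Component variances of the sparrow example, as quoted on the slides.
sparrow = np.array([3.616, 0.532, 0.386, 0.302, 0.165])

print(np.round(np.cumsum(sparrow) / sparrow.sum(), 3))
print(n_components_by_proportion(sparrow))
print(n_components_by_unit_variance(sparrow))
```

For these values the two rules disagree (two components versus one), which illustrates why the choice ultimately depends on the needs of the practitioner.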

Examples: Evaluation of a Statistics Course
  • 16 students rated 11 items (variables)
  • Evaluation scales: 1 (poor or not at all) to 5 (excellent, strongly, or difficult)
  • The first two principal components explain 76.0% of the total variation, and the last four principal components explain only 2.2%


Examples: Test Scores of 10 Students in 4 Subjects

Student   Chinese (X1)   English (X2)   Math (X3)   Social (X4)
1             85             76            60           85
2             90             95            80           72
3             60             45            38           80
4             70             65            60           76
5             68             56            70           70
6             77             80            65           68
7             50             30            40           80
8             80             70            60           66
9             85             75            65           84
10            55             60            40           50

Source: Shen (1998)

Examples: Correlation Matrix

         X1        X2        X3        X4
X1       1       0.8846    0.8375    0.2784
X2                 1       0.8059   -0.1101
X3                           1       0.1118
X4                                     1

Examples: Eigenvalues and Eigenvectors

                                                       Eigenvector
Component   Eigenvalue   Prop.    Cum. Prop.      X1        X2        X3        X4
1            2.70159     0.6754    0.6754       0.5897    0.5682    0.5657    0.0969
2            1.06380     0.2660    0.9414       0.1254   -0.2651   -0.0281    0.9556
3            0.19870     0.0497    0.9910       0.3592    0.4378   -0.8227    0.0501
4            0.03591     0.0090    1.0000      -0.7124    0.6444    0.0485    0.2737
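The eigenvalues and the first eigenvector can be recomputed from the correlation matrix of the four subjects. This is a minimal sketch with NumPy; the sign rule (largest entry of each eigenvector made positive) is an assumption added here, because eigenvectors are only determined up to sign.

```python
import numpy as np

# Correlation matrix of Chinese, English, Math, Social as given on the slides.
R = np.array([
    [1.0000,  0.8846, 0.8375,  0.2784],
    [0.8846,  1.0000, 0.8059, -0.1101],
    [0.8375,  0.8059, 1.0000,  0.1118],
    [0.2784, -0.1101, 0.1118,  1.0000],
])

eigenvalues, eigenvectors = np.linalg.eigh(R)   # ascending order
order = np.argsort(eigenvalues)[::-1]           # largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # column k belongs to component k+1

# Fix the arbitrary sign: make the largest-magnitude entry positive.
for k in range(eigenvectors.shape[1]):
    j = np.argmax(np.abs(eigenvectors[:, k]))
    if eigenvectors[j, k] < 0:
        eigenvectors[:, k] *= -1

proportion = eigenvalues / eigenvalues.sum()
print(np.round(eigenvalues, 4))
print(np.round(np.cumsum(proportion), 4))
print(np.round(eigenvectors[:, 0], 4))
```

Because the correlations are rounded to four decimals, the recomputed values should agree with the listed eigenvalues only to about three decimals.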

Examples
  • Because the first two principal components account for 94.14% of the total variation, we can use just these two principal components
  • The first principal component can be interpreted as an index for the sum of Chinese, English, and math
  • The second principal component can be thought of as social science

Examples
  • The above results can also be obtained by inspecting the correlation matrix
  • Correlations among Chinese, English, and math exceed 0.8
  • Correlations of Chinese, English, and math with social science are below 0.3

Examples: Correlations between the first principal component and the original variables
  Corr(Z1, X1) = a11 √λ1 = 0.5897 × √2.70159 = 0.9692
  Corr(Z1, X2) = a12 √λ1 = 0.5682 × √2.70159 = 0.9339
  Corr(Z1, X3) = a13 √λ1 = 0.5657 × √2.70159 = 0.9298
  Corr(Z1, X4) = a14 √λ1 = 0.0969 × √2.70159 = 0.1592

Examples: Correlations between the second principal component and the original variables
  Corr(Z2, X1) = a21 √λ2 =  0.1254 × √1.0638 =  0.1294
  Corr(Z2, X2) = a22 √λ2 = -0.2651 × √1.0638 = -0.2734
  Corr(Z2, X3) = a23 √λ2 = -0.0281 × √1.0638 = -0.0290
  Corr(Z2, X4) = a24 √λ2 =  0.9556 × √1.0638 =  0.9856
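These correlations follow from Corr(Zi, Xj) = aij √λi when the correlation matrix is analysed. A minimal sketch that recomputes them from the eigenvalues and coefficients quoted on the slides (the array names are illustrative):

```python
import numpy as np

# Eigenvalues and coefficients of the first two components, from the slides.
eigenvalues = np.array([2.70159, 1.06380])
A = np.array([
    [0.5897,  0.5682,  0.5657, 0.0969],   # a_1j
    [0.1254, -0.2651, -0.0281, 0.9556],   # a_2j
])

# Row i holds Corr(Z_i, X_j) for j = 1..4: each row of A times sqrt(lambda_i).
component_variable_corr = A * np.sqrt(eigenvalues)[:, None]

for i, row in enumerate(component_variable_corr, start=1):
    print(f"Z{i}:", np.round(row, 4))
```

Differences in the fourth decimal from the values on the slides are expected, since the inputs here are already rounded.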

Examples: Principal Component Scores

Student   1st Component   2nd Component
1            0.91883         1.12685
2            2.58868        -0.41488
3           -1.85920         0.84509
4            0.03527         0.23932
5            0.01741        -0.21745
6            0.92643        -0.65337
7           -2.67248         0.96553
8            0.52758        -0.65459
9            1.32646         0.92471
10          -1.80897        -0.16121

Examples: Correlations of Female Sparrows

                                    X1      X2      X3      X4      X5
Total length (X1)                 1.000
Alar length (X2)                  0.735   1.000
Length of beak and head (X3)      0.662   0.674   1.000
Length of humerus (X4)            0.645   0.769   0.763   1.000
Length of keel of sternum (X5)    0.605   0.529   0.626   0.607   1.000

Examples: Component Variances and Coefficients

                           Coefficients for components
Component   Variance      X1       X2       X3       X4       X5
1           3.616        0.452    0.462    0.451    0.471    0.398
2           0.532       -0.051    0.300    0.325    0.185   -0.877
3           0.386        0.691    0.341   -0.455   -0.411   -0.179
4           0.302       -0.420    0.548   -0.606    0.388    0.069
5           0.165        0.374   -0.530   -0.343    0.652   -0.192

Examples
  • The first principal component
    Z1 = 0.452 X1 + 0.462 X2 + 0.451 X3 + 0.471 X4 + 0.398 X5
      • An index of bird size
  • The second principal component
    Z2 = -0.051 X1 + 0.300 X2 + 0.325 X3 + 0.185 X4 - 0.877 X5
      • An index of bird shape

Examples
  • The value of the first principal component for the first bird
    Z1 = 0.452(-0.542) + 0.462(0.725) + 0.451(0.177) + 0.471(0.055) + 0.398(-0.33) = 0.064
  • The value of the second principal component for the first bird
    Z2 = -0.051(-0.542) + 0.300(0.725) + 0.325(0.177) + 0.185(0.055) + (-0.877)(-0.33) = 0.602
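Each score is simply the dot product of a coefficient vector with the bird's measurements. A minimal sketch that reproduces both values, using the coefficients and the first bird's values exactly as they appear above (the array names are illustrative):

```python
import numpy as np

# Coefficients of the first two components for X1..X5 (from the slides).
A = np.array([
    [ 0.452, 0.462, 0.451, 0.471,  0.398],
    [-0.051, 0.300, 0.325, 0.185, -0.877],
])

# Measurements of the first bird as used in the worked calculation.
first_bird = np.array([-0.542, 0.725, 0.177, 0.055, -0.33])

z = A @ first_bird   # z[0] is Z1, z[1] is Z2
print(np.round(z, 3))
```

Applying `A` to a matrix with one row per bird (`X @ A.T`) would give the scores of all birds at once.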

Examples: Means and Standard Deviations of the Components

                      Mean                   Standard Deviation
Component   Survivor   Nonsurvivor       Survivor   Nonsurvivor
1            -0.100       0.075            1.506       2.176
2             0.004      -0.003            0.684       0.776
3            -0.140       0.105            0.522       0.677
4             0.073      -0.055            0.563       0.543
5             0.023      -0.017            0.411       0.408

Examples: Employment in European Countries (Correlation Matrix)

         AGR      MIN      MAN      PS       CON      SER      FIN      SPC      TC
AGR     1.000
MIN     0.316    1.000
MAN    -0.254   -0.672    1.000
PS     -0.382   -0.387    0.388    1.000
CON    -0.349   -0.129   -0.034    0.165    1.000
SER    -0.605   -0.407   -0.033    0.155    0.473    1.000
FIN    -0.176   -0.248   -0.274    0.094   -0.018    0.379    1.000
SPC    -0.811   -0.316    0.050    0.238    0.072    0.388    0.166    1.000
TC     -0.487    0.045    0.243    0.105   -0.055   -0.085   -0.391    0.475    1.000

Examples
  • The 9 eigenvalues: 3.112 (34.6%), 1.809 (20.1%), 1.496 (16.6%), 1.063 (11.8%), 0.710 (7.9%), 0.311 (3.5%), 0.293 (3.3%), 0.204 (2.3%), and 0 (0.0%)
  • For each country the employment percentages across the nine sectors sum to 100% (a proportion of 1)
  • The columns of the correlation matrix are therefore linearly dependent
  • As a result, the last eigenvalue is 0
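The link between the sum constraint and the zero eigenvalue can be demonstrated on synthetic data. The sketch below does not use the employment data; it assumes 30 hypothetical "countries" whose nine sector shares are drawn at random and forced to sum to 100%.

```python
import numpy as np

# Hypothetical compositional data: each row has 9 shares summing to 100.
rng = np.random.default_rng(3)
shares = rng.dirichlet(alpha=np.ones(9), size=30) * 100.0

# Correlation matrix of the nine columns and its eigenvalues, largest first.
R = np.corrcoef(shares, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

print(np.round(shares.sum(axis=1)[:3], 6))
print(np.round(eigenvalues, 4))
```

Because every row sums to the same constant, one column is an exact linear function of the other eight, so the correlation matrix has rank 8 and its smallest eigenvalue is zero up to floating-point error.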

Examples
  • Selecting the principal components with eigenvalues greater than 1 gives the first 4 principal components, which explain 83% of the total variation in the data
  • If we take only the first two principal components, they account for only 55% of the total variation

Examples
  • The first principal component
    Z1 = 0.51(AGR) + 0.37(MIN) - 0.25(MAN) - 0.31(PS) - 0.22(CON) - 0.38(SER) - 0.13(FIN) - 0.42(SPS) - 0.21(TC)
  • A contrast between AGR (agriculture, forestry, and fishing) and MIN (mining and quarrying) versus the other sectors

Examples
  • The second principal component
    Z2 = -0.02(AGR) + 0.00(MIN) + 0.43(MAN) + 0.11(PS) - 0.24(CON) - 0.41(SER) - 0.55(FIN) + 0.05(SPS) + 0.52(TC)
  • A contrast between MAN (manufacturing) and TC (transport and communication) versus CON (construction), SER (service industry), and FIN (finance)


Summary
  • Each principal component is a linear combination of the original variables
  • PCA tries to reduce a large number of variables to a few index variables
  • The index variables are not correlated and are ordered by the magnitude of their variation
  • Illustration with real examples