Dimension Reduction: PCA and Clustering
by Agnieszka S. Juncker
Part of the slides adapted from Chris Workman
The DNA Array Analysis Pipeline
Question → Experimental Design → Array design (probe design) or Buy Chip/Array → Sample Preparation → Hybridization → Image analysis → Normalization → Expression Index Calculation → Comparable Gene Expression Data → Statistical Analysis / Fit to Model (time series) → Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network
Motivation: Multidimensional data
[Table: expression values for 36 probe sets (209619_at, 32541_at, 206398_s_at, …) measured in 8 patients (Pat 1 – Pat 8) — each patient is a point in a 36-dimensional gene-expression space]
Dimension reduction methods
• Principal component analysis (PCA)
• Cluster analysis
• Multidimensional scaling
• Correspondence analysis
• Singular value decomposition
Principal Component Analysis (PCA) • used for visualization of complex data • developed to capture as much of the variation in data as possible
Principal components
• 1st principal component (PC1) – the direction along which there is the greatest variation
• 2nd principal component (PC2) – the direction with the maximum remaining variation, orthogonal to PC1
Principal components
• In general, principal components are:
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capturing as much of the original variance as possible
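The properties above can be sketched in a few lines of NumPy: centre the data, take the SVD, and read the component scores and explained-variance fractions off the singular values. This is a minimal illustration (NumPy assumed; the toy matrix is invented, not the leukemia data):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: rows are samples (e.g. patients), columns are variables (genes)."""
    Xc = X - X.mean(axis=0)                           # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # coordinates of samples on the PCs
    explained = S**2 / np.sum(S**2)                   # fraction of variance per PC
    return scores, explained

# toy data: 5 samples, 3 variables; the first variable dominates the variance
X = np.array([[ 2.0, 0.1, 1.0],
              [ 4.0, 0.2, 1.1],
              [ 6.0, 0.1, 0.9],
              [ 8.0, 0.3, 1.0],
              [10.0, 0.2, 1.1]])
scores, explained = pca(X)
```

Because the variation is concentrated in the first variable, PC1 carries almost all of the variance here, which is exactly why plotting only the first two PCs can be a faithful summary of high-dimensional data.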
Principal components
PCA - example
PCA on all Genes Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2
Principal components - Variance
Clustering methods
• Hierarchical
– agglomerative (bottom-up), e.g. UPGMA
– divisive (top-down)
• Partitioning
– e.g. K-means clustering
Hierarchical clustering
• Representation of all pairwise distances
• Parameters: none (only the distance measure)
• Result:
– all items end up in one large cluster
– hierarchical tree (dendrogram)
• Deterministic
Hierarchical clustering – UPGMA Algorithm
1. Assign each item to its own cluster
2. Join the two nearest clusters
3. Re-estimate the distances between clusters
4. Repeat steps 2–3 until all items are in one cluster
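The four steps above can be sketched directly in Python. UPGMA defines the distance between two clusters as the average distance over all item pairs, which is the size-weighted mean used in the update below. This is a teaching sketch (NumPy assumed), not a production routine — in practice one would use e.g. average-linkage in an existing library:

```python
import numpy as np

def upgma(D):
    """UPGMA: repeatedly merge the two nearest clusters; the merged
    cluster's distances are size-weighted averages (= mean over item pairs)."""
    n = D.shape[0]
    clusters = {i: [i] for i in range(n)}          # cluster id -> member items
    dist = {(i, j): float(D[i, j]) for i in range(n) for j in range(i + 1, n)}
    active, merges, next_id = list(range(n)), [], n
    while len(active) > 1:
        (a, b), d = min(dist.items(), key=lambda kv: kv[1])   # nearest pair
        merges.append((a, b, d))
        new = clusters[a] + clusters[b]
        for c in active:                           # update distances to the new cluster
            if c in (a, b):
                continue
            da = dist[tuple(sorted((a, c)))]
            db = dist[tuple(sorted((b, c)))]
            dist[(c, next_id)] = (len(clusters[a]) * da + len(clusters[b]) * db) / len(new)
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
        active = [c for c in active if c not in (a, b)] + [next_id]
        clusters[next_id] = new
        next_id += 1
    return merges                                  # (cluster, cluster, distance) per join

# items 0 and 1 are close (distance 2); item 2 is far from both (distance 6)
D = np.array([[0., 2., 6.],
              [2., 0., 6.],
              [6., 6., 0.]])
merges = upgma(D)
```

The merge list corresponds one-to-one with the joins drawn in a dendrogram: the join heights are the recorded distances.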
Hierarchical clustering
Hierarchical clustering Data with clustering order and distances Dendrogram representation
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
Leukemia data - clustering of genes
K-means clustering
• Partition data into K clusters
• Parameter: the number of clusters (K) must be chosen
• Randomized initialization:
– may give different clusters each time
K-means – Algorithm
1. Assign each item a class in 1 to K (randomly)
2. For each class 1 to K:
– calculate the centroid (one of the K means)
– calculate the distance from the centroid to each item
3. Assign each item to the nearest centroid
4. Repeat steps 2–3 until no items are re-assigned (convergence)
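A minimal NumPy sketch of this algorithm follows the slide exactly: random initial assignment, then alternating centroid updates and nearest-centroid re-assignment until nothing changes. (NumPy assumed; the empty-cluster guard and the toy two-blob data are additions for robustness and illustration, not part of the slide's algorithm.)

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: random initial labels, then alternate
    (a) centroid = mean of each cluster, (b) re-assign to nearest centroid."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))       # step 1: random assignment
    for _ in range(n_iter):
        # step 2: centroids; if a cluster empties, re-seed it from a random item
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # step 3: re-assign each item to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # step 4: convergence
            break
        labels = new_labels
    return labels, centroids

# toy data: two well-separated groups of 5 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)),
               rng.normal(10, 0.3, size=(5, 2))])
labels, centroids = kmeans(X, k=2)
```

Note the role of the random initialization: re-running with a different seed can converge to a different partition, which is the non-determinism mentioned on the previous slide.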
K-mean clustering, K=3
K-mean clustering, K=3
K-mean clustering, K=3
K-means clustering of Leukemia data
K-means clustering of Cell Cycle data
Self-Organizing Maps (SOM)
• Partitioning method (similar to K-means)
• Clusters are organized in a two-dimensional grid
• The size of the grid must be specified (e.g. 2 × 2 or 3 × 3)
• The SOM algorithm finds an optimal organization of the data in the grid
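The idea can be sketched in a few lines: each grid cell holds a prototype vector; each training step pulls the best-matching unit, and more weakly its grid neighbours, toward a random data point. This is a highly simplified sketch (NumPy assumed; the learning-rate decay and Gaussian neighbourhood are common conventions, not taken from these slides):

```python
import numpy as np

def som(X, grid=(2, 2), n_iter=200, lr=0.5, sigma=1.0, seed=0):
    """Minimal SOM sketch: prototypes on a small 2-D grid; each step pulls the
    best-matching unit (BMU) and, more weakly, its grid neighbours toward a sample."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = X[rng.integers(len(X), size=rows * cols)].astype(float)  # init from data
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))           # best-matching unit
        # Gaussian neighbourhood on the grid: nearby units move more
        h = np.exp(-np.linalg.norm(coords - coords[bmu], axis=1)**2 / (2 * sigma**2))
        W += (lr * (1 - t / n_iter)) * h[:, None] * (x - W)      # decaying pull
    return W, coords                                             # prototypes + grid positions

# toy data: two separated groups, mapped onto a 2 x 2 grid
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, size=(20, 2)),
               rng.normal(5, 0.2, size=(20, 2))])
W, coords = som(X, grid=(2, 2))
```

After training, each data point is assigned to its nearest prototype, and the grid layout means neighbouring cells hold similar clusters — the property that distinguishes SOM from plain K-means.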
SOM - example
Comparison of clustering methods
• Hierarchical clustering
– shows the distances between all variables
– time-consuming with a large number of genes
– advantage to cluster on selected genes
• K-means clustering
– faster algorithm
– does not show relations between all variables
• SOM
– more advanced algorithm
Distance measures
• Euclidean distance
• Vector angle distance
• Pearson's distance
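The three measures behave very differently on expression profiles, which is why the choice matters for clustering. A minimal NumPy sketch (the vector-angle and Pearson measures are written in their usual "1 − similarity" form, a common convention):

```python
import numpy as np

def euclidean(x, y):
    """Absolute distance: sensitive to both shifts and scaling of a profile."""
    return np.linalg.norm(x - y)

def vector_angle(x, y):
    """1 - cos(angle): ignores scaling of a profile, but not shifts."""
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """1 - Pearson correlation: ignores both shifts and scaling of a profile."""
    return 1 - np.corrcoef(x, y)[0, 1]

x = np.array([1.0, 2.0, 3.0])
```

For example, a gene whose profile is exactly doubled (`2 * x`) has zero vector-angle and Pearson distance to the original but a large Euclidean distance, while a uniformly shifted profile (`x + 5`) has zero Pearson distance but non-zero angle distance.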
Comparison of distance measures
Summary
• Dimension reduction is important to visualize data
• Methods:
– Principal Component Analysis
– Clustering (the choice of distance measure is important)
• Hierarchical
• K-means
• Self-organizing maps
Coffee break Next: Exercises in PCA and clustering