CS/CBB 545 - Data Mining: Spectral Methods (PCA, SVD) #2 - Application. Mark Gerstein, Yale University. gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)
Intuition on interpretation of SVD in terms of genes and conditions
SVD for microarray data (Alter et al., PNAS 2000)
Notation
• m = 1000 genes
  – row vectors
  – 10 eigengenes (v_i), each of dimension 10 (conditions)
• n = 10 conditions (assays)
  – column vectors
  – 10 eigenconditions (u_i), each of dimension 1000 (genes)
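The shapes in this notation can be checked with a short NumPy sketch. The random matrix `A` is a stand-in for real expression data, and the variable names are illustrative assumptions rather than anything taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10                      # genes x conditions, as in the notation above
A = rng.standard_normal((m, n))      # placeholder data, NOT real microarray values

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigenconditions = U    # columns u_i, each of dimension m = 1000 (genes)
eigengenes = Vt        # rows v_i, each of dimension n = 10 (conditions)

print(eigenconditions.shape, eigengenes.shape)
```

With the thin SVD there are at most n = 10 non-trivial components, which is why both lists above contain 10 vectors.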
Understanding Eigengenes (v_i) in terms of PCA on (large) gene-gene correlation matrix
Understanding Eigenconditions (u_i) in terms of PCA on (small) condition-condition correlation matrix. Bra-ket notation
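The link between the SVD and the two matrices can be sketched numerically. This is a minimal illustration under stated assumptions: the un-centered, un-normalized products `A @ A.T` and `A.T @ A` stand in for the gene-gene and condition-condition correlation matrices, the data are random, and m is reduced to 200 so the large matrix stays cheap.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 10                        # smaller m than on the slide, for speed
A = rng.standard_normal((m, n))       # placeholder data
U, s, Vt = np.linalg.svd(A, full_matrices=False)

gene_gene = A @ A.T                   # large: m x m
cond_cond = A.T @ A                   # small: n x n

u1, v1 = U[:, 0], Vt[0]

# u_1 is an eigenvector of the large matrix and v_1 of the small one,
# both with eigenvalue s_1**2.
large_ok = np.allclose(gene_gene @ u1, s[0] ** 2 * u1)
small_ok = np.allclose(cond_cond @ v1, s[0] ** 2 * v1)

# Projecting the data onto u_1 returns v_1 scaled by s_1,
# so each vector can be obtained from the other.
link_ok = np.allclose(A.T @ u1, s[0] * v1)

print(large_ok, small_ok, link_ok)
```

In practice this means only the small n x n problem ever needs to be solved; the vectors of the large problem follow from one matrix product.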
Plotting Experiments in Low Dimension Subspace
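A low-dimensional plot of this kind needs one coordinate pair per experiment. The sketch below, on placeholder random data and with hypothetical variable names, computes those coordinates by projecting each column of `A` onto the top two eigenconditions; the plotting call itself is left out.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 10))       # placeholder genes x conditions data
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Coordinates of each experiment (column of A) along the top-k eigenconditions.
coords = U[:, :k].T @ A                   # shape (k, n_conditions)

# The same numbers come straight from the SVD factors: s_i * v_i.
same = np.allclose(coords, s[:k, None] * Vt[:k])

xy = coords.T                             # one (x, y) point per experiment
print(xy.shape, same)
```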
Close up on Eigengenes
Genes sorted by correlation with top 2 eigengenes. Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106. Copyright © 2000 by the National Academy of Sciences
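One plausible way to produce such an ordering is sketched below. It is an assumption, not necessarily the procedure used by Alter et al.: each gene is correlated with the first two eigengenes, and genes are ordered by their angle in the plane of those two correlations. The periodic data and per-gene `phase` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 300, 12
t = np.arange(n)
phase = rng.uniform(0, 2 * np.pi, size=m)          # hypothetical per-gene phase
A = np.cos(2 * np.pi * t / n - phase[:, None])     # synthetic periodic "expression"

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def correlation(rows, vec):
    """Pearson correlation of every row of `rows` with `vec`."""
    r = rows - rows.mean(axis=1, keepdims=True)
    v = vec - vec.mean()
    return (r @ v) / (np.linalg.norm(r, axis=1) * np.linalg.norm(v))

c1 = correlation(A, Vt[0])       # correlation with eigengene 1
c2 = correlation(A, Vt[1])       # correlation with eigengene 2

angle = np.arctan2(c2, c1)       # position of each gene in the (c1, c2) plane
order = np.argsort(angle)        # row order for the sorted display
A_sorted = A[order]
```

For this noise-free synthetic data every gene lies in the plane of the two eigengenes, so the squared correlations sum to one and the angle carries all of the ordering information.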
Same thing, different experiment: genes sorted by relative correlation with first two eigengenes for alpha-factor experiment. Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Normalized elutriation expression in the subspace associated with the cell cycle. Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Biplot Applied to Genes and Conditions: see grouping of arrays and genes on same plot
Spectral Biclustering (c) M Gerstein '06, gerstein.info/talks
Biclustering to associate particular genes with certain phenotypes. [Figure: matrix of raw data (genes × conditions) → ? → shuffled matrix with reordered genes and reordered conditions, each sorted according to a classification vector, containing checkerboard “biclusters” of conditions with marker genes]
5 types of brain tumors. Pomeroy et al., Nature 415 (2002) 436: Prediction of central nervous system embryonal tumor outcome based on gene expression
Intuition on Identification of Blocky Matrices
Gene partition vector. [Figure labels: tumor 1, tumor 2, tumor 3; gene cluster 1, gene cluster 2]
Tissue partition vector
Biclustering by SVD
Identify checkerboard matrices by their action on classification vectors: formulation as “eigenproblem”. With $A$ the checkerboard matrix (genes × conditions), $x$ the condition classification vector and $y$ the gene classification vector: $A^T A x = x'$ and $A A^T y = y'$
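The eigenproblem can be illustrated on a small noise-free checkerboard. This is a sketch under assumptions: the cluster sizes, labels and block values are made up, and the leading singular vectors of `A` are used as the classification vectors.

```python
import numpy as np

gene_labels = np.repeat([0, 1], [6, 4])          # 2 hypothetical gene clusters
cond_labels = np.repeat([0, 1, 2], [3, 4, 2])    # 3 hypothetical tumor types
block_means = np.array([[5.0, 1.0, 3.0],
                        [1.0, 4.0, 2.0]])        # made-up block values
A = block_means[gene_labels][:, cond_labels]     # noise-free checkerboard, 10 x 9

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x = Vt[0]          # condition classification vector
y = U[:, 0]        # gene classification vector

# Both vectors solve the eigenproblem with eigenvalue s_1**2,
# i.e. A^T A maps x to a multiple of itself, and likewise A A^T for y.
x_ok = np.allclose(A.T @ A @ x, s[0] ** 2 * x)
y_ok = np.allclose(A @ A.T @ y, s[0] ** 2 * y)

# They are "blocky": the spread of values within each class is zero.
x_steps = [np.ptp(x[cond_labels == c]) for c in range(3)]
y_steps = [np.ptp(y[gene_labels == g]) for g in range(2)]
print(x_ok, y_ok, max(x_steps), max(y_steps))
```

Because the vectors are piecewise constant, sorting rows and columns by their entries groups the members of each block together.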
SVD to Solve Eigenproblem [Botstein]
Figure 1. Overview of important parts of the biclustering process. Yuval Kluger et al. Genome Res. 2003; 13: 703-716
Gene partition with noisy data
Normalization Rescales Rows and Columns to Same Means
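A rescaling of this kind can be sketched as an alternating (Sinkhorn-style) iteration. This is one possible scheme under the assumption of strictly positive data, not necessarily the exact normalization used in the lecture; the function name `rescale_to_equal_means` and the data are hypothetical.

```python
import numpy as np

def rescale_to_equal_means(A, n_iter=200, tol=1e-10):
    """Alternately rescale rows and columns of a positive matrix to mean 1."""
    B = np.asarray(A, dtype=float).copy()
    for _ in range(n_iter):
        B /= B.mean(axis=1, keepdims=True)   # every row mean -> 1
        B /= B.mean(axis=0, keepdims=True)   # every column mean -> 1
        # Column means are now exactly 1; stop once row means agree too.
        if np.abs(B.mean(axis=1) - 1).max() < tol:
            break
    return B

rng = np.random.default_rng(4)
A = rng.uniform(0.5, 5.0, size=(8, 5))       # placeholder positive "expression" data
B = rescale_to_equal_means(A)

print(B.mean(axis=1).round(6), B.mean(axis=0).round(6))
```

Each pass only multiplies whole rows or whole columns by a constant, so the block structure of the data is preserved while overall row and column scale differences are removed.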
Rescale columns
Representative Cancer Data set
• Lymphoma data from Dalla-Favera et al. at Columbia
• Informatics from Stolovitzky & Califano at IBM
• Supervised learning identified some characteristic genes associated with different types of lymphoma
Results on Representative Cancer Data set. Patients (samples) sorted according to projection onto blocky classification eigenvector (u_2); genes sorted according to projection onto blocky classification eigenvector (v_2). Matrix values represent outer products of two blocky classification eigenvectors
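The sorting and the outer-product display can be sketched on synthetic data. Everything here is an assumption for illustration: a noisy two-by-two checkerboard with shuffled rows and columns replaces the lymphoma data, and the second singular vector pair plays the role of the blocky classification eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
gene_labels = rng.permutation(np.repeat([0, 1], [12, 8]))     # hidden gene classes
sample_labels = rng.permutation(np.repeat([0, 1], [7, 5]))    # hidden sample classes
blocks = np.array([[4.0, 1.0],
                   [1.0, 3.0]])                               # made-up block values
A = blocks[gene_labels][:, sample_labels] + 0.1 * rng.standard_normal((20, 12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
gene_vec, sample_vec = U[:, 1], Vt[1]        # second singular vector pair

gene_order = np.argsort(gene_vec)            # sort genes by their entry in gene_vec
sample_order = np.argsort(sample_vec)        # sort samples likewise

# Matrix whose values are the outer product of the two sorted vectors.
signal = np.outer(gene_vec, sample_vec)[np.ix_(gene_order, sample_order)]
# The data matrix itself, with rows and columns in the same order.
A_sorted = A[np.ix_(gene_order, sample_order)]

print(gene_labels[gene_order])
print(sample_labels[sample_order])
```

After sorting, the hidden class labels appear as contiguous runs, which is what makes the checkerboard visible in the reordered matrix.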
Actual Data with Normalization and Sorting
Actual Data just with Sorting (no normalization)
Actual Data (no normalization or sorting)
Actual Data just with Sorting (no normalization)
Actual Data with Normalization and Sorting
Just signal from top classification eigenvectors. Patients (samples) sorted according to projection onto blocky classification eigenvector (u_2); genes sorted according to projection onto blocky classification eigenvector (v_2). Matrix values represent outer products of two blocky classification eigenvectors
Low Dimension Representation
Actual Values of Projections onto Classification Eigenvectors. Patients (samples) sorted according to projection onto blocky classification eigenvector (u_2); genes sorted according to projection onto blocky classification eigenvector (v_2)
Classification of Cancers Based on Projection onto Two Top Classification Eigenvectors: Better with Normalization. [Figure: two panels, Straight SVD and Normalized (“bistochastization”); four types of cancer in Dalla-Favera dataset, labelled CLL, DLCL, FL, DLCL]
Golub, T.R. et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (1999). [Figure labels: biclustering, bistochastization, bi-normalization, normalized cuts, SVD; classes ALL (B), ALL (T), AML]