Linking Genetic Profiles to Biological Outcome Paul Fogel

Linking Genetic Profiles to Biological Outcome
Paul Fogel, Consultant, Paris
S. Stanley Young, National Institute of Statistical Sciences
NISS, NMF Workshop, February 23, '07

Scotch whisky database
Original matrix = Prototypical flavor patterns × Mixing levels (weights) + Residual

How many flavor patterns?
  • Profile likelihood (Zhu and Ghodsi)
  • Scree plot
  • Volume filled (determinant)
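The profile-likelihood criterion listed above can be sketched as follows. This is a simplified reading of the Zhu and Ghodsi idea, not the code behind the slides: the ordered scree values are split into a "signal" group and a "noise" group, each group gets its own mean, both share one variance, and the split with the highest Gaussian log-likelihood is kept. The function name `profile_likelihood_rank` and the variance floor are illustrative assumptions.

```python
import numpy as np

def profile_likelihood_rank(values):
    """Pick the split point q of a scree curve by a two-group Gaussian profile likelihood.

    values: scree values (e.g. singular values); they are sorted in decreasing order here.
    Returns (q, log_lik), where log_lik[q - 1] is the profile log-likelihood of keeping q values.
    """
    d = np.sort(np.asarray(values, dtype=float))[::-1]
    p = d.size
    if p < 3:
        raise ValueError("need at least three values")
    log_lik = np.full(p - 1, -np.inf)
    for q in range(1, p):
        top, rest = d[:q], d[q:]
        # Sum of squares around each group's own mean, pooled into one variance.
        ss = ((top - top.mean()) ** 2).sum() + ((rest - rest.mean()) ** 2).sum()
        sigma2 = max(ss / (p - 2), 1e-12)  # floor avoids log(0) on degenerate input
        log_lik[q - 1] = -0.5 * p * np.log(2 * np.pi * sigma2) - ss / (2 * sigma2)
    return int(np.argmax(log_lik)) + 1, log_lik
```

For a data matrix `A`, the scree values could be taken from `np.linalg.svd(A, compute_uv=False)`.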

AnCnoc: Floral, Sweetness, Fruity, Malty, Nutty

Balmenach: Winey, Body, Honey, Sweetness, Nutty, Malty

Glen Garioch: Spicy, Fruity, Sweetness, Body, Malty

Lagavulin & Laphroaig: Medicinal, Smoky, Body

Statistical Issues
1. Massive testing: hundreds of “omic” predictors and several questions per sample.
2. Family-wise error rate versus false discovery rate.
3. Missing data, outliers.
Don’t fool yourself.
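To make the family-wise versus false discovery contrast concrete, here is a minimal sketch comparing two standard procedures on the same p-values: Bonferroni (controls the family-wise error rate) and Benjamini-Hochberg (controls the false discovery rate). The slides do not name these procedures; they are used here as common representatives, and the function names are illustrative.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Family-wise control: reject only p-values at or below alpha / m."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """False discovery rate control via the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank that passes
        reject[order[: k + 1]] = True     # reject everything up to that rank
    return reject
```

With hundreds of predictors, the Bonferroni threshold shrinks in proportion to the number of tests, while the step-up rule can reject more hypotheses at the cost of tolerating a controlled fraction of false discoveries.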

Matrix Factorization Methods
1. Principal component analysis.
2. Singular value decomposition.
3. Non-negative matrix factorization.
4. Independent component analysis.
5. Robust MF.
Area of active research.

Key Papers
1. Good (1969) Technometrics – SVD.
2. Liu et al. (2003) PNAS – rSVD.
3. Lee and Seung (1999) Nature – NMF.
4. Kim and Tidor (2003) Genome Research.
5. Brunet et al. (2004) PNAS – Microarray.
NMF commits one vector to each mechanism.
SVD eigenvectors come from a composite of mechanisms.

NMF Algorithm
Matrix A (figure axes: Samples; Genes or Compounds)
Start with random elements in red and green.
Optimize so that $A = WH + E$.
Green are the “spectra”. Red are the “weights”.
$\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized.
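A minimal sketch of the optimization on this slide, assuming Lee and Seung style multiplicative updates for the squared error. The slides do not say which update rule was used, so this is one common choice rather than the authors' algorithm. The function name `nmf`, the rank `k`, the iteration count and the small constant `eps` are illustrative.

```python
import numpy as np

def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (n x m) as W (n x k) times H (k x m).

    Starts from random non-negative W and H and applies multiplicative
    updates that reduce the squared error sum((A - W H)^2).
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps  # random start
    H = rng.random((k, m)) + eps  # random start
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The residual E is then `A - W @ H`. Because the start is random, different seeds can give different factorizations, which is one reason robust algorithms remain a research topic.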

Inference
• Test each variable sequentially within an ordered set.
• Each set corresponds to a particular eigenvector, which has been ordered by decreasing values.
• Increase in statistical power.
• Genomic example.
• Simulation.
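The sequential test within one ordered set can be sketched as below. The stopping rule is an assumption: variables are visited in order of decreasing loading, each is tested at the full level alpha, and testing stops at the first non-significant result (a fixed-sequence procedure). The slides do not spell out the rule, and the name `sequential_test` is illustrative.

```python
import numpy as np

def sequential_test(pvals, loadings, alpha):
    """Test variables in order of decreasing loading; stop at the first failure.

    pvals:    one p-value per variable in the set
    loadings: the variable's value on the set's vector (used only for ordering)
    Returns the indices of rejected variables, in testing order.
    """
    pvals = np.asarray(pvals, dtype=float)
    order = np.argsort(-np.asarray(loadings, dtype=float))
    rejected = []
    for idx in order:
        if pvals[idx] <= alpha:
            rejected.append(int(idx))
        else:
            break  # later variables in the sequence are not tested
    return rejected
```

Because no multiplicity penalty is paid inside the sequence, power increases when the ordering puts the truly affected variables first.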

Microarray Example
• Group AML: patients with acute myeloid leukemia
• Group ALL: patients with acute lymphoblastic leukemia
  – Subgroup ALL-T: T cell subtypes
  – Subgroup ALL-B: B cell subtypes
Golub, T. R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Clustering
NMF clusters samples correctly.
Additional subgroup of ALL-B.
Brunet et al. (2004). PNAS vol. 101 no. 12, 4164–4169

Sequential testing (diagram labels, in extraction order)
Cluster 1, ALL-B1 (33 genes)
Immune Response, MHC class II, 10 genes (p = 0.00019), 5 genes
Proteasome, 7 genes, p = 0.00054
Immune Response, 28 genes (p = 0.00047)
MHC class I & II, 6 genes, p = 0.00018
Upregulation in ALL-B 2 genes
Higher rate of transcription and replication processes
More: RNA Processing
Cluster 3, ALL-B2 (169 genes), 11 genes, p = 0.00260
Cell Growth and Proliferation, 61 genes
DNA Repair and Replication, 11 genes, p = 0.01519
Cell Cycle, 12 genes
Transcription, 16 genes
Proliferative nature compared with ALL-B1
Proteasomal activity
Energy production

Simulation
• Genes 1-5: upregulated by T1
• Genes 6-10: upregulated by T2
• Genes 11-20: upregulated by T1 and T2
• Intragroup correlation structure
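One way such a design could be generated is sketched below. Everything beyond the three gene blocks is an assumption made for illustration: three sample groups (control, T1, T2), Gaussian noise, a common shift `delta` for upregulated genes, and intragroup correlation `rho` induced by one shared latent factor per gene block. The slides do not give these details.

```python
import numpy as np

def simulate(n_per_group=10, n_genes=50, delta=1.0, rho=0.5, seed=0):
    """Simulate expression for control, T1 and T2 samples (rows) across genes (columns)."""
    rng = np.random.default_rng(seed)
    groups = np.repeat(["control", "T1", "T2"], n_per_group)
    n = groups.size
    # Gene blocks from the slide, as 0-based column indices.
    blocks = {
        "T1": np.arange(0, 5),      # genes 1-5
        "T2": np.arange(5, 10),     # genes 6-10
        "both": np.arange(10, 20),  # genes 11-20
    }
    X = rng.standard_normal((n, n_genes))
    # Assumed correlation model: genes in a block share one latent factor,
    # giving pairwise correlation rho while keeping unit variance.
    for idx in blocks.values():
        shared = rng.standard_normal((n, 1))
        X[:, idx] = np.sqrt(rho) * shared + np.sqrt(1 - rho) * X[:, idx]
    is_t1 = groups == "T1"
    is_t2 = groups == "T2"
    X[np.ix_(is_t1, blocks["T1"])] += delta
    X[np.ix_(is_t2, blocks["T2"])] += delta
    X[np.ix_(is_t1 | is_t2, blocks["both"])] += delta
    return X, groups
```

Genes beyond the first 20 carry no treatment effect and serve as nulls. The values are Gaussian and can be negative, so they would need shifting or transforming before being passed to a non-negative factorization.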

Simulation results
• Increased power
• Same level of FDR
For more details see paper

Summary
• The strategy is conceptually simple:
  – Non-negative matrix factorization is used to create groups of genes that are moving together in the dataset.
  – The error rate to be controlled is allocated over these groups.
  – Within each group, genes are tested sequentially.
• The strategy should be effective if there are sets of genes moving together so that group formation reflects biological reality.
Areas of research:
  – Robust algorithms
  – Speed
  – Multiblock NMF (e.g. relate active motifs with differentially expressed genes)
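The grouping and allocation steps of the summary can be sketched as follows. Two choices here are assumptions, not taken from the slides: each gene is assigned to the component on which it has its largest loading, and the overall error rate is split equally across groups. `H` stands for a components-by-genes matrix of non-negative loadings, and `allocate_groups` is an illustrative name.

```python
import numpy as np

def allocate_groups(H, alpha=0.05):
    """Build a testing plan from a (components x genes) loading matrix.

    Returns one entry per component with its share of alpha and its genes
    ordered by decreasing loading, ready for sequential testing.
    """
    H = np.asarray(H, dtype=float)
    k = H.shape[0]
    owner = H.argmax(axis=0)  # component with the largest loading for each gene
    plan = []
    for c in range(k):
        genes = np.nonzero(owner == c)[0]
        ordered = genes[np.argsort(-H[c, genes])]
        plan.append({"component": c, "alpha": alpha / k, "genes": ordered.tolist()})
    return plan
```

Each group's gene list would then be tested in the given order at that group's level.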

Contact Information
Paul Fogel, independent consultant
paul.fogel@wanadoo.fr
+33 1 43 26 16 86
Stan Young, National Institute of Statistical Sciences
[email protected] org
919 685 9328
Literature
www.niss.org/irMF
Software