Statistical Analysis of cDNA microarrays II

Terry Speed
Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research

Outline
• Different types of questions asked in microarray experiments
• Cluster analysis
• Single gene method
• A synthesis

Gene Expression Data
Gene expression data on p genes for n mRNA samples

          mRNA samples
Genes   sample 1  sample 2  sample 3  sample 4  sample 5  …
1         0.46      0.30      0.80      1.51      0.90    …
2        -0.10      0.49      0.24      0.06      0.46    …
3         0.15      0.74      0.04      0.10      0.20    …
4        -0.45     -1.03     -0.79     -0.56     -0.32    …
5        -0.06      1.06      1.35      1.09     -1.09    …
…

Gene expression level of gene i in mRNA sample j is either
• Log(Red intensity / Green intensity), or
• Log(Avg. PM - Avg. MM)
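As a minimal sketch of the two-colour case above, the genes-by-samples matrix can be built directly from the red and green intensities. The function name `log_ratio_matrix` and its arguments are hypothetical, and base 2 is an assumption here (this slide writes only "Log").

```python
import numpy as np

def log_ratio_matrix(red, green):
    """Return the p x n matrix of log2(R/G) values.

    red, green: arrays of shape (p genes, n samples) holding
    positive intensities. Names and base 2 are assumptions.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    if red.shape != green.shape:
        raise ValueError("red and green must have the same shape")
    return np.log2(red / green)

# Two genes measured on two samples
M = log_ratio_matrix([[200.0, 100.0], [50.0, 400.0]],
                     [[100.0, 100.0], [100.0, 100.0]])
```

A positive entry means the red channel is brighter than the green one for that gene and sample; a negative entry means the opposite.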

Experiments, horses for courses
mRNA levels compared in many different contexts:
• Tumour cell lines
• Different tissues, same organism
• Same tissue, different organisms (wt, ko, tg)
• Same tissue, same organism (trt vs ctl)
• Time course experiments
No single method of analysis can be appropriate for all. Rather, each type of experiment requires its own analysis.

Cluster Analysis
• Can cluster genes, cell samples, or both.
• Strengthens signal when averages are taken within clusters of genes (Eisen).
• Useful (essential?) when seeking new subclasses of cells, tumours, etc.
• Leads to readily interpreted figures.

Clusters
Taken from Nature, February 2000. Paper by Alizadeh, A. et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling".

Discovering sub-groups

Which genes have changed?
This is a common enough question. We will illustrate one approach when replicates are available.
GOAL: Identify genes with altered expression in the livers of one line of mice with very low HDL cholesterol levels compared to inbred control mice.
Experiment:
• Apo AI knock-out mouse model
• 8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6).
• 16 hybridisations: mRNA from each of the 16 mice is labelled with Cy5, pooled mRNA from control mice is labelled with Cy3.
• Probes: ~6,000 cDNAs, including 200 related to lipid metabolism.

Which genes have changed?
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene form the t statistic:

$$t = \frac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$$

3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values "off the line".
5. Adjust for multiple testing.
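Steps 1 and 2 can be sketched as follows, computing one t statistic per gene from a genes-by-hybridisations matrix of M values. The function name `two_sample_t` and the boolean `is_ko` vector are hypothetical; with 8 mice per group the denominator reduces to the 1/8 form on the slide.

```python
import numpy as np

def two_sample_t(M, is_ko):
    """Per-gene two-sample t statistic (unequal-variance form).

    M:     array (genes, hybridisations) of log2(R/G) values.
    is_ko: boolean vector marking the knockout hybridisations.
    """
    M = np.asarray(M, dtype=float)
    is_ko = np.asarray(is_ko, dtype=bool)
    ko, ctl = M[:, is_ko], M[:, ~is_ko]
    n_ko, n_ctl = ko.shape[1], ctl.shape[1]
    # Sample variances (ddof=1), i.e. squared SDs
    se = np.sqrt(ko.var(axis=1, ddof=1) / n_ko +
                 ctl.var(axis=1, ddof=1) / n_ctl)
    return (ko.mean(axis=1) - ctl.mean(axis=1)) / se

# One gene, three ko and three ctl values
t = two_sample_t([[1.0, 2.0, 3.0, 0.0, 1.0, 2.0]],
                 [True, True, True, False, False, False])
```

The resulting vector of t values is what would go into the histogram and the normal Q-Q plot in steps 3 and 4.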

Histogram
[Figure: histogram, with ApoA1 labelled]

Plot of t-statistics

Assigning p-values to measures of change
• Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics.
• For each of the possible permutations of the trt / ctl labels, compute the two-sample t-statistic t* for each gene.
• The unadjusted p-value for a particular gene is estimated by the proportion of t*'s greater than the observed t in absolute value.
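The permutation scheme above might be sketched like this, enumerating every way of assigning the treatment label (12,870 relabellings for 8 vs 8). The names `welch_t` and `permutation_pvalues` are hypothetical. One assumption differs slightly from the slide's wording: the comparison uses "greater than or equal to", so the observed labelling counts itself and no p-value is exactly zero.

```python
from itertools import combinations
import numpy as np

def welch_t(M, is_trt):
    """Per-gene two-sample t statistic for one labelling."""
    a, b = M[:, is_trt], M[:, ~is_trt]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                 b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def permutation_pvalues(M, is_trt):
    """Unadjusted permutation p-value for each gene (row of M)."""
    M = np.asarray(M, dtype=float)
    is_trt = np.asarray(is_trt, dtype=bool)
    n, k = M.shape[1], int(is_trt.sum())
    t_obs = np.abs(welch_t(M, is_trt))
    exceed = np.zeros(M.shape[0])
    n_perm = 0
    # Every choice of k columns to carry the trt label
    for idx in combinations(range(n), k):
        labels = np.zeros(n, dtype=bool)
        labels[list(idx)] = True
        t_star = np.abs(welch_t(M, labels))
        # Small tolerance guards against floating-point ties
        exceed += t_star >= t_obs - 1e-12
        n_perm += 1
    return exceed / n_perm

# One gene whose two groups are completely separated (3 vs 3)
p = permutation_pvalues([[10.0, 11.0, 12.5, 0.0, 1.0, 2.5]],
                        [True, True, True, False, False, False])
```

In this toy example only the observed labelling and its mirror image reach the observed |t|, so the p-value is 2 out of 20 relabellings. A gene with identical values inside a relabelled group would give a zero standard error; this sketch does not handle that case.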

Multiple Testing
Problem: We have just performed ~6000 tests!
=> need to control the family-wise false positive rate (Type I error).
=> use adjusted p-values.
Bonferroni adjustment. Multiply p-values by number of tests. Too conservative, doesn't take into account the dependence structure between the genes.
Westfall & Young. Estimate adjusted p-values using the permutation distribution of statistics which take into account the dependence structure between the genes. Less conservative.
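The two adjustments can be contrasted in a short sketch. The Bonferroni function follows the slide directly. The second function is a single-step "max |t|" permutation adjustment in the spirit of Westfall & Young; it is a simplified stand-in chosen for brevity and not necessarily the exact procedure used in the talk. The function names and the `t_perm` layout (one row per relabelling, one column per gene) are assumptions.

```python
import numpy as np

def bonferroni(p):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def max_t_adjusted(t_obs, t_perm):
    """Single-step max-|t| adjusted p-values.

    t_obs:  observed t statistics, shape (genes,).
    t_perm: permutation t statistics, shape (permutations, genes).
    The adjusted p-value of a gene is the share of permutations
    whose largest |t*| over all genes reaches that gene's |t|.
    """
    t_obs = np.abs(np.asarray(t_obs, dtype=float))
    max_t = np.abs(np.asarray(t_perm, dtype=float)).max(axis=1)
    return (max_t[None, :] >= t_obs[:, None]).mean(axis=1)

adj_bonf = bonferroni([0.001, 0.5])
adj_maxt = max_t_adjusted(
    [3.0, 1.0],
    [[3.0, 1.0], [0.5, -2.0], [1.0, 0.2], [-0.1, 0.3]],
)
```

Because the maximum is taken over all genes within each relabelling, the correlation between genes is carried into the reference distribution, which is why this kind of adjustment tends to be less conservative than Bonferroni.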

Apo A1: Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Apo AI. Genes with adjusted p-value < 0.01

Limitations
Cluster analyses:
1) Usually outside the normal framework of statistical inference.
2) Less appropriate when only a few genes are likely to change.
3) Needs lots of experiments.
Single gene tests:
1) May be too noisy in general to show much.
2) May not reveal coordinated effects of positively correlated genes.
3) Hard to relate to pathways.

A synthesis
We and others (Stanford) are working on methods which try to combine the best of both of the preceding approaches.
• Try to find clusters of genes and average their responses to reduce noise and enhance interpretability.
• Use testing to assign significance to averages of clusters of genes, as we did with single genes.

Clustering genes
E.g. p = 5, with genes 1 to 5 as the leaves of the tree:
Cluster 6 = (1, 2)
Cluster 7 = (1, 2, 3)
Cluster 8 = (4, 5)
Cluster 9 = (1, 2, 3, 4, 5)
Let p = number of genes.
1. Calculate within class correlation.
2. Perform hierarchical clustering, which will produce (2p - 1) clusters of genes.
3. Average within clusters of genes.
4. Perform testing on averages of clusters of genes as if they were single genes.
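A rough sketch of steps 1 to 3, under several assumptions that the slide does not state: "within class correlation" is read as the correlation of residuals after subtracting each class mean, the distance is 1 minus that correlation, and average linkage is used. The function names are hypothetical; `linkage` and `squareform` are the SciPy routines.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def gene_clusters(M, labels):
    """Return the 2p - 1 gene clusters (p leaves plus p - 1 merges)."""
    M = np.asarray(M, dtype=float)
    labels = np.asarray(labels)
    p = M.shape[0]
    # Step 1: remove class means, then correlate genes (rows)
    resid = M.copy()
    for c in np.unique(labels):
        cols = labels == c
        resid[:, cols] -= M[:, cols].mean(axis=1, keepdims=True)
    dist = np.clip(1.0 - np.corrcoef(resid), 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)
    # Step 2: hierarchical clustering on the condensed distances
    Z = linkage(squareform(dist, checks=False), method="average")
    clusters = [[g] for g in range(p)]
    for a, b, _, _ in Z:
        # Merge i creates cluster number p + i from two earlier ones
        clusters.append(sorted(clusters[int(a)] + clusters[int(b)]))
    return clusters

def cluster_averages(M, clusters):
    """Step 3: average the member genes of every cluster."""
    M = np.asarray(M, dtype=float)
    return np.vstack([M[c].mean(axis=0) for c in clusters])

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))            # 5 genes, 8 samples
labels = np.array([1] * 4 + [0] * 4)   # two classes of 4
clusters = gene_clusters(M, labels)
averages = cluster_averages(M, clusters)
```

Each row of `averages` can then be tested as if it were a single gene (step 4). Gene indices here start at 0, whereas the slide numbers genes from 1.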

Data - Ro1
Transgenic mice with a modified Gi-coupled receptor (Ro1).
Experiment: induced expression of Ro1 in mice.
• 8 control (ctl) mice
• 9 treatment mice, eight weeks after Ro1 was induced.
Long-term question: Which groups of genes work together?
Based on paper: "Conditional expression of a Gi-coupled receptor causes ventricular conduction delay and a lethal cardiomyopathy", see Redfern C. et al., PNAS, April 25, 2000. http://www.pnas.org
See also http://www.GenMAPP.org/ (Conklin lab, UCSF)

Histogram
Cluster of genes (1703, 3754)

Top 15 averages of gene clusters
T: -13.4, -12.1, 11.8, 11.7, 11.3, 11.2, -10.7, 10.6, -10.4, 10.3
Group ID: 7869, 3754, 6175, 4689, 6089, 1683, 2272, 9955, 5179, 3916, 8255, 4772, 10548, 9476
Annotations: = (1703, 3754); Might be influenced by 3754; = (6194, 1703, 3754); = (4572, 4772, 5809); Correlation; = (2534, 1343, 1954); = (6089, 5455, 3236, 4014)

Limitation
• Hard to extend this method to negatively correlated clusters of genes. Need to consider together with other methods.
• Need to identify high averages of clusters of genes that are due to high averages from subclusters of those genes.

Acknowledgments
• Yee Hwa Yang
• Sandrine Dudoit
• Natalie Roberts
• Ben Bolstad
• Ingrid Lonnstedt
• Karen Vranizan
• Matt Callow (LBL)
• Bruce Conklin (UCSF)
• WEHI Bioinformatics group