ECS 289A Presentation
Jimin Ding

• Problem & Motivation
• Two-component Model
• Estimation for Parameters in above model
• Define low and high level gene expression
• Comparing expression levels
• Limitations of the model and method
• Other possible solutions
• References

A Model for Measurement Error for Gene Expression Arrays
David Rocke & Blythe Durbin
Journal of Computational Biology, Nov. 2001

Problem & Motivation
• Standard statistical inference assumes normally distributed data with constant variance. Hypothesis tests for the difference between control and treatment therefore need equal variances that do not depend on the mean of the data.
• Measurement error for gene expression rises in proportion to the expression level, so linear regression on the raw intensities fails, and the log transformation has been tried instead.
• However, for genes whose expression level is low, or that are entirely unexpressed, the measurement error does not shrink proportionately. The log transformation then fails because it inflates the variance of observations near background, which motivates the two-component model.

Example: Mice
From: Bartosiewicz et al., 2000

From Durbin et al., 2002

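Two-Component Model
• Model: $Y = \alpha + \mu e^{\eta} + \varepsilon$
• $Y$ is the intensity measurement
• $\mu$ is the expression level in arbitrary units
• $\alpha$ is the mean intensity of unexpressed genes
• Error terms: $\eta \sim N(0, \sigma_\eta^2)$ (multiplicative) and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ (additive)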
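To make the two error components concrete, here is a minimal simulation sketch. It assumes the model has the Rocke & Durbin form Y = alpha + mu * exp(eta) + eps with independent normal errors; the parameter values and the function name `simulate_intensity` are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def simulate_intensity(mu, alpha, sigma_eta, sigma_eps, rng):
    """Draw one intensity from Y = alpha + mu * exp(eta) + eps."""
    eta = rng.gauss(0.0, sigma_eta)  # multiplicative error, dominates when mu is large
    eps = rng.gauss(0.0, sigma_eps)  # additive error, dominates near background
    return alpha + mu * math.exp(eta) + eps


# Hypothetical parameter values, chosen only for illustration.
ALPHA, SIGMA_ETA, SIGMA_EPS = 100.0, 0.2, 30.0
rng = random.Random(0)

unexpressed = [simulate_intensity(0.0, ALPHA, SIGMA_ETA, SIGMA_EPS, rng)
               for _ in range(2000)]
expressed = [simulate_intensity(10000.0, ALPHA, SIGMA_ETA, SIGMA_EPS, rng)
             for _ in range(2000)]

# Near background the spread is roughly constant (about sigma_eps);
# at high expression the spread scales with the mean (about sigma_eta).
sd_low = statistics.stdev(unexpressed)
rsd_high = statistics.stdev(expressed) / (statistics.mean(expressed) - ALPHA)
print(f"SD at mu = 0: {sd_low:.1f}")
print(f"relative SD at mu = 10000: {rsd_high:.3f}")
```

With these settings the first number should land near the additive standard deviation and the second near the multiplicative one, which is the behaviour the model is meant to capture.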

Estimation for Background ($\alpha$, $\sigma_\varepsilon$)
• Estimation of background using negative controls
• Estimation of background with replicate measurements
• Estimation of background without replicate measurements

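Estimation of Background with Replicate Measurements
• Begin with a small subset of genes with low intensity (for example, the lowest 10%).
• Define a new subset consisting of the genes whose intensity values fall within an interval determined by the mean and standard deviation of the current subset.
• Repeat the first and second steps until the set of genes no longer changes.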
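The iteration above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the interval is taken to be the subset mean plus or minus two pooled within-gene standard deviations (the `width=2.0` default is an assumption), each gene is represented by a list of replicate intensities, and the names `pooled_sd` and `estimate_background` are hypothetical.

```python
import math
import random
import statistics


def pooled_sd(groups):
    """Pooled within-group standard deviation across replicate groups."""
    num = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return math.sqrt(num / den)


def estimate_background(replicates, start_fraction=0.10, width=2.0, max_iter=50):
    """Iteratively select background genes; return (alpha_hat, sigma_eps_hat, n_genes)."""
    genes = sorted(replicates, key=statistics.mean)
    # Step 1: start from the genes with the lowest mean intensity.
    subset = genes[: max(2, int(len(genes) * start_fraction))]
    for _ in range(max_iter):
        center = statistics.mean([statistics.mean(g) for g in subset])
        spread = pooled_sd(subset)
        lo, hi = center - width * spread, center + width * spread
        # Step 2: keep the genes whose mean intensity lies in the interval (assumed form).
        new_subset = [g for g in genes if lo <= statistics.mean(g) <= hi]
        # Step 3: stop once the set of genes no longer changes.
        if len(new_subset) < 2 or new_subset == subset:
            break
        subset = new_subset
    alpha_hat = statistics.mean([statistics.mean(g) for g in subset])
    return alpha_hat, pooled_sd(subset), len(subset)


# Synthetic data with hypothetical parameters: 800 unexpressed and 200 expressed genes.
rng = random.Random(1)
data = [[rng.gauss(100.0, 30.0) for _ in range(4)] for _ in range(800)]
for _ in range(200):
    mu = rng.uniform(1000.0, 20000.0)
    data.append([100.0 + mu * math.exp(rng.gauss(0.0, 0.2)) + rng.gauss(0.0, 30.0)
                 for _ in range(4)])

alpha_hat, sigma_eps_hat, n_background = estimate_background(data)
print(f"alpha_hat={alpha_hat:.1f}, sigma_eps_hat={sigma_eps_hat:.1f}, genes={n_background}")
```

The `max_iter` cap is a guard added for the sketch, since convergence depends on the data and on the starting subset.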

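Estimation of the High-level RSD
• The variance of the intensity in the two-component model is $\mathrm{Var}(Y) = \mu^2 S_\eta^2 + \sigma_\varepsilon^2$, where $S_\eta^2 = e^{\sigma_\eta^2}\left(e^{\sigma_\eta^2} - 1\right)$, $\sigma_\eta$ is the standard deviation of the multiplicative error, and $\sigma_\varepsilon$ is the standard deviation of the additive error.
• At high expression levels only the multiplicative error term is noticeable, so the ratio of the standard deviation to the mean is constant, i.e. $\mathrm{RSD} \approx S_\eta$.
• For each replicated gene at a high level, compute the mean and the standard deviation of the log intensities $\ln(y - \hat\alpha)$, where $\hat\alpha$ is the estimated background mean.
• Then use the pooled standard deviation to estimate $\sigma_\eta$: $\hat\sigma_\eta = \sqrt{\sum_i (n_i - 1) s_i^2 \,/\, \sum_i (n_i - 1)}$, where $s_i$ is the standard deviation of the log intensities of gene $i$ and $n_i$ is its number of replicates.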
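As a small sketch of this step, the code below evaluates the model variance and estimates the multiplicative standard deviation from the pooled standard deviation of log intensities. It assumes that the background mean has already been estimated and subtracted inside the logarithm; the function names and the synthetic parameter values are hypothetical.

```python
import math
import random
import statistics


def model_variance(mu, sigma_eta, sigma_eps):
    """Var(Y) = mu^2 * S_eta^2 + sigma_eps^2 under the two-component model."""
    s_eta_sq = math.exp(sigma_eta ** 2) * (math.exp(sigma_eta ** 2) - 1.0)
    return mu ** 2 * s_eta_sq + sigma_eps ** 2


def estimate_sigma_eta(high_replicates, alpha_hat):
    """Pooled SD of ln(y - alpha_hat) over replicated high-level genes."""
    num = 0.0
    den = 0
    for reps in high_replicates:
        logs = [math.log(y - alpha_hat) for y in reps]
        num += (len(logs) - 1) * statistics.variance(logs)
        den += len(logs) - 1
    return math.sqrt(num / den)


# Synthetic high-level genes with hypothetical parameters (sigma_eta = 0.2).
rng = random.Random(2)
high_genes = []
for _ in range(300):
    mu = rng.uniform(5000.0, 50000.0)
    high_genes.append([100.0 + mu * math.exp(rng.gauss(0.0, 0.2)) + rng.gauss(0.0, 30.0)
                       for _ in range(4)])

sigma_eta_hat = estimate_sigma_eta(high_genes, alpha_hat=100.0)
print(f"sigma_eta_hat = {sigma_eta_hat:.3f}")
print(f"Var(Y) at mu = 0: {model_variance(0.0, 0.2, 30.0):.1f}")
```

Pooling within-gene variances, rather than taking one standard deviation over all genes, keeps differences in true expression level between genes out of the error estimate.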

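Define “high” and “low”
• Low expression level: most of the variance is due to the additive error component. Approximate 95% CI for $\mu$: $(y - \hat\alpha) \pm 1.96\,\hat\sigma_\varepsilon$
• High expression level: most of the variance is due to the multiplicative error component. Approximate 95% CI for $\ln\mu$: $\ln(y - \hat\alpha) \pm 1.96\,\hat\sigma_\eta$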
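The two intervals can be illustrated with a short sketch. It assumes the usual normal-approximation forms that follow from the model: an additive interval around the background-corrected intensity at low levels, and an interval on the log scale (exponentiated back) at high levels. The function names and input values are hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for an approximate 95% interval


def ci_low_level(y, alpha_hat, sigma_eps_hat):
    """Approximate 95% CI for mu when additive error dominates."""
    center = y - alpha_hat
    return center - Z95 * sigma_eps_hat, center + Z95 * sigma_eps_hat


def ci_high_level(y, alpha_hat, sigma_eta_hat):
    """Approximate 95% CI for mu when multiplicative error dominates."""
    center = y - alpha_hat  # must be positive for the log scale to apply
    return (center * math.exp(-Z95 * sigma_eta_hat),
            center * math.exp(Z95 * sigma_eta_hat))


low_lo, low_hi = ci_low_level(150.0, alpha_hat=100.0, sigma_eps_hat=30.0)
high_lo, high_hi = ci_high_level(20100.0, alpha_hat=100.0, sigma_eta_hat=0.2)
print(f"low-level CI:  ({low_lo:.1f}, {low_hi:.1f})")
print(f"high-level CI: ({high_lo:.0f}, {high_hi:.0f})")
```

The low-level interval is symmetric and can include zero, while the high-level interval is symmetric only on the log scale.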

Comparing Expression Levels
• Common method: a standard t-test on the ratio of expression for treatment and control (low level), or on its logarithm (high level).
• Problem: this is less effective when a gene is expressed at a low level in one condition and at a high level in the other.

Solution: treat treatment and control as correlated
• Model:
• Variation: Background: High-level RSD:

Hypothesis Testing (Comparison)
• Assume the data have been adjusted:
• Testing: the null hypothesis that the gene has the same expression level under control and treatment.
• Then use the following approximate variance to carry out a standard t-test for the log ratio of the raw data:

Limitations
• No theoretical results (consistency, asymptotic distribution) are available for the estimators above.
• The cutoff point between high and low levels is fairly artificial.
• Convergence of the background estimation depends heavily on the data and on the initial selection.

Literature & Other Possible Solutions for Measurement Error
• Chen et al. (1997): measurement error is normally distributed with constant coefficient of variation (CV), in accord with experience.
• Ideker et al. (2000) introduce a multiplicative error component (normal).
• Newton et al. (2001) propose a gamma model for measurement error.
• Durbin et al. (2002) suggest the transformation $g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + c}\right)$, where $c = \sigma_\varepsilon^2 / S_\eta^2$.
• Huber et al. (2002) introduce the transformation $h(y) = \operatorname{arsinh}(a + b y)$.
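The last two transformations are closely related, which a short sketch can show. It assumes the generalized-log form for Durbin et al. and the arsinh form for Huber et al.; the function names and numeric values are hypothetical.

```python
import math


def glog(y, alpha, c):
    """Generalized log: ln(y - alpha + sqrt((y - alpha)^2 + c))."""
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))


def arsinh_transform(y, a, b):
    """Inverse hyperbolic sine of a linear function of the intensity."""
    return math.asinh(a + b * y)


# Hypothetical parameters, with c = sigma_eps^2 / S_eta^2.
alpha, sigma_eta, sigma_eps = 100.0, 0.2, 30.0
s_eta_sq = math.exp(sigma_eta ** 2) * (math.exp(sigma_eta ** 2) - 1.0)
c = sigma_eps ** 2 / s_eta_sq

# Choosing a = -alpha / sqrt(c) and b = 1 / sqrt(c) makes the two
# transformations differ only by the constant ln(sqrt(c)).
a, b = -alpha / math.sqrt(c), 1.0 / math.sqrt(c)
for y in (50.0, 100.0, 1000.0, 20000.0):
    shift = glog(y, alpha, c) - arsinh_transform(y, a, b)
    print(f"y = {y:8.1f}  glog = {glog(y, alpha, c):.4f}  difference = {shift:.4f}")
```

Unlike the plain logarithm, both functions are defined for intensities at or below background, and both behave like a logarithm at high intensities.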

References
• Blythe Durbin, Johanna Hardin, Douglas Hawkins, and David Rocke. “A variance-stabilizing transformation for gene-expression microarray data”, Bioinformatics, ISMB, 2002.
• Chen, Y., Dougherty, E. R. and Bittner, M. L. (1997) “Ratio-based decisions and the quantitative analysis of cDNA microarray images”, J. Biomed. Opt., 2, 364–374.
• Wolfgang Huber, Anja von Heydebreck, Martin Vingron (Dec. 2002) “Analysis of microarray gene expression data”, Preprint.
• Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. “Variance stabilization applied to microarray data calibration and to the quantification of differential expression”, Bioinformatics, 18 Suppl. 1: S96–S104, 2002. ISMB 2002.