Lecture Topic 5: Pre-processing AFFY data

Probe Level Analysis
• The purpose
  – Calculate an expression value for each probe set (gene) from the 11-25 PM and MM intensities
  – Critical for later analysis: avoids GIGO (garbage in, garbage out)

Difficulties
• Large variability
• Few measurements (11-25 at most)
• MM is very complex: it is signal plus background
• Signal has to be SCALED
• Probe-level effects

Different Methods
• MAS 4 (Affymetrix, 1996)
• MAS 5 (Affymetrix, 2002)
• Robust Multichip Analysis (RMA), 2002
• GC-RMA, 2004

MAS 4
• A: the set of probe pairs selected

Avg Diff
• Calculated by taking the difference PM - MM for every probe pair and averaging over the probe pairs
  – Excluded OUTLIER pairs whose PM - MM was more than 3 SD from the mean difference
  – Was NOT a robust average
  – NOT log-transformed
  – COULD be negative (about 1/3 of the time)
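The Avg Diff calculation can be sketched in a few lines. This is a minimal illustration, not Affymetrix's code: the function name `avg_diff` is hypothetical, and computing the mean and SD after dropping the smallest and largest difference is an assumption about how the 3 SD rule was applied.

```python
from statistics import mean, stdev

def avg_diff(pm, mm, n_sd=3.0):
    """Average PM - MM difference over probe pairs, skipping outlier pairs.

    Assumption: the mean and SD used by the outlier rule are computed
    after dropping the single smallest and largest difference.
    Needs at least 4 probe pairs.
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    trimmed = sorted(diffs)[1:-1]          # drop min and max before estimating spread
    center, spread = mean(trimmed), stdev(trimmed)
    kept = [d for d in diffs if abs(d - center) <= n_sd * spread]
    return mean(kept)                      # plain (non-robust) average, no log transform

# One wild probe pair (difference 500) is excluded from the average.
pm = [110, 112, 111, 109, 113, 110, 111, 112, 600, 108, 111]
mm = [100] * 11
print(avg_diff(pm, mm))

# When MM tends to exceed PM, the result is negative.
print(avg_diff([5, 6, 7, 8], [10, 10, 10, 10]))
```

The second call shows why negative values were common: nothing in the average prevents MM from exceeding PM.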

MAS 5
• Signal = TukeyBiweight{log2(PM_j - IM_j)}
• Discussed this earlier. Requires calculating IM (the ideal mismatch)
• The adjusted differences (PM - IM) are log transformed, and the Tukey biweight makes the average robust to outlying observations
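A one-step Tukey biweight can be sketched as follows. The tuning constants `c=5.0` and `eps=1e-4` and the function names are assumptions made for illustration. The IM values are taken as given, since calculating them is a separate step.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    values = list(values)
    m = median(values)
    s = median([abs(v - m) for v in values])   # median absolute deviation
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)            # scaled distance from the median
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den

def mas5_log_signal(pm, im):
    """Log2-scale signal for one probe set; assumes IM_j < PM_j for every pair."""
    return tukey_biweight([math.log2(p - i) for p, i in zip(pm, im)])

# The extreme value 100 gets weight 0 and does not move the estimate.
print(tukey_biweight([1.0, 1.0, 1.0, 1.0, 100.0]))
print(mas5_log_signal([12.0, 20.0, 36.0], [4.0, 4.0, 4.0]))
```

Values close to the median get a weight near 1, and anything beyond `c` median absolute deviations is ignored. This is what makes the average robust.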

Robust Multichip Analysis
• ONLY uses PM and ignores MM
• SACRIFICES accuracy but makes major gains in PRECISION
• Basic steps:
  – 1. Calculate chip background (*BG) and subtract from PM
  – 2. Carry out intensity-dependent normalization for PM-*BG
    • Lowess
    • Quantile normalization (discussed before)
  – 3. Log transform the normalized PM-*BG
  – 4. Robust multichip analysis of all probes in the set, using the Tukey median polish procedure. Signal is the antilog of the result.

RMA - Step 1: Background Correction
• Irizarry et al. (2003)
• Finds the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise)
• E(s_i | s_i + b_i)
• Here, s_i is assumed to follow an Exponential distribution with parameter θ
• b_i is assumed to follow N(μ_e, σ_e²)
• Estimate μ_e and σ_e as the mean and standard deviation of empty spots

RMA- BG Corrected Value
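Under the exponential-signal, normal-noise model, the true signal given the observed intensity follows a normal distribution truncated at zero, so the conditional expectation has a closed form. The sketch below assumes the exponential parameter is a rate (called `rate` here) and that the noise is not itself truncated. The function and argument names are illustrative, not from the lecture.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bg_corrected(observed, mu, sigma, rate):
    """E(s | s + b = observed) for s ~ Exponential(rate), b ~ N(mu, sigma^2).

    The posterior of s is N(a, sigma^2) truncated to s >= 0, with
    a = observed - mu - sigma^2 * rate, so its mean is
    a + sigma * pdf(a / sigma) / cdf(a / sigma).
    Note: for a / sigma far below zero the cdf underflows; a production
    version would need a numerically safer ratio.
    """
    a = observed - mu - sigma * sigma * rate
    z = a / sigma
    return a + sigma * normal_pdf(z) / normal_cdf(z)

# Bright probe: the correction is close to subtracting the background mean.
print(bg_corrected(1000.0, mu=100.0, sigma=20.0, rate=0.01))
# Dim probe below the background mean: the corrected value stays positive.
print(bg_corrected(50.0, mu=100.0, sigma=20.0, rate=0.01))
```

Unlike plain subtraction of a background estimate, this correction never returns a negative intensity, so the later log transform is always defined.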

RMA - Normalization
• Use the background corrected intensities B(PM) to carry out normalization
  – Lowess (for spatial effects)
  – Quantile normalization (to allow comparability amongst replicate slides)
  – Normalized B(PM) are log transformed
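Quantile normalization forces every array to share the same intensity distribution. A minimal sketch, assuming a probes-by-arrays list of lists and breaking ties by position; the function name is hypothetical:

```python
def quantile_normalize(matrix):
    """matrix[i][j] is the intensity of probe i on array j; returns a new matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]

    # Reference distribution: mean of the k-th smallest value across all arrays.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[k] for sc in sorted_cols) / n_cols for k in range(n_rows)]

    # Replace each value by the reference value at the same rank within its array.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result

data = [[5, 4, 3],
        [2, 1, 4],
        [3, 6, 6],
        [4, 2, 8]]
for row in quantile_normalize(data):
    print(row)
```

After the call every column contains exactly the same set of values, and only their order across probes differs. The log transform is then applied to these normalized values.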

RMA summarization
• Use MEDIAN POLISH to fit a linear model
• Given a MATRIX of data:
  – Data = overall effect + row effect + column effect + residual
• Find the row and column effects by subtracting the row medians and column medians in turn, until all the medians are smaller than some epsilon in absolute value
• Gives the estimated row, column and overall effects when done
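The alternating sweeps can be sketched as below. This is a minimal version of Tukey's median polish; the function name, iteration cap and tolerance are assumptions chosen for illustration.

```python
from statistics import median

def median_polish(data, max_iter=20, eps=1e-8):
    """Fit data[i][j] = overall + row_eff[i] + col_eff[j] + resid[i][j]."""
    resid = [list(row) for row in data]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median out of the residuals.
        row_meds = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_meds[i]
            resid[i] = [v - row_meds[i] for v in resid[i]]
        shift = median(col_eff)            # keep column effects centred on 0
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Column sweep: same idea down the columns.
        col_meds = [median([resid[i][j] for i in range(n_rows)])
                    for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_meds[j]
            for i in range(n_rows):
                resid[i][j] -= col_meds[j]
        shift = median(row_eff)            # keep row effects centred on 0
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once every median removed in this pass is negligible.
        if max(abs(m) for m in row_meds + col_meds) < eps:
            break
    return overall, row_eff, col_eff, resid

# A perfectly additive table: 10 + row effect + column effect.
table = [[6, 9, 10],
         [7, 10, 11],
         [9, 12, 13]]
overall, rows, cols, resid = median_polish(table)
print(overall, rows, cols)
```

Because only medians are used, a single aberrant cell ends up in the residuals instead of distorting the row and column effects.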

Median Polish of RMA
• For each probe set we have a matrix (probes in rows and arrays in columns)
• We assume:
  – Signal = probe affinity effect + log-scale expression level + error
• Also assume the sum of the probe affinities is 0
• Use MEDIAN polish to estimate the expression level in each array
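Written out for one probe set, the model and its link to the median polish output might look like this. The symbols are chosen here for illustration and do not come from the slides. Median polish centres the row effects on a median of zero, which stands in for the sum-to-zero constraint.

```latex
% p indexes probes (rows), a indexes arrays (columns) within one probe set.
% y_{pa} is the log2 of the normalized, background-corrected PM intensity.
\[
  y_{pa} = \alpha_p + \beta_a + \varepsilon_{pa},
  \qquad \sum_{p} \alpha_p = 0 ,
\]
% alpha_p: probe affinity effect; beta_a: log-scale expression level on array a.
% Median polish returns an overall effect m, row effects r_p and column effects c_a, so
\[
  \hat{\alpha}_p = r_p ,
  \qquad
  \hat{\beta}_a = m + c_a .
\]
```

The expression estimate for an array is therefore the overall effect plus that array's column effect, and its antilog gives the signal on the original scale.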