Microarray data normalization and data transformation KaLok Ng

Estimating background • most image processing software reports background subtracted values for each feature

Estimating background There a number of schools of thought regarding the estimation of background

Gene expression ratio • Most analysis focus on differences in gene expression, which are

Gene expression ratio • A variety of methods for calculation the expression ratio using

Transformation of the gene expression ratio • Ratios are useful because they allow us

Experimental errors Situation where expression does not correlate with spot intensity • Image processing

Normalization of gene expression • A variety of reasons why the raw measurements of

Normalization of gene expression Data normalization - Assume data have already sorted and there

Total intensity normalization • Assuming [total RNA]target = [total RNA]sample • [Red]target = [Green]sample

Total intensity normalization • The intensities are rescaled such that Gknormal = Ntotal Gk

Scatter plot analysis • A scatter plot analysis of the raw Cy 3 and

Scatter plot of Cy 5 and Cy 3 • It is expect that the

Scatter plot of Cy 5 and Cy 3 data Raw data Normalize data

Stanford Micro. Array Database • http: //genome-www 5. stanford. edu/ • To download the

alpha factor release 013 raw data How the microarray date are represented • Ch

alpha factor release 013 raw data • column M is the ORF Name •

Lowess normalization • It is noted that in a number of reports that the

Lowess normalization Local variation as a function of intensity can be used to identify

Lowess normalization • • Lowess detects systematic deviations in the R-I plot and corrects

Data filtering Why do we need data filtering ? • Filter the dataset to

Data filtering Filtering high intensity data • At the high end spectrum, the array

Use of replicate data • Replication is essential for identifying and reducing the effect

Exercise Calculate the net intensity, normalize net intensity, R/G, normalize R/G ratio for channel

Slides: 24

Download presentation

Microarray data normalization and data transformation Ka-Lok Ng Asia University

Estimating background • most image processing software reports background subtracted values for each feature (spot) on the array by first estimating the background and then subtracting it pixel by pixel. • Local estimate of the background by identifying some number of pixels surrounding each spot and using those to calculate an average or median background level that can be subtracted from each pixel in the spot. – Disadvantage – if the pixels selected contain portions of the target spot, one may overestimate the background significantly http: //www. mathworks. com/company/pressroom/i • Some image processing software allows the mage_library/biotech. html estimation of global background – a single measurement that is used for the entire slide. This requires users to select a large representative area of the slide that is devoid of features. – Disadvantage – it fails to account for local variation in the background fluorescence of the substrate, and as result it may provide either an over- or underestimate of the appropriate background.

Estimating background There a number of schools of thought regarding the estimation of background for DNA microarrays (1) Not to calculate a background value • But rather to try to estimate it using statistical techniques • Argument – subtracting background introduces significant variance into the estimation of hybridization intensity, decreasing the reliability • Although this may not produce any significant changes in relative expression levels between conditions for highly expressed genes, it can have profound effect on estimates for genes expressed at low to moderate levels (2) Chooses to subtract background • There a number of choices – To use the fluorescene of the area surrounding each feature on the array. It uses the ‘naked’ substrate (where no DNA is spotted) to estimate the background for features that contains DNA, and as such neglects contributions from nonspecific by hybridization – An alternative – use negative control features printed on the array – These may provide an estimate of nonspecific hybridizatoin, there are not enough features on the array to provide an estimate of local background variation

Gene expression ratio • Most analysis focus on differences in gene expression, which are reported as a background subtracted ratio of expression levels between a selected query sample and a reference sample. • It is done by taking ratios of the measured expression between two physical states for each gene k on the array, where one designated the query and reference samples R and G respectively.

Gene expression ratio • A variety of methods for calculation the expression ratio using the median, mean and integrated expression measures. If we choose to measure expression using the median pixel value for each feature, then the median expression ratio for a given feature is where and are the median intensity values measured for pixels identified in the feature and background respectively.

Transformation of the gene expression ratio • Ratios are useful because they allow us to measure expression differences in an intuitive way • However, ratios are troublesome because they treat up- and down regulated genes differently. Genes up-regulated by a factor of 2 have an expression ratio of 2, while those downregulated by a factor of 2 have an expression ratio of ½. As result, down-regulated genes are compressed between 1 and 0, while up-regulated genes expand to cover the region between 1 and positive infinity. • A better alternative is to apply a logarithmic transformation, generally using the logarithm base 2. the advantage of this transformation is that it produced a continuous spectrum of values for differentially expressed genes while treating up- and down-regulated genes equivalently. • log 24 = 2, log 2¼ = -2 • Disadvantage ratio remove absolute gene expression information

Experimental errors Situation where expression does not correlate with spot intensity • Image processing gives poor estimation of the mean, median or total intensities • Pixel saturation • Background estimation • Density of fluors in the labelled hybridized molecules is high enough, interaction between the dye molecules can quench fluorescence • Poor labelling or hybridization can result in signals too faint to allow detection of certain expressed genes • Significant cross-hybridization • PCR oligonucleotides may be contaminated with other DNAs and may not bind with only the gene of interest • Oligonucleotide sequences may be incorrectly synthesised • Hybridization between alternative splice forms and members of gene families may cause overestimate fluorescence and therefore expression • Sample handling problems – hypoxia (缺氧), cold, heat shock or cell death of tissue samples can cause genes to be expressed that are not normally present in the tissue we want to study • RNA degradation can skew representation, as some RNAs are more stable than others the best protection against any of these artefacts is to institute validated laboratory protocols The signal for each gene is due to a combination of the hybridization of that gene, plus some non-specific hybridization, from all the other similar sequences, or partial transcripts in the sample, plus noises.

Normalization of gene expression • A variety of reasons why the raw measurements of gene expression for two samples may not be directly comparable • Quantity of starting RNA, [RNA]target ≠ [RNA]sample • Difference in Cy 3 and Cy 5 labeling • Difference in Cy 3 and Cy 5 detection efficiencies Consider a self-self comparison in which the sample is compared with itself expects the measured log 2(ratio=R/G) to be 0 for each gene • However, typically these are found to be distributed with a non-zero mean and standard deviation, indicating that there is a bias to one sample or the other • Normalization is any data transformation that adjusts for these effects and allows the data from two samples to be appropriately compared. • Normalization scales one (R or G) or both (R and G) of the measured expression levels for each gene to make them equivalent, and the expression ratios derived from them

Normalization of gene expression Data normalization - Assume data have already sorted and there are no data missing (delete or by interpolation『數學』以插值法求值) - Data normalization the process of correcting two or more data sets prior to comparing their gene expression values - First step, the background intensity signal (noise) is measured and subtracted from the Cy 3 and Cy 5 channels (background may be constant or it may vary locally) - Second step, apply a local or global normalization (such as overall average) to raw measurements (such as ratio (G/R) or Log 2 (Log 2(G) – Log 2(R) transformed) - Other global normalization method choose a set of housekeeping genes (commonly expressed across all tissues, such as the Human Gene Expression (Hu. GE) index database http: //www. biotechnologycenter. org/hio/

Total intensity normalization • Assuming [total RNA]target = [total RNA]sample • [Red]target = [Green]sample • Under this assumption, a normalization factor can be calculated and used to rescale the intensity for each gene in the array • Assume that the arrayed elements represent a random sampling of the genes in the organism no oversample or undersample the genes in the query or reference sample no bias • The normalization factor Ntotal is calculated by summing the measured intensities in both channels where Gk and Rk are the measured intensities for the kth array feature (such as the Cy 3 and Cy 5 intensities) and Narray is the total number of features (genes) in the microarray, the summation runs over the subset of features selected as normalization standards.

Total intensity normalization • The intensities are rescaled such that Gknormal = Ntotal Gk and Rknormal = Rk and the normalized expression ratio for each feature becomes which adjusts the ratio such that the mean ratio is equal to 1, this transformation is equivalent to subtracting a constant from the logarithm of the expression ratio log 2(Tknormal) = log 2(Tk) – log 2(Ntotal) • Disadvantage of total intensity normalization – insensitive to systematic variation in the log 2(ratio)

Scatter plot analysis • A scatter plot analysis of the raw Cy 3 and Cy 5 values of all the 6912 elements within the 5 gene clusters is shown in Figure 7 (4 plots). • The top two plots represent all the elements and the bottom two depict the genes that show the largest differences in signal. • Most of the genes that are distinct between the samples are expressed at lower levels (low fluorescent signal). These differences were more exaggerated in the a. RNA than the m. RNA sample because the signal-to-noise ratio was typically much greater in the a. RNA sample. http: //www. ambion. com/techlib/tn/95/954. html

Scatter plot of Cy 5 and Cy 3 • It is expect that the Cy 5 and Cy 3 plot should fall on the diagonal (i. e. Cy 5 = Cy 3) • In this plot, most of the points are found below the diagonal suggesting that for most of the points, the values measured on the cy 3 channel are higher than the values measured on the dy 5 channel. • Two possible caused, (1) either the m. RNA labelled with cy 3 was more abundant for most of the genes or (2) the cy 3 dye more efficient for the same amount of m. RNA, i. e. cy 3 tends to have a stronger intensity http: //molas 1. nhri. org. tw/~molas/index. htm? item=example&subitem=analysis 2&PHPSESSID=718975800 da 6 f 79 d 287871 e 77 f 50 b 33 a

Scatter plot of Cy 5 and Cy 3 data Raw data Normalize data

Stanford Micro. Array Database • http: //genome-www 5. stanford. edu/ • To download the yeast cell cycle time series data • Public login go to “Select arrays by Experimenter, Category, Subcategory and Organism” page, • set experimenter=Spellman, Category=cell cycle press display data select alpha factor release sample 013 raw data set decompress the Excel file

alpha factor release 013 raw data How the microarray date are represented • Ch 1 Intensity (Mean) Ch 1 Net (Median) Ch 1 Intensity (Median) % of saturated Ch 1 pixels Std Dev of Ch 1 Intensity Channel 1 Background (Mean) Ch 1 Background (Median) Std Dev of Ch 1 Background Ch 1 Net (Mean) • Ch 2 Intensity (Mean) Ch 2 Net (Median) Ch 2 Intensity (Median) % of saturated Ch 2 pixels Std Dev of Ch 2 Intensity Channel 2 Background (Mean) Ch 2 Background (Median) Std Dev of Ch 2 Background • Ch 2 Normalized Background (Media) Ch 2 Normalized Net (Mean) Ch 2 Normalized Intensity (Mean) Normalized Ch 2 Net (Median) Normalized Ch 2 Intensity (Median) Regression Correlation Diameter of the spot Spot Flag • Log(base 2) of R/G Normalized Ratio (Mean) Log(base 2) of R/G Normalized Ratio (Median) R/G Mean (per pixel) R/G Median (per pixel) % CH 1 PIXELS > BG + 1 SD % CH 1 PIXELS > BG + 2 SD % CH 2 PIXELS > BG + 1 SD % CH 2 PIXELS > BG + 2 SD G/R (Mean) • G/R Normalized (Mean) R/G (Median) Std Dev of pixel intensity ratios R/G Normalized (Mean) 1. Ch 1 Net (Mean) = Ch 1 Intensity (Mean) - Ch 1 Background (Median) 2. Ch 2 Net (Mean) = Ch 2 Intensity (Mean) – Ch 2 Background (Median) 3. Ch 2 Normalized Net (Mean) = Ch 2 Normalized Intensity (Mean) - Ch 2 Normalized Background (Media) 4. R/G (Mean) = Ch 2 Net (Mean) / Ch 1 Net (Mean) = 1/ [G/R (Mean)] 5. R/G Normalized (Mean) = Ch 2 Normalized Net (Mean) / Ch 1 Net (Mean) = 1/ [G/R Normalized (Mean)] 6. Log(base 2) of R/G Normalized Ratio (Mean) = Log 2[number 5] = Log [R/G Normalized (Mean)]

alpha factor release 013 raw data • column M is the ORF Name • Column O = Sequence Type = control, AORF or ORF • Box AV 94 Ch 1 Net (Mean) is an empty entry Ch 1 Intensity (Mean) Box AN 94 = 264, Ch 1 Background (Median) Box AT 94 =280, because 264 – 280 < 0 !!

Lowess normalization • It is noted that in a number of reports that the log 2(ratio) values often have a systematic dependence on intensity, most often a deviation from 0 for low intensity spots. • Lowess normalization – a method that can remove intensity-dependent dye-specific effects in the log 2(ratio) values Method • Plotting the measured log 2(R/G) ratios for each array feature as a function of the log 10(R*G) or An R-I plot displays the log 2(Ri/Gi) ratio for each element on the array as a ½*log 2(R*G) product intensities. The resulting function of the log 10(Ri*Gi) product intensities and can reveal systematic ‘R-I’ (for ratio-intensity) can reveal intensity-dependent effects in the measured log 2(ratio) values. Data shown specific artefacts (人作品) in the here are for a 27, 648 -element spotted mouse c. DNA array. http: //www. nature. com/ng/journal/v 32/n 4 s/fig_tab/ng 1032_F 1. html measurement of the ratio. • Self-self hybrisization test -It is expect to see absolutely no differential expression and consequently all log 2(ratio) measures should be zero on average. • The R-I plot shows a slight downward and upward curvature at the low and high intensities ends respectively, as well as an increased spread in the distribution of log 2(ratio) values at low intensities. • Lowess corrects these by performing a local weighted linear regression for each data point in the R-I plot and subtracting the calculated Application of local (pen group) lowess can correct for both systematic best-fit average log 2(ratio) from the variation as a function of intensity and spatial variation between spotting pens experimentally observed ratio for each point as on a DNA microarray. Here, the data from Fig. 1 has been adjusted using a a function of the log 10(intenisty). lowess in the TIGR MIDAS software (with a tri-cube weight function and a 0. 33 smooth parameter) available from http: //www. tigr. org/software. See http: //www. nature. com/ng/journal/v 32/n 4 s/fig_tab/ng 1032_F 2. html

Lowess normalization Local variation as a function of intensity can be used to identify differentially expressed genes by calculating an intensitydependent Z-score. In this R-I plot, array elements are color-coded depending on whether they are less than one standard deviation from the mean (blue), between one and two standard deviations (red), or more than two standard deviations from the mean (green). http: //www. nature. com/ng/journal/v 32/n 4 s/fig_tab/ng 1032_F 4. html

Lowess normalization • • Lowess detects systematic deviations in the R-I plot and corrects them by performing a local weighted linear regression as a function of the log 2(intensity) and subtracting the calculated best-fit average log 2(ratio) from the experimentally observed ratio for each data point. If one sets Lowess estimates y(xi), the dependence of the log 2(ratio) on the log 2(intensity), and then uses this function, point by point, to correct the measured log 2(ratio) values, so that after Lowess correction

Data filtering Why do we need data filtering ? • Filter the dataset to reduce its complexity and increase its overall quality What kind of data are filtered ? • Low quality data – filtering low and high intensity data Filtering low intensity data • By examining several representative R-I plots, it is obvious that the variability in the measured log 2(ratio) values increases as the measured intensity decreases • This is because relative error increases at lower intensity (relative error = error/(raw intensity)↑as raw intensity↓) • Good elements would have intensities more than two standard deviations above background. Only spots where the measured signals meet the criteria that (~ 95. 4%) and similarly for the mean. Other approaches include the use of absolute lower thresholds for acceptable array elements (sometimes referred to as “floor”) or percentage-based cutoffs in which some fixed fraction of elements is discarded.

Data filtering Filtering high intensity data • At the high end spectrum, the array elements saturate the detector used to measure fluorescence intensity • As soon as the intensity approaches its maximum value (typically 216 – 1 = 65535/pixel for a 16 -bit scanner), comparisons are no longer are meaningful, because the array elements become saturated and intensity measurements cannot go higher. • There a variety of approaches to dealing with this problem as well, including eliminating saturated pixels in the image-processing stip or setting a maximum acceptable value (often referred to as ceiling) for each array element.

Use of replicate data • Replication is essential for identifying and reducing the effect of variability in any experimental assay Two broad classes • Biological replicates use independently derived RNA from distinct biological sources to provide an assessment of both the variability in the assay and the biological sample • Technical replicates use repeated measurements of the same biological samples, which provide only information on the variability of the spots. These include – Printing exactly the same DNA at different locations on the same slide control the noise introduced by the location of the spot on the slide (could be affected by local defects !) – Print several exact copies and hybridize several times with exactly the same m. RNA in exactly the same conditions control the noise introduced by the hybridization stage

Exercise Calculate the net intensity, normalize net intensity, R/G, normalize R/G ratio for channel 1 (green) and channel 2 (red) respectively