DNA 2: Last week's take-home lessons

  • Slides: 54
DNA 2: Last week's take-home lessons
• Comparing types of alignments & algorithms
• Dynamic programming (DP)
• Multi-sequence alignment
• Space-time-accuracy tradeoffs
• Finding genes -- motif profiles
• Hidden Markov Model (HMM) for CpG islands

RNA 1: Today's story & goals
• Integration with previous topics (HMM & DP for RNA structure)
• Goals of molecular quantitation (maximal fold-changes, clustering & classification of genes & conditions/cell types, causality)
• Genomics-grade measures of RNA and protein and how we choose and integrate (SAGE, oligo-arrays, gene-arrays)
• Sources of random and systematic errors (reproducibility of RNA source(s), biases in labeling, non-polyA RNAs, effects of array geometry, cross-talk)
• Interpretation issues (splicing, 5' & 3' ends, gene families, small RNAs, antisense, apparent absence of RNA)
• Time series data: causality, mRNA decay, time-warping

Discrete & continuous bell-curves

Primary to tertiary structure
gggatttagctcagtt gggagagcgccagact gaa gat ttg gag gtcctgtgttcgatcc acagaattcgcacca

Non-Watson-Crick bps (figure label: -CH3) (ref)

Modified bases & bps in RNA (figure: positions 1 and 72 labeled) (ref)

Covariance
(Figure labels: TψC, 3' acc, anticodon, D-stem)
$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$
M = 0 to 2 bits; x = base type. See Durbin et al., p. 266-8.

Mutual Information
Alignment:
ACUUAU
CCUUAG
GCUUGC
UCUUGA
i=1, j=6:
$M_{1,6} = \sum_{x_1, x_6} f_{x_1 x_6} \log_2 \frac{f_{x_1 x_6}}{f_{x_1} f_{x_6}} = f_{AU} \log_2 \frac{f_{AU}}{f_A f_U} + \ldots = 4 \times 0.25 \log_2 \frac{0.25}{0.25 \times 0.25} = 2$
i=1, j=2:
$M_{1,2} = 4 \times 0.25 \log_2 \frac{0.25}{0.25 \times 1} = 0$
General form: $M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$; M = 0 to 2 bits; x = base type. See Durbin et al., p. 266-8. See Shannon entropy, multinomial (Grendar).
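The two worked values on this slide can be checked with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `mutual_information` and the 1-based column convention are assumptions made for illustration, and frequencies are taken as plain counts with no pseudocounts.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information M_ij in bits between 1-based columns i and j."""
    n = len(alignment)
    col_i = [seq[i - 1] for seq in alignment]
    col_j = [seq[j - 1] for seq in alignment]
    f_i = Counter(col_i)               # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts of base pairs
    total = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return total

# The four aligned sequences from the slide.
alignment = ["ACUUAU", "CCUUAG", "GCUUGC", "UCUUGA"]
print(mutual_information(alignment, 1, 6))  # covarying columns
print(mutual_information(alignment, 1, 2))  # column 2 is invariant
```

Columns 1 and 6 covary perfectly (AU, CG, GC, UA), which gives the maximum of 2 bits; column 2 is always C, so it carries no information about column 1.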

RNA secondary structure prediction
Mathews DH, Sabina J, Zuker M, Turner DH. J Mol Biol 1999 May 21; 288(5): 911-40. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.
Each set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs.

Stacked bp & ss

Initial 1981 O(N^2) DP methods: Circular Representation of RNA Structure
(Figure: circular diagram with 5' and 3' ends)
Did not handle pseudoknots
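To make the idea of DP over nested base pairs concrete, here is a minimal sketch of a Nussinov-style recursion that maximizes the number of base pairs. It is a stand-in for illustration, not the energy-based 1981 method the slide refers to: the function name `max_base_pairs`, the `min_loop` parameter, and the choice to allow G-U pairs are all assumptions. This simple variant fills an N x N table and takes cubic time, and because it only combines nested sub-intervals it cannot represent pseudoknots.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop unpaired
    bases enclosed by every pair (illustrative Nussinov-style recursion)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum pairs in the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # case: j is unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))  # a three-pair hairpin with an AAA loop
```

Each pair (k, j) splits the problem into the interval to the left of k and the interval enclosed by the pair, which is exactly the non-crossing property visible in the circular representation.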

RNA pseudoknots, important biologically, but challenging for structure searches

Dynamic programming finally handles RNA pseudoknots too.
Rivas E, Eddy SR. J Mol Biol 1999 Feb 5; 285(5): 2053-68. A dynamic programming algorithm for RNA structure prediction including pseudoknots. (ref)
Worst case complexity of O(N^6) in time and O(N^4) in memory space.
Bioinformatics 2000 Apr; 16(4): 334-40 (ref)

CpG island (+) in an ocean of non-island (-) sequence
First-order Hidden Markov Model
MM = 16, HMM = 64 transition probabilities (adjacent bp)
States: A+ C+ G+ T+ (island), A- C- G- T- (ocean)
Transition labels in the figure: P(A+|A+), P(C+|A+), P(G+|C+)
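The two counts on this slide follow from the state sets: 4 bases give 4 x 4 = 16 transitions in a plain first-order Markov model, and 8 signed states give 8 x 8 = 64 in the HMM. The sketch below enumerates both and then shows the simpler of the two approaches, a log-odds score from two separate 16-parameter Markov chains. The training strings are toy placeholders invented for illustration, not real island or ocean data, and the pseudocount of 1 is an assumption.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"

# Transition counts quoted on the slide.
mm_transitions = list(product(BASES, repeat=2))
hmm_states = [base + sign for sign in "+-" for base in BASES]
hmm_transitions = list(product(hmm_states, repeat=2))
print(len(mm_transitions), len(hmm_transitions))

def transition_probs(seq, pseudocount=1.0):
    """Estimate P(next | prev) for adjacent bases, with a pseudocount."""
    counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for prev in BASES:
        row_total = sum(counts[(prev, nxt)] + pseudocount for nxt in BASES)
        for nxt in BASES:
            probs[(prev, nxt)] = (counts[(prev, nxt)] + pseudocount) / row_total
    return probs

def log_odds(seq, plus, minus):
    """Sum of log2 ratios over adjacent pairs; positive favors the + model."""
    return sum(math.log2(plus[p] / minus[p]) for p in zip(seq, seq[1:]))

# Hypothetical toy training strings (assumptions, not real data).
plus = transition_probs("CGCGCGCGCG")
minus = transition_probs("ATATTTAAAT")
print(log_odds("CGCG", plus, minus) > 0)
```

The HMM differs from this two-chain score in that it also models switches between + and - states, so it can locate island boundaries inside a long sequence rather than scoring a pre-cut window.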

Small nucleolar (sno)RNA structure & function
Lowe et al. Science (ref)

snoRNA search

Performance of RNA-fold matching algorithms
Algorithm          CPU bp/sec   True pos.   False pos.
TRNASCAN '91       400          95.1%       0.4 x 10^-6
TRNASCAN-SE '97    30,000       99.5%       <7 x 10^-11
snoRNAs '99                     >93%        <10^-7
(See p. 258, 297 of Durbin et al.; Lowe et al. 1999)

Putative snoRNA gene disruption effects on rRNA modification
Primer extension pauses at 2'-O-Me positions, forming bands at low dNTP.
Lowe et al. Science 1999 283: 1168-71 (ref)


RNA (array) & Protein/metabolite (MS) quantitation
• RNA measures are closer to genomic regulatory motifs & transcriptional control.
• Protein/metabolite measures are closer to flux & growth phenotypes.

8 cross-checks for regulon quantitation
(Figure: eight data sources surrounding "Coregulated sets of genes")
• In vitro array binding or selection
• Protein fusions (A-B)
• In vivo crosslinking & selection (1-hybrid)
• Microarray data
• Phylogenetic profiles (presence/absence of proteins P1 to P7 across genomes EC, SC, BS, HI)
• Metabolic pathways (e.g. TCA cycle)
• Conserved operons (B. subtilis: purM purN purH purD; E. coli: purM purN, purH purD)
• Known regulons in other organisms

Check regulons from conserved operons (chromosomal proximity)
B. subtilis: purE purK purB purC purL purF purM purN purH purD
C. acetobutylicum: purE purC purF purM purN purH purD
In E. coli, each color above is a separate but coregulated operon: purE purK, purH purD, purM purN, purB, purC, purL, purF (figure label: E. coli PurR motif)
Predicting regulons and their cis-regulatory motifs by comparative genomics. McGuire & Church (2000) Nucleic Acids Research 28: 4523-30.

Predicting the PurR regulon by piecing together smaller operons
(Figure: short pur operons, drawn from purC, purD, purE, purF, purH, purK, purL, purM, purN, purQ and purY, in E. coli, M. tuberculosis, P. horikoshii, C. jejuni, M. jannaschii and P. furiosus)
The above predicts regulon connections among these genes: Y, Q, L, C, F, M, N, H, K, E, D

(Whole genome) RNA quantitation objectives
• RNAs showing maximum change; minimum change detectable/meaningful
• RNA absolute levels (compare protein levels); minimum amount detectable/meaningful
• Network -- direct causality -- motifs
• Classify (e.g. stress, drug effects, cancers)

(Sub)cellular inhomogeneity
• Dissected tissues have mixed cell types.
• Cell-cycle differences in expression.
• XIST RNA localized on inactive X-chromosome (see figure)

Fluorescent in situ hybridization (FISH)
• Time resolution: 1 msec
• Sensitivity: 1 molecule
• Multiplicity: >24
• Space: 10 nm (3-dimensional, in vivo)
10 nm accuracy with far-field optics: energy-transfer fluorescent beads, nanocrystal quantum dots, closed-loop piezo-scanner (ref)


Steady-state population-average RNA quantitation methodology
• Microarrays (1): experiment vs. control, ~1000 bp hybridization per ORF. Outputs: R/G ratios; R, G values; quality indicators.
• Affymetrix (2): 25-bp hybridization with PM and MM probes per ORF. Outputs: averaged PM-MM; "presence".
• SAGE (3), MPSS (4): sequence counting of concatamers; 14 to 22-mer sequence tags for each ORF. Outputs: counts of SAGE tags.
(1) DeRisi et al., Science 278: 680-686 (1997)
(2) Lockhart et al., Nat Biotech 14: 1675-1680 (1996)
(3) Velculescu et al., Serial Analysis of Gene Expression, Science 270: 484-487 (1995)
(4) Brenner et al.

(Figure: GeneChip workflow)
• Biotinylated RNA from experiment
• GeneChip expression analysis probe array. Each probe cell contains millions of copies of a specific oligonucleotide probe.
• Streptavidin-phycoerythrin conjugate
• Image of hybridized probe array

Most RNAs <1 molecule per cell.
Yeast RNA 25-mer array. Wodicka, Lockhart, et al. (1997) Nature Biotech 15: 1359-67
Reproducibility confidence intervals to find significant deviations. (ref)

Microarray data analyses
AFM, AMADA, Churchill, CLUSFAVOR, CLUSTER, D-CHIP, GENE-CLUSTER, J-EXPRESS, PAGE, PLAID, SAM (web), SMA, SVDMAN, TREE-ARRANGE & TREEPS, VERA & SAM, XCLUSTER, ArrayTools, ARRAY-VIEWER, F-SCAN, P-SCAN-ALYZE, GENEX, MAPS

Statistical models for repeated array data
• Tusher, Tibshirani and Chu (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9): 5116-21.
• Selinger et al. (2000) RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature Biotech. 18: 1262-7.
• Li & Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2(8): 0032.
• Kuo et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18(3): 405-12.

“Significant” distributions
(graph)
t-test: t = (Mean / SD) * sqrt(N). Degrees of freedom = N-1.
H0: The mean value of the difference = 0.
If the difference distribution is not normal, use the Wilcoxon Matched-Pairs Signed-Ranks Test.
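The paired t statistic on this slide can be computed directly from a list of per-gene differences. This is a minimal sketch using only the Python standard library; the function name `paired_t` is hypothetical, the example differences are made-up numbers, and SD is assumed to mean the sample standard deviation (N-1 denominator).

```python
import math
import statistics

def paired_t(differences):
    """Return (t, degrees of freedom) for H0: mean difference = 0."""
    n = len(differences)
    mean = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample SD, N-1 denominator
    return mean / sd * math.sqrt(n), n - 1

# Hypothetical paired differences (e.g. log-ratios from repeated arrays).
t, df = paired_t([1, 2, 3, 4, 5])
print(round(t, 4), df)
```

The resulting t is then compared against the t distribution with N-1 degrees of freedom; with small N or clearly non-normal differences, the rank-based Wilcoxon test named on the slide is the safer choice.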

Independent Experiments
Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Livesey et al. (2000) Current Biology

RNA quantitation
• Is less than a 2-fold RNA-ratio ever important? Yes; 1.5-fold in trisomies.
• Why oligonucleotides rather than cDNAs? Alternative splicing, 5' & 3' ends; gene families.
• What about using a subset of the genome or ratios to a variety of control RNAs? It makes trouble for later (meta) analyses.


(Whole genome) RNA quantitation methods
Method                                                 Advantages
Genes immobilized, labeled RNAs                        Chip manufacture
RNAs immobilized, labeled genes (Northern gel blot)    RNA sizes
QRT-PCR                                                Sensitivity 1e-10
Reporter constructs                                    No cross-hybridization
Fluorescent In Situ Hybridization                      Spatial relations
Tag counting (SAGE)                                    Gene discovery
Differential display & subtraction                     "Selective" discovery

Microarray to Northern

Genomic oligonucleotide microarrays (E. coli 25-mer array)
• 295,936 oligonucleotides (including controls)
• Intergenic regions: ~6 bp spacing
• Genes: ~70 bp spacing
• Not polyA (or 3' end) biased
• Strengths: gene family paralogs, RNA fine structure (adjacent promoters), untranslated & antisense RNAs, DNA-protein interactions
(Figure labels: protein coding 25-mers; non-coding sequences (12% of genome); tRNAs, rRNAs)
Affymetrix: Mei, Gentalen, Johansen, Lockhart (Novartis Inst)
HMS: Church, Bulyk, Cheung, Tavazoie, Petti, Selinger

Random & Systematic Errors in RNA quantitation
• Secondary structure
• Position on array (mixing, scattering)
• Amount of target per spot
• Cross-hybridization
• Unanticipated transcripts

Spatial Variation in Control Intensity
(Figure panels: Experiment 1, Experiment 2)
Selinger et al.

Detection of Antisense and Untranslated RNAs
(Figure panels: Expression Chip; Reverse Complement Chip; Crick Strand and Watson Strand on the same chip)
• b0671 - ORF of unknown function, tiled in the opposite orientation
• “intergenic region 1725” - is actually a small untranslated RNA (csrB)

Mapping deviations from expected repeat ratios
Li & Wong


Independent oligos analysis of RNA structure
Selinger et al.

Predicting RNA-RNA interactions

of the human genome using microarray technology. Shoemaker et al. (2001) Nature 409: 922-7.


Time courses
• To discriminate primary vs secondary effects we need conditional gene knockouts.
• Conditional control via transcription/translation is slow (>60 sec up & much longer for down regulation).
• Chemical knockouts can be more specific than temperature (ts-mutants).

Beyond steady state: mRNA turnover rates (rifampicin time-course)
(Figure: fraction of initial signal (16S normalized), 0 to 1.4, vs. time (min), 0 to 18, for cspE and lpp measured by chip and by Northern. Chip metric = Smax.)
Half-lives (chip / Northern) for the two transcripts: 2.4 min / 2.9 min and >20 min / >300 min.
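A half-life can be estimated from such a time course by fitting a straight line to the log of the remaining fraction. The sketch below assumes simple first-order decay after transcription is blocked; the function name `half_life` is hypothetical, and the data points are synthetic, generated from a 2.4 min half-life purely to show that the fit recovers it. They are not measurements from the slide.

```python
import math

def half_life(times, fractions):
    """Least-squares fit of ln(fraction) = intercept - k * t; returns ln(2) / k.

    Assumes first-order decay with a negative fitted slope."""
    logs = [math.log(f) for f in fractions]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    slope = sxy / sxx  # equals -k for a decaying transcript
    return -math.log(2) / slope

# Synthetic time course (minutes) for a transcript with a 2.4 min half-life.
times = [0, 2, 4, 6, 8]
fractions = [0.5 ** (t / 2.4) for t in times]
print(round(half_life(times, fractions), 2))
```

For a very stable transcript the fitted slope is close to zero, so the estimate becomes large and poorly constrained, which is one reason such half-lives are reported as lower bounds.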

TimeWarp: pairs of expression series, discrete or interpolative
(Figure: panels a to f aligning series a, time points t0 to t6, with series b, time points u0 to u4, on an i, j grid)
Aach & Church
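Aligning two expression series on an i, j grid is a dynamic programming problem of the same family as sequence alignment. The sketch below is generic discrete dynamic time warping, offered only to illustrate the idea; it is not the TimeWarp implementation, and the function name `dtw_distance`, the absolute-difference cost, and the example series are assumptions.

```python
def dtw_distance(a, b):
    """Minimum total |a[i] - b[j]| over monotonic alignments of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b held
                                 cost[i][j - 1],      # b advances, a held
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical series: b is a time-compressed copy of a.
print(dtw_distance([0, 1, 1, 2], [0, 1, 2]))
```

Holding one series while the other advances is what lets a stretched or compressed time course match its counterpart at no extra cost.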

TimeWarp: cell-cycle experiments

TimeWarp: alignment example
