Biological Sequence Determination DNA Robert M Horton Ph
Biological Sequence Determination DNA Robert M. Horton, Ph. D, MS rmhorton@cybertory. org artwork: commons. wikimedia. org protein
Sequencing context protein RNA DNA technological biological old methods classical sequencing (Sanger) automation, base calling, quality scoring shotgun sequencing, assembly, finishing concepts chemistry, enzymes physics, computers contemporary methods: pyrosequencing, CRT, SOLi. D "next generation" applications: resequencing, epigenetics, RNA-Seq “third generation” SMRT, nanopores, etc. microfluidics microfabrication contemplation of the future
Protein Sequencing Why Proteins? Small Digestible (pepsin, trypsin, Chemically distinguishable chymotrypsin) (purifyable) Important Insulin Fred Sanger Nobel prize, 1958
Classes of RNA modified bases (GMe, GMe 2, CMe, T, ψ, UH 2, I, IMe) r. RNA modified bases ( cap with m 7 G , 2'O-methylation ) splicing polyadenylation sn. RNA prokaryotic: 70 S = 50 S (5 S, 23 S) + 30 S (16 S) eukaryotic: 80 S = 60 S (5 S, 5. 8 S, 28 S) + 40 S (18 S) 7 SL RNA of Signal Recognition Particle (SRP) homologous to Alu SINE (11% of human genome) splicosomes (U 1, U 2, U 4, U 5, U 6) sno. RNA t. RNA pre-r. RNA processing (U 3) guide 2'-O-methylation guide pseudouridylation RNAi si. RNA (short interfering RNA) mi. RNA (micro. RNA) post-transcriptional gene silencing 3' UTR, conserved pi. RNA transcriptional silencing of retrotransposons . . . et cetera. . .
DNA Sequencing 1977 The “modern era” of DNA sequencing begins
Chemical Sequencing of DNA (Maxam-Gilbert) February 1977 Two steps: Damage bases specific, partial Cleave backbone Four reactions: A A+G C C+T http: //nobelprize. org/nobel_prizes/chemistry/laureates/1980/gilbert-lecture. pdf
Chain Termination Sequencing (Sanger Sequencing) 2', 3'-dideoxy TTP Sanger F, Nicklen S & Coulson AR DNA sequencing with chain-terminating inhibitors PNAS 74: 5463 -7, December 1977
Primer Extension Bacterial DNA polymerase I adds nucleotides to the 3' end of primer to complement 5' -overhanging template. Each strand is an ordered sequence with a direction. Arrows indicate 5' to 3' direction (DNA grows biochemically in this direction). (pyrophosphate released)
Sanger sequencing Individual reactions with one d. NTP partially “poisoned” with dideoxynucleotides (dd. ATP, dd. CTP, dd. GTP, dd. TTP) Decades of improvements automated fluorescence four colors one lane dye terminators one reaction capillaries
Automated Sanger sequencing trace base calls quality scores
Quality Score q = -10 * log 1 0(p) p = predicted error probability 1/1000 probability of error = q score of 30 uses data quality monitoring assembly consensus finishing criteria
Sequencing Strategy Primer walking (serial) Shotgun Sequencing (parallel)
Universal Primers
Assembly
read length affects assembly
Next-Generation Sequencing Pyrosequencing (454/Roche) Cycles of Reversible Termination (Solexa/Illumina) Ligation (ABI SOLi. D) "Third-Generation" Sequencing SMRT (Pacific Biosciences)
pyrosequencing + pyrophosphate APS (released by d. NTP incorporation) adenosine 5`-phosulfate ATP sulfurylase + sulfate ATP
pyrosequencing + ATP O 2 oxygen + luciferin firefly luciferase AMP + pyrophosphate + light + oxyluciferin
pyrosequencing more biochemistry problem solution pyrophosphate recycling apyrase breaks down ATP to AMP + 2 Pi (or wash out solution) luciferase can use d. ATP use an analog suitable for polymerase but not luciferase
pyrosequencing flowgram Ronaghi M. Genome Res 11: 3 -11, 2001
Emulsion PCR water droplet in oil one primer bound to solid bead individual template molecule
Emulsion PCR DNA anchored to bead all comes from the same template molecule "polony" = "PCR colony"
pyrosequencing Alternatives to chemiluminescence heat (“thermosequencing”) p. H change ("Ion Torrent")
Cycles of Reversible Termination Illumina/Solexa Helicos Illumina Metzker M. Sequencing Technologies - The Next Generation. Nature Reviews Genetics 11: 31 -46, 2010.
Short Read Alignment
FASTQ Format maq. sourceforge. net/fastq. shtml $q = chr(($Q<=93? $Q : 93) + 33); 0 60 $Q = ord($q) - 33; !"#$%&'()*+, -. /0123456789: ; <=>? @ABCDEFGHIJKLMNOPQRSTUVWXYZ[]
Paired End Tags Mme I TCCRAC (20/18)
Illumina Genome Analyzer Library Preparation
Illumina Genome Analyzer Bridge Amplification forms "Polonies"
Illumina Genome Analyzer Cycles of Reversible Termination
Ligation-based Sequencing SOLi. D (ABI) Complete Genomics Polonator (Church Lab)
SOLi. D Sequencing by Oligonucleotide Ligation and Detection 3'- ATNNN~ZZZ*-5' artwork is from the pamphlet Dibase Sequencing and Color Space Analysis
SOLi. D
SOLi. D: Dibase Encoding AT CG GC TA AC CA GT TG AA CC GG TT AG CT GA TC
SOLi. D: Dibase Encoding base space color space Each color sequence can represent four different base sequences. The base sequence is one unit longer than the color sequence. You need to know one base to tell which sequence is represented.
SOLi. D: Dibase Encoding SNP causes two color changes single color change is probably an error
Single-Molecule, Real-Time (SMRT) Sequencing High throughput Parallelism (small reactions) Speed (immediate results) Long reads Read individual templates from mixtures Haplotyping
SMRT Sequencing 41
Simulated SMRT Sequencing Data
Platform Comparisons Xu M, Fujita D, and Hanagata N. Perspectives and Challenges of Emerging Single-Molecule DNA Sequencing Technologies. Small 5(23): 2638– 2649, 2009
Other Technologies Mass spectrometry TEM STM nanonozzle probes nanopores (protein, graphene) ionic current blockage transverse tunneling currents exonuclease
Targeted Exome Capture nimblegen. com
Bonus Slides
Selenocysteine t. RNA
Omics transcriptome exome kinome
“Plus and Minus” Method (circa 1975) "minus": polymerase stops at missing base "plus": T 4 DNA polymerase 3' exonuclease stalled by d. NTP Sanger F, Coulson AR. J Mol Biol. 94(3): 441 -8, 1975
pyrosequencing Animation: http: //www. pyrosequencing. com/Dyn. Page. aspx? id=7454
Bioinformatics Classics Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48: 443 -453, 1970. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 147: 195 -197, 1981.
Automated Base Calling 1. identify idealized peak locations assume locally even spacing 2. find observed peaks 3. match observed to expected omit and split as necessary 4. add "good" unmatched peaks
Error Probabilities predictive does not require knowing actual sequence valid the set of bases assigned to probability p should have an actual error rate of p discriminating helps to distinguish correct vs. incorrect base calls 1, 000 base calls with 1, 0000 errors (p = 0. 01) better if we can break it into two 500, 000 sets: p=0. 018 in one set (9000 errors) p=0. 002 in second set (1000 errors)
Error Probability Calibration 'Given a set of parameters and a training set of reads for which it is known which base-calls are correct and which are errors, find a way of associating parameter values to error probabilities that has (near) maximum discrimination power for small r. '
Phred Quality Score Parameters Empirical. Small values tend to correspond to more accurate base-calls. Window-based parameters smooth out error probabilities. 1. Peak spacing (7 peak window) ● largest / smallest peak-to-peak spacing 2. Uncalled/called ratio (7 peak window) ● amplitude of largest uncalled / smallest called peak 3. Uncalled/called ratio (3 peak window) 4. Peak resolution ● -1 * # bases to the next unresolved base
Lookup Table Production Select a range of 50 threshold values for each of the 4 parameters. For each 4 -tuple of parameter thresholds (504=6, 250, 000): find the set of bases defined by these thresholds compute empirical error rates The parameter set with the lowest error rate goes into the table. These 50 values are chosen so that each increment contains approximately the same number of bases in the training set. if multiple 4 -tuples give the same rate, choose the largest set These bases are removed, and the process is repeated until all bases are represented in the table.
Post-translational Modification (or co-translational) glycosylation (glycoproteins) mucin, cellular interaction, structural N-linked hydroxylysine in collagen covalently bound enzyme cofactors thyroid hormone hydroxylation FAD, biotin, etc ubiquitination methylation isoprenylation phosphorylation acetylation (acetate, CH 3 CO 2− ) myristoylation (myristate, a C 14 fatty acid) palmitoylation (palmitate, a C 16 fatty acid) alkylation iodination asparagine serine, threonine, hydroxylysine, hydroxyproline acylation (at O, N, or S) O-linked signal transduction ADP-ribosylation signal transduction cholera toxin . . . and many more
“Wandering Spot” Method ca. 1970 s RNA or DNA partial digestion 2 D separation Horizontal = base composition Vertical = size This is an RNAse T 1 fragment, so it ends in G Fuke, M. , and Busch, H. Nucleic Acids Res. 4: 339 -352, 1977.
Enzymatic vs Chemical Partial Cleavage of RNA Sequence-specific RNases Phy M: A+U A: pyrimidine-specific (C+U) U 2: A or A+G T 1: degrades after G residues V 1: degrades paired bases Peattie DA. PNAS 76: 1760 -1764, 1979. enzymatic chemical
Modified Nucleotides in t. RNA (post-transcriptional) methyl guanine (GMe) pseudouridine (ψ) dimethylguanine(GMe 2) dihydrouridine (UH 2) methylcytosine (Me) inosine (I) ribothymine (T) methylinosine (IMe)
Nucleotide Ambiguity Codes (IUPAC) Unambiguous A, C, G, T, U 2 -fold degenerate M = A or C R = A or G (pu. Rine) W = A or T (Weak) S = C or G (Strong) Y = C or T (p. Yrimidine) K = G or T 3 -fold degenerate V = A, C or G (not T) H = A, C or T (not G) D = A, G or T (not C) B = C, G or T (not A) 4 -fold degenerate X = A, C, G or T N = A, C, G or T
Automated Base Calling Phred third-party base caller with better accuracy than ABI's open source(ish) Ewing B, Hillier L, Wendl MC, Green P. Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Res. 8: 175 -185, 1998 Ewing B and Green P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 8: 186 -194, 1998
Shotgun Sequencing Staden R. A strategy of DNA sequencing employing computer programs, Nucleic Acids Research 7: 2601 -2610, 1979 “With modern fast sequencing techniques and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps. This paper describes computer programs that can be used to order both sequence gel readings and clones. A method of coding for uncertainties in gel readings is described. These programs are available on request. ” “The whole of the DNA to be sequenced is shotgunned into a suitable vector and cloned. Ideally the cloned fragments would be of at least 200 bases in length. The clones are then sequenced and the computer used to collate the data. Collation involves searching for overlaps in the data. ”
2 D gel electrophoresis
cybertory. org/exercises/primer. Design
Protein Sequencing Edman Degradation phenylisothiocyanate invented ca. 1950 s automated ca. 1973 proceeds from N-terminus read 50 -70 aa http: //en. wikipedia. org/wiki/Edman_degradation A few amino acids can ID a spot on 2 D gel Mass Spectrometry Precise determination of molecular weights of peptides
(Sec) modified from Wikimedia commons
- Slides: 68