DNA as Biological Information Rasmus Wernersson Henrik Nielsen
Overview • Learning objectives – About biological information – A note about DNA sequencing techniques and DNA data – File formats used for biological data – Introduction to the GenBank database
What are genes? Rough strain & DNA from killed smooth strain
DNA: composition • Around 1950 it was known that DNA is a polymer of nucleotides, more specifically deoxyribonucleotides. • The four nucleotides that make up DNA differ only in their nitrogenous base. • There are two purines (adenine and guanine) and two pyrimidines (cytosine and thymine). Uracil (a pyrimidine) occurs only in RNA.
Deoxyribonucleotides (figure: the sugar carbons are numbered 1’ to 5’; “deoxy” refers to the missing hydroxyl group at the 2’ carbon)
DNA: composition 2 Chargaff’s rule: There are equal amounts of A and T, and equal amounts of C and G (while the ratio between G+C and A+T can vary).
DNA: X-ray crystallography: Rosalind Franklin. DNA preparation: Maurice Wilkins. Model building and interpretation of X-ray spectra: Francis Crick & James Watson.
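Chargaff's rule can be checked on any double-stranded sequence by counting bases. The following is a minimal sketch, not part of the original slides; the function names are illustrative, and the short example strand is chosen only for demonstration. Note that the rule holds for the two strands taken together, not necessarily for one strand alone.

```python
from collections import Counter

def base_composition(seq):
    """Return the fraction of each base (A, C, G, T) in a DNA sequence."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def double_stranded_composition(strand):
    """Composition of a strand together with its complementary strand."""
    complement = str.maketrans("ACGT", "TGCA")
    other = strand.upper().translate(complement)[::-1]
    return base_composition(strand.upper() + other)

comp = double_stranded_composition("ATGGCCAGGTAA")
gc_content = comp["G"] + comp["C"]   # this ratio may differ between organisms
print(comp, gc_content)
```

In the double-stranded composition, the fractions of A and T are equal and the fractions of C and G are equal, while the G+C fraction depends on the sequence.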
Watson & Crick 1953
DNA structure
Information flow in biological systems
DNA sequences = summary of information
5’ AGACC 3’
3’ TCTGG 5’
5’ ATGGCCAGGTAA 3’
(Figure: ribose and deoxyribose with the sugar carbons numbered 1 to 5; DNA backbone with 5’ and 3’ ends.) Images: DNA backbone: http://en.wikipedia.org/wiki/DNA (Deoxy)ribose: http://en.wikipedia.org/
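Because the two strands are complementary and antiparallel, one strand is enough to write down the other. A minimal sketch of computing the reverse complement (the function name is illustrative, not from the slides):

```python
# Map each base to its pairing partner (A-T, C-G), preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AGACC"))  # GGTCT
```

The slide shows the lower strand as 3’ TCTGG 5’; read in the conventional 5’ to 3’ direction, that is GGTCT.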
PCR • Melting: 96 °C, 30 sec • Annealing: ~55 °C, 30 sec • Extension: 72 °C, 30 sec • 35 cycles Animation: http://depts.washington.edu/~genetics/courses/genet371b-aut99/PCR_contents.html
PCR Animation: http://www.people.virginia.edu/~rjh9u/pcranim.html PCR graph: http://pathmicro.med.sc.edu/pcr/realtime-home.htm
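The numbers on this slide allow a quick back-of-the-envelope calculation. The sketch below assumes ideal amplification, where every template is copied in every cycle; the `efficiency` parameter and the function names are illustrative assumptions, and real reactions level off before reaching the theoretical yield. The time estimate ignores temperature ramping and any initial or final hold steps.

```python
def pcr_copies(cycles, start_copies=1, efficiency=1.0):
    """Theoretical copy number after a number of cycles.

    efficiency=1.0 means perfect doubling per cycle (an idealisation).
    """
    return start_copies * (1 + efficiency) ** cycles

def program_minutes(cycles, step_seconds=(30, 30, 30)):
    """Time spent in the melting, annealing and extension steps only."""
    return cycles * sum(step_seconds) / 60

print(f"{pcr_copies(35):.2e} copies from one template")
print(f"{program_minutes(35)} minutes of cycling")
```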
Gel electrophoresis • DNA fragments are separated using gel electrophoresis – Typically 1% agarose – Stained with EtBr or SYBR Green (glows in UV light). – A DNA “ladder” of known fragment lengths is used as a size reference. (Figure labels: − and + poles.) Gel picture: http://www.pharmaceutical-technology.com/projects/roche/images/roche3.jpg PCR setup: http://arbl.cvmbs.colostate.edu/hbooks/genetics/biotech/gels/agardna.html
The Sanger method of DNA sequencing (Figure labels: OH, terminator, X-ray sequencing gel.) Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf
Automated sequencing • The major breakthrough in sequencing came through automation. • Fluorescent dyes. • Laser-based scanning. • Capillary electrophoresis. • Computer-based base calling and assembly. Images: http://www.idtdna.com/support/technical/TechnicalBulletinPDF/DNA_Sequencing.pdf
Handout exercise: “base-calling” • Handout: Chromatogram • Groups of 2-3. • Tasks: – Identify “difficult” regions. – Identify “difficult” sequence stretches. – Try to estimate the best interval to use.
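The last task, estimating the best interval, is done by eye on the handout. If per-base quality scores were available, the same idea could be sketched in code. The example below is only an illustration under that assumption: the quality values and the threshold of 20 are invented, and the function name is hypothetical.

```python
def best_interval(qualities, threshold=20):
    """Return (start, end) of the longest run with quality >= threshold.

    Indices are 0-based and end is exclusive; returns None if no base passes.
    """
    best = None
    run_start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, q in enumerate(list(qualities) + [float("-inf")]):
        if q >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best

# Invented scores: poor at both ends, with one weak base in the middle.
scores = [5, 8, 25, 30, 32, 12, 28, 35, 40, 38, 30, 9, 6]
interval = best_interval(scores)
print(interval)
```

Reads typically have low-quality stretches at the start and end, which is why trimming to a single good interval is a common first step.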
Sequence read mapping
DNA sequencing - history
1972 Recombinant DNA technology [Paul Berg].
1976 The first sequenced genome, the bacteriophage MS2 [Walter Fiers et al.].
1977 DNA sequencing by chemical cleavage [Allan Maxam & Walter Gilbert]; DNA sequencing by enzymatic synthesis [Fred Sanger].
1982 GenBank (public database of DNA sequences).
1987 The first automated sequencing machine, Prism 373 [Applied Biosystems].
1990 The Human Genome Project is launched.
1995 The first genome of a free-living organism, the bacterium Haemophilus influenzae (1.8 Mb) [The Institute for Genomic Research (TIGR)].
1996 The first genome of a eukaryote, baker's yeast, Saccharomyces cerevisiae (12.1 Mb) [International consortium].
1998 The first genome of an animal, the roundworm Caenorhabditis elegans (97 Mb) [Sanger Center and collaborators].
2001 The first “drafts” of the human genome (3 Gb) [Human Genome Project Consortium (Nature, 15 Feb) + Celera (Science, 16 Feb)].
15 Dec 2011 GenBank release 187 contains 146,413,798 sequences with a total of 135,117,731,375 nucleotides (the files take up 568 GB).
Cost of sequencing
Background - Nucleotide databases
• GenBank, http://www.ncbi.nlm.nih.gov/Genbank/
  – National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), USA
  – Established in 1982.
• EMBL, http://www.ebi.ac.uk/embl/
  – European Bioinformatics Institute (EBI), England
  – Established in 1980 by the European Molecular Biology Laboratory, Heidelberg, Germany
  – Now part of ENA, the European Nucleotide Archive, http://www.ebi.ac.uk/ena/
• DDBJ, http://www.ddbj.nig.ac.jp/
  – National Institute of Genetics, Japan
• Together they form
  – International Nucleotide Sequence Database Collaboration, http://www.insdc.org/
Nucleotide database growth • Growth is roughly exponential • But the doubling time has increased from ~20 months (1990s) to ~50 months (2010) • NB: The databases are public: there are no restrictions on the use of the data within.
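To get a feel for what these doubling times mean, they can be converted to yearly growth factors. This is a small worked sketch under the assumption of purely exponential growth; the function names are illustrative.

```python
import math

def annual_growth_factor(doubling_months):
    """Factor by which the database grows in 12 months."""
    return 2 ** (12 / doubling_months)

def months_to_grow(fold, doubling_months):
    """Months needed to grow by the given fold at a fixed doubling time."""
    return doubling_months * math.log2(fold)

for months in (20, 50):
    print(f"doubling every {months} months: "
          f"x{annual_growth_factor(months):.2f} per year, "
          f"{months_to_grow(10, months):.0f} months for a 10-fold increase")
```

A doubling time of ~20 months corresponds to roughly 50% growth per year, while ~50 months corresponds to roughly 18% per year.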
FASTA format (Handout)
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC
CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA
AGGGGGGCGACTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG
CACTGACCATCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC
CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG
CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA
GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCGGGGGAC
GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC
GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG
GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG
TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC
CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC
ATCTCTTGTCTGTGACTCCATCTGCCCCCATACTCTCCCCATAACTG
TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC
TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC
TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG
AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG
CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA
GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCCCCTTGCTCACC
ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT
CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG
CACCGTCCTTACTGCCAAGTACCGTTAA
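The FASTA format is simple enough to read with a few lines of code: a line starting with ">" gives the name of a sequence, and all following lines up to the next ">" belong to that sequence. The sketch below is a minimal reader for illustration only; the function name is not from the slides, and the example text uses shortened versions of the two entries shown on the handout.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence line: add to current record
    # Join the wrapped sequence lines into one string per record.
    return {n: "".join(parts) for n, parts in records.items()}

example = """>alpha-D
ATGCTGACCGAC
TCTGACAAG
>alpha-A
ATGGTGCTG
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

For real work, an established library parser is usually a better choice than a hand-written one, since it also handles edge cases such as very large files.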
GenBank format • Originates from the GenBank database. • Contains both a DNA sequence and annotation of features (e.g. location of genes). (handout)
GenBank format - HEADER
LOCUS       CMGLOAD      1185 bp    DNA     linear   VRT 18-APR-2005
DEFINITION  Cairina moschata (duck) gene for alpha-D globin.
ACCESSION   X01831
VERSION     X01831.1  GI:62724
KEYWORDS    alpha-globin; globin.
SOURCE      Cairina moschata (Muscovy duck)
  ORGANISM  Cairina moschata
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Anseriformes; Anatidae; Cairina.
REFERENCE   1  (bases 1 to 1185)
  AUTHORS   Erbil,C. and Niessing,J.
  TITLE     The primary structure of the duck alpha D-globin gene: an unusual
            5' splice junction sequence
  JOURNAL   EMBO J. 2 (8), 1339-1343 (1983)
   PUBMED   10872328
COMMENT     Data kindly reviewed (13-NOV-1985) by J. Niessing.
GenBank format - ORIGIN section (handout): the section opens with the keyword ORIGIN and holds the sequence itself in lowercase, in rows of 60 bases split into blocks of 10. Each row starts with the position of its first base (1, 61, 121, ..., 1141). The record ends with //.
GenBank format - FEATURE section
FEATURES             Location/Qualifiers
     source          1..1185
                     /organism="Cairina moschata"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:8855"
     CAAT_signal     20..24
     TATA_signal     69..73
     precursor_RNA   101..1114
                     /note="primary transcript"
     exon            101..234
                     /number=1
     CDS             join(143..234,387..591,939..1067)
                     /codon_start=1
                     /product="alpha D-globin"
                     /protein_id="CAA25966.2"
                     /db_xref="GI:4455876"
                     /db_xref="GOA:P02003"
                     /db_xref="InterPro:IPR000971"
                     /db_xref="InterPro:IPR002338"
                     /db_xref="InterPro:IPR002340"
                     /db_xref="InterPro:IPR009050"
                     /db_xref="UniProt/Swiss-Prot:P02003"
                     /translation="MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFP
                     HFDLHPGSEQVRGHGKKVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLA
                     QCFQVVLAAHLGKDYSPEMHAAFDKFLSAVAAVLAEKYR"
     repeat_region   227..246
                     /note="direct repeat 1"
     intron          235..386
                     /number=1
     repeat_region   289..309
                     /note="direct repeat 1"
     exon            387..591
                     /number=2
     intron          592..939
                     /number=2
     exon            940..1114
                     /number=3
     polyA_signal    1095..1100
                     1114
Exercise: GenBank • Work in groups of 2-3 people. • The exercise guide is linked from the course programme. • Read the guide carefully - it contains a lot of information about GenBank.
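The CDS location join(143..234,387..591,939..1067) says that the coding sequence is spliced together from three segments of the genomic sequence. The sketch below shows how such a location could be interpreted; it is an illustration only, with hypothetical function names, and it handles just plain forward-strand locations of the form a..b and join(...), not complement(...) or partial locations marked with < or >.

```python
import re

def parse_location(location):
    """Return (start, end) pairs, 1-based and inclusive, as GenBank counts."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def spliced_length(location):
    """Total number of bases covered by all segments."""
    return sum(end - start + 1 for start, end in parse_location(location))

def extract(sequence, location):
    """Cut the segments out of a sequence and join them in order."""
    # GenBank positions start at 1; Python string indices start at 0.
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))

cds = "join(143..234,387..591,939..1067)"
length = spliced_length(cds)
print(parse_location(cds))
print(length, "bases =", length // 3, "codons")
```

The three segments add up to 426 bases, i.e. 142 codons. That matches the 141 amino acids in the /translation qualifier plus one stop codon.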