Basic Molecular Biology

• Structures of biomolecules
• How does DNA function?
• What is a gene?
• Computer scientists vs Biologists

Bioinformatics schematic of a cell

Macromolecule (Polymer)      Monomer
DNA                          Deoxyribonucleotides (dNTP)
RNA                          Ribonucleotides (NTP)
Protein or Polypeptide       Amino acid

Nucleic acids (DNA and RNA)
• Form the genetic material of all living organisms.
• Found mainly in the nucleus of a cell (hence “nucleic”).
• Contain phosphoric acid as a component (hence “acid”).
• They are made up of nucleotides.

Nucleotides
• A nucleotide has 3 components:
  • Sugar (ribose in RNA, deoxyribose in DNA)
  • Phosphoric acid
  • Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T) or Uracil (U)

Monomers of DNA
• A deoxyribonucleotide has 3 components:
  • Sugar: deoxyribose
  • Phosphoric acid
  • Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T)

Monomers of RNA
• A ribonucleotide has 3 components:
  • Sugar: ribose
  • Phosphoric acid
  • Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Uracil (U)

Nucleotides
[Figure: a nucleotide; labels: nitrogenous base, phosphate group, sugar]

DNA and RNA
[Figure: DNA (paired strands, base pairs A=T and G=C) next to RNA (single strand, with U in place of T)]
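The pairing rules on this slide (A with T, G with C) can be written as a small lookup table. The following is a minimal Python sketch; the names `DNA_PAIRS`, `complement` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement, written in the same left-to-right order."""
    return "".join(DNA_PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]

print(complement("ATAGCC"))          # TATCGG
print(reverse_complement("ATAGCC"))  # GGCTAT
```

The reversal is needed because the two strands of a double helix run in opposite directions.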

Proteins
• Composed of a chain of amino acids.
• 20 possible R groups:

     R
     |
H2N--C--COOH
     |
     H

Dipeptide
The C--NH link in the middle is a peptide bond:

     R  O      R
     |  ||     |
H2N--C--C--NH--C--COOH
     |         |
     H         H

Protein structure
• Linear sequence of amino acids folds to form a complex 3-D structure.
• The structure of a protein is intimately connected to its function.

Structure -> Function
• It is the 3-D shape of proteins that gives them their working ability: generally speaking, the ability to bind with other molecules in very specific ways.

• DNA: information store
• RNA: information store and catalyst
• Protein: superior catalyst

DNA in action
• Questions about DNA as the carrier of genetic information:
  • What is the information?
  • How is the information stored in DNA?
  • How is the stored information used?
• Answers:
  • Information = gene → phenotype
  • Information is stored as nucleotide sequences...
  • ...and used in protein synthesis.

• How does the series of chemical bases along a DNA strand (A/T/G/C) come to specify the series of amino acids making up the protein?

The need for an intermediary
• Fact 1: Ribosomes are the sites of protein synthesis.
• Fact 2: Ribosomes are found in the cytoplasm.
• Question: How does information ‘flow’ from DNA to protein?

The Intermediary
• Ribonucleic acid (RNA) is the “messenger”.
• The “messenger RNA” (mRNA) can be synthesized on a DNA template.
• Information is copied (transcribed) from DNA to mRNA. (TRANSCRIPTION)

Biological functions of RNA
[Figure: DNA → transcription → rRNA, mRNA, tRNA → ribosome → translation → protein]
• Mediators of protein synthesis
  • Messenger RNA (mRNA)
  • Transfer RNA (tRNA)
  • Ribosomal RNA (rRNA)
• Structural molecule: ribosomal RNA
• Catalytic molecule: ribozyme
• Guide molecule: primer of DNA replication, protein degradation (tmRNA)…
• Ribonucleoprotein (complex of RNA and protein): mRNA editing, mRNA splicing, protein transport…

Transcription
• The DNA is contained in the nucleus of the cell. A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA.
• The mRNA then exits from the cell nucleus. Its destination is a molecular workbench in the cytoplasm, a structure called a ribosome.

Principal steps of transcription
1. RNA polymerase binds the DNA at random and searches for a promoter (5’ → 3’)
2. Opening of the DNA
3. Initiation of polymerization
4. Elongation:
   1. 20-50 nucleotides/sec
   2. 1 error per 10^4 nucleotides
5. Termination (at the termination signal)
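The elongation figures above (20-50 nucleotides per second, about one error per 10^4 nucleotides) allow a quick back-of-the-envelope estimate. This is a sketch only; the 3,000-nucleotide transcript length is a hypothetical value chosen for illustration, and the function names are not from the slides.

```python
def transcription_time_s(length_nt: int, rate_nt_per_s: float) -> float:
    """Seconds needed to elongate a transcript of the given length."""
    return length_nt / rate_nt_per_s

def expected_errors(length_nt: int, error_rate: float = 1e-4) -> float:
    """Expected number of misincorporated nucleotides (1 error per 10^4 by default)."""
    return length_nt * error_rate

length = 3_000  # hypothetical transcript length, in nucleotides
print(transcription_time_s(length, 50))  # fastest rate quoted on the slide
print(transcription_time_s(length, 20))  # slowest rate quoted on the slide
print(expected_errors(length))
```

Under these assumptions the transcript takes between one and two and a half minutes to make and carries, on average, well under one error.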

RNA polymerase
• It is the enzyme that brings about transcription by going down the line, pairing mRNA nucleotides with their DNA counterparts.
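The pairing that RNA polymerase performs can be sketched as a per-base mapping from the template DNA strand to RNA. This is a simplified illustration (no promoter search, no termination signal); `TEMPLATE_TO_RNA` and `transcribe` are illustrative names, and the template is assumed to be written 3'→5' so that the mRNA comes out 5'→3'.

```python
# Each template DNA base selects one RNA base; note that A selects U, not T.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') from a template strand written 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())

print(transcribe("AAAGCTACCTATCGGTTT"))  # UUUCGAUGGAUAGCCAAA
```

The resulting mRNA has the same sequence as the non-template (coding) DNA strand, with U in place of T.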

Promoters
• Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation.
[Figure: promoter on the DNA, drawn 5’ → 3’]
• The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated.

Promoter
• So a promoter sequence is the site on a segment of DNA at which transcription of a gene begins; it is the binding site for RNA polymerase.

Termination site of transcription

Next question…
• How do I interpret the information carried by mRNA?
• Think of the sequence as a sequence of “triplets”.
• Think of AUGCCGGGAGUAUAG as AUG-CCG-GGA-GUA-UAG.
• Each triplet (codon) maps to an amino acid or to a stop signal.
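Reading a sequence as triplets is a simple slicing operation. A minimal sketch, assuming the reading frame starts at the first base; `split_codons` is an illustrative name.

```python
def split_codons(mrna: str) -> list[str]:
    """Split an mRNA string into consecutive triplets, dropping 1-2 leftover bases."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

print("-".join(split_codons("AUGCCGGGAGUAUAG")))  # AUG-CCG-GGA-GUA-UAG
```

Starting one or two bases later would give a different reading frame and therefore a different set of codons.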

Translation: mRNA → protein
• Codons UAA, UAG and UGA are stop codons because there is no corresponding tRNA (exception…);
• Codon AUG codes for the initiator methionine (exception);
• The code is almost universal.

The Genetic Code
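The standard genetic code can be stored as a 64-entry dictionary and used to translate an mRNA. The sketch below uses one-letter amino acid codes and "*" for stop; it assumes the standard code only (the exceptions mentioned on the previous slide are ignored) and starts reading at the first base rather than searching for AUG. The names `CODON_TABLE` and `translate` are illustrative.

```python
# Standard genetic code, listed in the conventional U, C, A, G order.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGCCGGGAGUAUAG"))  # MPGV
```

Here AUG, CCG, GGA and GUA give Met, Pro, Gly and Val, and UAG ends the chain.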

Translation
• At the ribosome, both the message (mRNA) and raw materials (amino acids) come together to make the product (a protein).

Translation
• The sequence of codons is translated to a sequence of amino acids.
• How do amino acids get to the ribosomes?
  • They are brought there by a second type of RNA, transfer RNA (tRNA).

Translation
• Transfer RNA (tRNA): a different type of RNA.
  • tRNAs float freely in the cytoplasm.
  • Every amino acid has its own type of tRNA that binds to it alone.
  • Anticodon-codon binding is crucial.

tRNA
One end of the tRNA links with a specific amino acid, which it finds floating free in the cytoplasm. It employs its opposite end to form base pairs with nucleic acids: with a codon on the mRNA tape that is being read inside the ribosome.

Transfer RNA
• Up to 61 different tRNAs (one per sense codon), each composed of 75 to 95 nucleotides
• Recognition of a codon and binding to the corresponding amino acid
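Codon recognition relies on base pairing between the mRNA codon and the tRNA anticodon. A minimal sketch, assuming strict Watson-Crick pairing (wobble pairing at the third position is ignored); `RNA_PAIRS` and `anticodon` are illustrative names.

```python
# RNA base pairing: A pairs with U, G pairs with C.
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon: str) -> str:
    """Return the perfectly pairing anticodon, written 5'->3'."""
    return "".join(RNA_PAIRS[base] for base in reversed(codon.upper()))

print(anticodon("AUG"))  # CAU
```

The anticodon is the reverse complement of the codon because the two RNA segments pair in antiparallel orientation.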

Elongation of translation
• The ribosome moves by 3 nucleotides toward the 3’ end (elongation).
• In 1 second a bacterial ribosome adds 20 amino acids! Eukaryote: 2 amino acids/second!
• A stop codon (UAA, UAG, UGA) in the same reading frame ends the process; the ribosome breaks away from the mRNA.

Polyribosomes (polysomes): eukaryotes and prokaryotes
• Duration of protein synthesis: between 20 seconds and several minutes; multiple initiations
• ~80 nucleotides between 2 ribosomes
• Eukaryotes: 10 ribosomes / mRNA
• Prokaryotes: up to 300 ribosomes / mRNA
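The rates quoted for elongation (20 amino acids per second in bacteria, 2 per second in eukaryotes) and the ~80-nucleotide spacing between ribosomes can be combined into rough estimates. This is a sketch under simplifying assumptions: initiation and termination times are ignored, and the 300-residue protein and 900-nucleotide mRNA are hypothetical example values.

```python
def synthesis_time_s(protein_length_aa: int, rate_aa_per_s: float) -> float:
    """Elongation time only; initiation and termination are not modelled."""
    return protein_length_aa / rate_aa_per_s

def max_ribosomes(mrna_length_nt: int, spacing_nt: int = 80) -> int:
    """Upper bound on ribosomes that fit on one mRNA at the given spacing."""
    return mrna_length_nt // spacing_nt

print(synthesis_time_s(300, 20))  # bacterial rate
print(synthesis_time_s(300, 2))   # eukaryotic rate
print(max_ribosomes(900))
```

For these example values the elongation times fall in the same range as the slide's "between 20 seconds and several minutes".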

The gene and the genome
• A gene is a length of DNA that codes for a protein.
• Genome = the entire DNA sequence within the nucleus.

Estimate of the number of genes (proteins + tRNA + rRNA)

Organism                    Size (bp)        Number of genes   % coding    Remarks
E. coli                     4,639,221        4,397             87 %        Eubacterium
Methanococcus jannaschii    1,664,970        1,758             87 %        Archaeon
Saccharomyces cerevisiae    12,057,849       6,551             72 %
Arabidopsis thaliana        ~135,000,000     ~25,000           ?
Caenorhabditis elegans      87,567,338       17,687            21 %        1000 cells
Drosophila melanogaster     ~180,000,000     ~13,600           20 %        Core proteome: 8,000 (families)
Human                       ~3,000,000,000   20,000-25,000     4-7 % (?)
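One way to compare these genomes is gene density, the number of genes per megabase. The sketch below uses only the four rows whose sizes are given exactly in the table; `GENOMES` and `genes_per_megabase` are illustrative names.

```python
# (genome size in bp, number of genes) for the rows with exact sizes.
GENOMES = {
    "E. coli": (4_639_221, 4_397),
    "Methanococcus jannaschii": (1_664_970, 1_758),
    "Saccharomyces cerevisiae": (12_057_849, 6_551),
    "Caenorhabditis elegans": (87_567_338, 17_687),
}

def genes_per_megabase(size_bp: int, n_genes: int) -> float:
    """Gene density in genes per million base pairs."""
    return n_genes / (size_bp / 1_000_000)

for organism, (size_bp, n_genes) in GENOMES.items():
    print(f"{organism:26s} {genes_per_megabase(size_bp, n_genes):7.1f} genes/Mb")
```

The density drops sharply from the two prokaryotes to yeast and again to C. elegans, in line with the falling "% coding" column.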

Genome coding regions
Gene definition
• Nucleic acid sequence required for the synthesis of:
  • a functional polypeptide
  • a functional RNA (tRNA, rRNA, …)
• A gene coding for a protein generally contains:
  • a coding sequence (CDS)
  • control regions for transcription and translation (promoter, enhancer, poly A site…)
A gene contains coding and non-coding regions

More complexity
• The RNA message is sometimes “edited”.
• Exons are nucleotide segments whose codons will be expressed.
• Introns are intervening segments (genetic gibberish) that are snipped out.
• Exons are spliced together to form mRNA.

Standard structure of a vertebrate gene

RNA processing: Splicing
• Pre-messenger RNA contains coding regions (exon: expressed sequence) alternating with non-coding regions (intron: intervening sequence)
• Splicing: excision of the introns
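Splicing can be pictured as keeping the exon intervals of the pre-mRNA and joining them in order. This is a toy sketch: the exon coordinates are supplied by hand (real splice-site recognition is not modelled), the intron sequence is invented for the example, and `splice` is an illustrative name.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices (0-based start, end-exclusive) in order, discarding introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

exon_1 = "AUGCCG"
intron = "GUAAGUCAG"   # invented intervening sequence
exon_2 = "GGAGUAUAG"
pre_mrna = exon_1 + intron + exon_2

mature = splice(pre_mrna, [(0, 6), (15, 24)])
print(mature)  # AUGCCGGGAGUAUAG
```

Only the spliced message, not the pre-mRNA, is read codon by codon at the ribosome.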

Splicing: generalities
• High variability in the number of introns between genes within a given species. Ex: human: from 2 introns (insulin) to more than 100 introns (117 introns in collagen type VII).
• High variability in the number of introns between species. Ex: yeast genes have few introns (max 2 introns / gene).
• High variability in intron size (min 18 nucleotides; up to 300 kb);
• High variability in exon size (min 8 coding nucleotides);
• Human mitochondrial genes do not contain introns, but plant and fungal (yeast included) mitochondrial genes do; chloroplast genes contain introns; introns also exist in some prokaryotes!
• Importance in evolution: introns facilitate genetic recombination; linked with the notion of domains in proteins.
• Human average: 7 kb intron / 1 kb exon.

Alternative splicing
• The exon order is generally fixed (except for exon scrambling)

Summary of the whole process

Proteins
• Several levels from primary to quaternary structure
• Composed of amino acids

Protein Structure
• Proteins are polypeptides of 70-3000 amino acids
• This structure is (mostly) determined by the sequence of amino acids that make up the protein

Functional categories
Category            Examples
Enzymes             Kinase, protease
Transport           Hemoglobin
Regulation          Insulin, lac repressor
Storage             Casein, ovalbumin
Structure           Proteoglycan, collagen
Contraction         Actin, myosin
Protection          Immunoglobulins, toxins
Scaffold proteins   Grb2, Crk
Exotics             Resilin, adhesive proteins

Number of proteins in various organisms Organism Number Bacteria Yeast C. elegans Drosophila Human 500 -6’ 000 19’ 000 15’ 000 30’ 000 -1’ 000

Protein Structure

Example of structural motif: HTH
• Helix-Turn-Helix (HTH) motif: very common (prokaryotes and eukaryotes)
• DNA binding site for prokaryotes:

From Genome to Proteome
• Genome (human): about 25,000 genes
• Alternative splicing of mRNA: 10-42 %
• Post-translational protein modification (PTM), i.e. “after ribosomes”: increase in complexity, 5 to 10 fold
• Definition of PTM: any modification of a polypeptide chain that involves the formation or breakage of a covalent bond.
• Proteome (human): about one million proteins; several proteomes

Evolution
• Related organisms have similar DNA
  • Similarity in sequences of proteins
  • Similarity in organization of genes along the chromosomes
• Evolution plays a major role in biology
  • Many mechanisms are shared across a wide range of organisms
  • During the course of evolution existing components are adapted for new functions

Evolution of new organisms is driven by
• Diversity
  • Different individuals carry different variants of the same basic blueprint
• Mutations
  • The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
• Selection bias

Numerous possible effects of mutation

Original sequence
Amino acids   N-Phe Arg Trp Ile Ala Lys-C
mRNA          5’-UUU CGA UGG AUA GCC AAA-3’
DNA           3’-AAA GCT ACC TAT CGG TTT-5’
              5’-TTT CGA TGG ATA GCC AAA-3’
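The effect of a single-base change depends on what the altered codon encodes. The sketch below classifies a codon substitution using the standard genetic code; the three example mutations are hypothetical changes applied to codons of the original sequence above, and `classify_point_mutation` is an illustrative name.

```python
# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def classify_point_mutation(codon: str, mutant: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"      # same amino acid (or still a stop codon)
    if after == "*":
        return "nonsense"    # premature stop codon
    if before == "*":
        return "stop-loss"   # stop codon now encodes an amino acid
    return "missense"        # different amino acid

print(classify_point_mutation("CGA", "CGG"))  # silent: Arg -> Arg
print(classify_point_mutation("UGG", "UGA"))  # nonsense: Trp -> stop
print(classify_point_mutation("AUA", "AUG"))  # missense: Ile -> Met
```

Insertions and deletions are not covered here; they shift the reading frame and change every downstream codon.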

The Tree of Life
[Figure: the tree of life. Source: Alberts et al.]

Central dogma
[Figure: DNA → (transcription) → tRNA, rRNA, snRNA, mRNA; mRNA → (translation) → polypeptide]

Bioinformatics
• Studies the flow of information in biomedicine
• Information flow from genotype to phenotype: DNA → Protein → Function → Organism → Population → DNA
• Experimental flow for creating and testing models: Hypothesis → Experiment → Data → Conflict → Hypothesis

Computational Biology and Bioinformatics
The systematic development and application of computing systems and computational solution techniques to the analysis of biological data obtained by experiments, modeling, database search, and experimentation
• Explosion of experimental data
• Difficulty in interpreting data
• Need for new paradigms for computing with data and extracting new knowledge from it

Brief history of early bioinformatics
• Molecular sequences and databases: Dayhoff (atlas of proteins, 1965), Zuckerkandl & Pauling (1965), Bilofsky (GenBank, 1986), Hamm & Cameron (EMBL, 1986), Bairoch (Swiss-Prot, 1986)
• Molecular sequence comparison: Needleman & Wunsch (1970), Smith & Waterman (1981), Pearson & Lipman (FASTA, 1985), Altschul (BLAST, 1990)
• Multiple alignment and automatic phylogeny: Aho (common subsequence, 1976), Felsenstein (inferring phylogenies, 1981-1988), Sankoff & Cedergren (multiple comparison, 1983), Feng & Doolittle (Clustal, 1987), Gusfield (inferring evolutionary trees, 1991), Thompson (ClustalW, 1994)
• Motif search and discovery: Fickett (ORF, 1982), Ukkonen (approximate string matching, 1985), Jonassen (Pratt, 1995), Califano (Splash, 2000), Pevzner (WINNOWER, 2000)
• But also: RNA structure prediction, protein threading, protein folding…
Few fields, and extensive use of combinatorial/dynamic programming approaches

New biological data imply new bioinformatics fields
• Sequence: motif search, motif discovery, alignment…
  Data indexing, regular language, dynamic programming, HMM, EM, Gibbs sampling…
• Structure: RNA folding, protein threading, protein folding…
  Palindrome search, context-(free, sensitive) language, dynamic programming, combinatorial optimization…
• DNA chip: classification, clustering, feature selection, regulation network…
  NN, SVM, Bayesian inference, (hierarchical, k, Gaussian)-clustering, differential model…
• Proteomics: spectrum analysis, image pattern matching, probabilistic model…
• Bibliographic data: ontology, text mining…

Important sources of data and information
• GenBank: http://www.ncbi.nih.gov
• Swiss-Prot: http://us.expasy.org/sprot/relnotes
• Protein Data Bank (PDB): http://www.rcsb.org/pdb/home.do
• Stanford Microarray DB: http://smd.stanford.edu
• MedLine or PubMed
• http://genome.ucsc.edu or http://www.ebi.ac.uk/ensembl
• Journals: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research, Journal of Molecular Biology, Proteomics…

Computer scientists vs Biologists
• (Almost) Nothing is ever completely true or false in Biology.
• Everything is either true or false in computer science.
• Biologists strive to understand the very complicated, very messy natural world.
• Computer scientists seek to build their own clean and organized virtual worlds.
• Biologists are more data driven.
• Computer scientists are more algorithm driven.
• One consequence is CS www pages have fancier graphics while Biology www pages have more content.
• Biologists are obsessed with being the first to discover something.
• Computer scientists are obsessed with being the first to invent or prove something.
• Biologists are comfortable with the idea that all data has errors.
• Computer scientists are not.
• Computer scientists get high-paid jobs after graduation.
• Biologists typically have to complete one or more post-docs...

Computer Science is to Biology what Mathematics is to Physics