Genome Annotation
Haixu Tang, School of Informatics

Genome and genes
• Genome: an organism’s genetic material (analogy: a car encyclopedia)
• Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA (analogy: the chapters that describe how to make the components of a car, or how to use and drive it)

Gene Prediction: Computational Challenge aatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcggctatgcaagctgggatccgatgactatgct aagctgggatccgatgacaatgcggctatgctaatggtcttgggattt accttggaatgctaagctgggatccgatgacaatgcggctatgctaatgaat ggtcttgggatttaccttggaatatgctaatgcggctatgctaagctgggat ccgatgacaatgcggctatgctaatgcggctatgcaagctgggatccg atgactatgctaagctgcggctatgctaatgcggctatgctaagctgggatc cgatgacaatgcggctatgctaatgcggctatgcaagctgggatcctg cggctatgctaatggtcttgggatttaccttggaatgctaagctgggatccg atgacaatgcggctatgctaatggtcttgggatttaccttggaatatg ctaatgcatgcggctatgctaagctgggat ccgatgacaatgcggctatgctaatgcggctatgcaagctgggatccg atgactatgctaagctgcggctatgctaatgcggctatgctaagctcatgcg gctatgctaagctgggaatgcggctatgctaagctgggatccgatgacaatg catgcggctatgctaatgcggctatgcaagctgggatccgatgactatgcta agctgcggctatgctaatgcggctatgctaagctcggctatgctaatg gtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcggct atgctaatggtcttgggatttaccttggaatatgctaatgcggctatg ctaagctgggaatgcggctatgctaagctgggatccgatgacaatgcg gctatgctaatgcggctatgcaagctgggatccgatgactatgctaagctgc ggctatgctaatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcggctatgcaagctgggatccgatgactatgct aagctgggatccgatgacaatgcggctatgctaatggtcttgggattt accttggaatgctaagctgggatccgatgacaatgcggctatgctaatgaat ggtcttgggatttaccttggaatatgctaatgcggctatgctaagctgggat ccgatgacaatgcggctatgctaatgcggctatgcaagctgggatccg atgactatgctaagctgcggctatgctaatgcggctatgctaagctgggatc cgatgacaatgcggctatgctaatgcggctatgcaagctgggatcctg cggctatgctaatggtcttgggatttaccttggaatgctaagctgggatccg atgacaatgcggctatgctaatggtcttgggatttaccttggaatatg ctaatgcatgcggctatgctaagctgggat ccgatgacaatgcggctatgctaatgcggctatgcaagctgggatccg atgactatgctaagctgcggctatgctaatgcggctatgctaagctcatgcg gctatgctaagctgggaatgcggctatgctaagctgggatccgatgacaatg catgcggctatgctaatgcggctatgcaagctgggatccgatgactatgcta agctgcggctatgctaatgcggctatgctaagctcggctatgctaatg gtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcggct atgctaatggtcttgggatttaccttggaatatgctaatgcggctatg ctaagctgggaatgcggctatgctaagctgggatccgatgacaatgcg gctatgctaatgcggctatgcaagctgggatccgatgactatgctaagctgc ggctatgctaatgcggctatgctaagctcatgcgg Gene!

Gene Prediction: Computational Challenge
• Gene: a sequence of nucleotides coding for a protein
• Gene Prediction Problem: determine the beginning and end positions of genes in a genome

Central Dogma: DNA -> RNA -> Protein
DNA:     CCTGAGCCAACTATTGATGAA
  (transcription)
RNA:     CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
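The transcription and translation steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and treating transcription as a plain T-to-U substitution on the coding strand; the names `CODON_TABLE`, `transcribe` and `translate` are hypothetical helpers, not part of the lecture.

```python
from itertools import product

# Standard genetic code in the usual compact layout: codons are enumerated
# with bases in the order T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (assumption: simple T -> U replacement)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```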

Translating Nucleotides into Amino Acids
• Codon: 3 consecutive nucleotides
• 4^3 = 64 possible codons
• The genetic code is degenerate (redundant)
  – It includes start and stop codons
  – An amino acid may be coded by more than one codon (codon degeneracy)
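To make the counting concrete, the sketch below enumerates all 4^3 codons and tallies how many codons map to each amino acid. It assumes the standard genetic code; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
codon_to_aa = dict(zip(codons, AMINO))

# Number of codons per amino acid ("*" collects the stop codons).
degeneracy = Counter(codon_to_aa.values())

print(len(codons))      # 64 codons in total
print(degeneracy["L"])  # leucine is encoded by 6 codons
print(degeneracy["M"])  # methionine (the start codon ATG) by 1
print(degeneracy["*"])  # 3 stop codons
```

With 64 codons shared among 20 amino acids plus the stop signal, most amino acids necessarily have several synonymous codons.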

Codons
• In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations
• They systematically deleted nucleotides from DNA
  – Single and double deletions dramatically altered the protein product
  – Effects of triple deletions were minor
  – Conclusion: every triplet of nucleotides (a codon) codes for exactly one amino acid in a protein

Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 stop codons that (together with the start codon AUG, written ATG in DNA) delineate Open Reading Frames

Six Frames in a DNA Sequence
5’ CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 3’
3’ GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 5’
• stop codons: TAA, TAG, TGA
• start codon: ATG
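The six reading frames can be enumerated as three offsets on the given strand and three on its reverse complement. The following is a minimal sketch; `reverse_complement` and `six_frames` are hypothetical helper names, and the input is assumed to contain only A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split a sequence into codons for each of the six reading frames.

    Keys "+1", "+2", "+3" are offsets 0, 1, 2 on the given strand;
    "-1", "-2", "-3" are the same offsets on the reverse complement.
    """
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGAAATGA").items():
    print(name, codons)
```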

Open Reading Frames (ORFs)
• Detect potential coding regions by looking at ORFs
  – A genome of length n comprises about n/3 codons in each reading frame
  – Stop codons break the genome into segments between consecutive stop codons
  – The subsegments of these that start from the start codon (ATG) are ORFs
• ORFs in different frames may overlap
[Figure: a genomic sequence with an open reading frame running from ATG to TGA]
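The definition above translates directly into a scan of one reading frame: remember the first ATG seen since the last stop codon, and report an ORF when the next in-frame stop codon is reached. This is a sketch under that reading of the slide; `orfs_in_frame` is a hypothetical name and only one strand and one frame are examined per call.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_in_frame(dna, offset=0):
    """Return (start, end) pairs of ORFs in one reading frame.

    `start` is the index of the A in ATG, `end` is one past the stop codon,
    so dna[start:end] begins with ATG and ends with a stop codon.
    """
    dna = dna.upper()
    orfs = []
    start = None  # first ATG seen since the previous stop codon
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "ATG" and start is None:
            start = i
        elif codon in STOP_CODONS:
            if start is not None:
                orfs.append((start, i + 3))
            start = None
    return orfs


dna = "ATGAAATTTTAGCCCATGGGGTGA"
print(orfs_in_frame(dna, 0))  # [(0, 12), (15, 24)]
```

Calling the function with offsets 0, 1 and 2 on a sequence and on its reverse complement covers all six frames.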

Long vs. Short ORFs
• A long open reading frame may be a gene
  – At random, we should expect one stop codon every 64/3 ≈ 21 codons
  – However, genes are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds a certain threshold
  – This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
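The 64/3 figure follows from treating each codon as an independent draw in which 3 of the 64 codons are stops. Under that same assumption (uniform, independent codons, which real genomes only approximate), one can also estimate how unlikely a long stop-free stretch is. The function names below are illustrative.

```python
def expected_codons_between_stops(n_stop=3, n_codons=64):
    """Mean spacing of stop codons when every codon is equally likely."""
    return n_codons / n_stop


def prob_stop_free_run(k, n_stop=3, n_codons=64):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - n_stop / n_codons) ** k


print(round(expected_codons_between_stops(), 1))  # about 21.3 codons
# Chance that a random stretch of 100 codons (300 bp) has no stop codon:
print(prob_stop_free_run(100))                    # a bit under 1%
```

This is why a length threshold is a useful first filter, and also why it cannot be the only test: short genuine genes fall below any threshold strict enough to reject random ORFs.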

Testing ORFs: Codon Usage
• Create a 64-element hash table and count the frequencies of codons in an ORF
• Amino acids typically have more than one codon, but in nature certain codons are used more often
• Uneven use of the codons may characterize a real gene
• This compensates for the pitfalls of the ORF length test
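A 64-entry codon table for an ORF might be built as follows. This is a minimal sketch; `codon_usage` is a hypothetical helper that returns relative frequencies and silently skips codons containing characters other than A, C, G and T.

```python
from itertools import product


def codon_usage(orf):
    """Relative frequency of each of the 64 codons in an in-frame ORF."""
    orf = orf.upper()
    counts = {"".join(c): 0 for c in product("ACGT", repeat=3)}
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # ignore codons with ambiguous bases such as N
            counts[codon] += 1
    total = sum(counts.values())
    return {c: (n / total if total else 0.0) for c, n in counts.items()}


usage = codon_usage("ATGGCTGCTGCCTAA")
print(usage["GCT"], usage["GCC"], usage["GCA"])  # 0.4 0.2 0.0
```

In this toy ORF alanine is written as GCT twice and GCC once, and never as GCA or GCG, which is the kind of uneven usage the slide refers to.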

Codon Usage in Human Genome

Codon Usage in Mouse Genome

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24
Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31
Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20
Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38
Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25
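The "frac" column appears to be each codon's per-thousand count divided by the total for its amino acid. The sketch below checks that reading on the six serine codons from the table; `synonymous_fractions` is a hypothetical helper name.

```python
# Per-thousand frequencies of the six serine codons, taken from the table.
SER_PER_1000 = {
    "TCG": 4.31, "TCA": 11.44, "TCT": 15.70,
    "TCC": 17.92, "AGT": 12.25, "AGC": 19.54,
}


def synonymous_fractions(per_thousand):
    """Share of each codon among the codons listed for one amino acid."""
    total = sum(per_thousand.values())
    return {codon: round(v / total, 2) for codon, v in per_thousand.items()}


print(synonymous_fractions(SER_PER_1000))
```

The fractions only sum to 1 when every synonymous codon is listed. The table's leucine rows sum to 0.81 because only four of the six leucine codons are shown.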

Transcription in prokaryotes
[Figure: a prokaryotic gene drawn 5’ to 3’. The promoter lies upstream of the transcription start site. The transcribed region runs downstream from the transcription start site to the transcription stop site and contains the coding region (start codon to stop codon) flanked by untranslated regions.]

Microbial gene finding
• Microbial genomes tend to be gene rich (80%-90% of the sequence is coding sequence)
• Major problem: finding genes without a known homologue

An Open Reading Frame (ORF) is a sequence of codons that starts with a start codon, ends with a stop codon and has no stop codons in between.
Searching for ORFs: consider all 6 possible reading frames, 3 forward and 3 reverse.
Is the ORF a coding sequence?
1. Must be long enough (roughly 300 bp or more)
2. Should have the average amino-acid composition specific to the given organism
3. Should have the codon usage specific to the given organism

Gene finding using codon frequency
[Flow diagram: the input sequence is compared against the frequencies in coding regions and the frequencies in non-coding regions, and is then classified as a coding region or a non-coding region.]

Example

Codon position         A     C     T     G
1                     28%   33%   18%   21%
2                     32%   16%   21%   32%
3                     33%   15%   14%   38%
frequency in genome   31%   18%   19%   31%

Assume: the bases making up a codon are independent.

$$\frac{P(x \mid \text{in coding})}{P(x \mid \text{random})} = \prod_i \frac{P(A_i \text{ at } i\text{th position})}{P(A_i \text{ in the sequence})}$$

Score of AAAGAT:

$$\frac{.28 \cdot .32 \cdot .33 \cdot .21 \cdot .32 \cdot .14}{.31 \cdot .31 \cdot .31 \cdot .31 \cdot .31 \cdot .19}$$
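The ratio can be evaluated mechanically from the table. The sketch below assumes the table values as printed (including 32% for A at codon position 2) and the independence assumption stated on the slide; `coding_score` is a hypothetical name.

```python
# Base frequencies by codon position in coding regions (from the table).
POSITION_FREQ = {
    1: {"A": 0.28, "C": 0.33, "T": 0.18, "G": 0.21},
    2: {"A": 0.32, "C": 0.16, "T": 0.21, "G": 0.32},
    3: {"A": 0.33, "C": 0.15, "T": 0.14, "G": 0.38},
}
# Overall base frequencies in the genome (from the table).
GENOME_FREQ = {"A": 0.31, "C": 0.18, "T": 0.19, "G": 0.31}


def coding_score(seq):
    """Likelihood ratio P(seq | coding) / P(seq | random), read in frame."""
    score = 1.0
    for i, base in enumerate(seq.upper()):
        position = i % 3 + 1  # codon position 1, 2 or 3
        score *= POSITION_FREQ[position][base] / GENOME_FREQ[base]
    return score


print(round(coding_score("AAAGAT"), 2))  # about 0.51
```

A ratio below 1 means the sequence looks slightly more like background than like coding sequence in this frame; a ratio above 1 favours coding.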

Using codon frequency to find correct reading frame
Consider a sequence $x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 \ldots$ where $x_i$ is a nucleotide, and write $p_{abc}$ for the frequency of codon $abc$. Let

$$p_1 = p_{x_1 x_2 x_3}\, p_{x_4 x_5 x_6} \cdots$$
$$p_2 = p_{x_2 x_3 x_4}\, p_{x_5 x_6 x_7} \cdots$$
$$p_3 = p_{x_3 x_4 x_5}\, p_{x_6 x_7 x_8} \cdots$$

Then the probability that the $i$th reading frame is the coding frame is

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

Algorithm:
• Slide a window along the sequence and compute $P_i$
• Plot the results
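The windowed computation of P_i might look like the sketch below. Several details are assumptions not stated on the slide: products are accumulated in log space to avoid underflow, codons missing from the frequency table get a small floor value, every frame uses the same number of codons per window, and the window moves in steps of 3 so that frame labels stay aligned. The codon frequencies in the example are invented for illustration.

```python
import math


def frame_probabilities(window, codon_freq, floor=1e-6):
    """Return [P1, P2, P3] for one window, with P_i = p_i / (p1 + p2 + p3)."""
    window = window.upper()
    n_codons = (len(window) - 2) // 3  # same codon count for all three frames
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(n_codons):
            codon = window[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    # Normalise in log space so long windows do not underflow to zero.
    top = max(log_p)
    weights = [math.exp(v - top) for v in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]


def sliding_frame_probabilities(seq, codon_freq, window=12, step=3):
    """Compute the frame probabilities for each window position."""
    return [
        (start, frame_probabilities(seq[start:start + window], codon_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]


toy_freq = {"GCT": 0.5}  # hypothetical table: GCT is common, the rest rare
for start, probs in sliding_frame_probabilities("GCT" * 8, toy_freq):
    print(start, [round(p, 3) for p in probs])
```

The resulting list of (position, [P1, P2, P3]) tuples is what one would plot along the sequence.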

Eukaryotic gene finding
• On average, a vertebrate gene is about 30 kb long
• The coding region takes up about 1 kb
• Exon sizes vary from double-digit numbers of base pairs to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp, but both can be much longer

Exons and Introns
• In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)
• This makes computational gene prediction in eukaryotes even more difficult
• Prokaryotic genes generally lack introns: genes in prokaryotes are continuous

Gene Structure

Gene structure in eukaryotes
[Figure: a eukaryotic gene drawn 5’ to 3’. The promoter lies upstream of the transcription start site. The transcribed region runs to the transcription stop site and contains untranslated regions, the start codon, the initial exon, internal exons, the final exon and the stop codon. Introns begin with GT at the donor site and end with AG at the acceptor site.]

Central Dogma and Splicing
[Figure: a gene with alternating exons and introns (labels: intron 1, exon 2, intron 2, exon 3) passes through transcription, splicing and translation]
exon = coding, intron = non-coding
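Splicing itself is easy to state once the exon boundaries are known: keep the exon intervals and drop everything between them. The sketch below assumes 0-based, half-open exon coordinates on the pre-mRNA; the sequence and coordinates are invented for illustration.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (start, end), dropping the introns between them."""
    exons = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start < prev_end:
            raise ValueError("exon intervals must not overlap")
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript: exon, intron (GU...AG), exon.
pre_mrna = "AUGGCC" + "GUAAGUUUCAG" + "AAAUAA"
print(splice(pre_mrna, [(0, 6), (17, 23)]))  # AUGGCCAAAUAA
```

Gene prediction is the inverse problem: the intervals are unknown and must be inferred from the genomic sequence.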

Splicing Signals
Exons are interspersed with introns; an intron typically begins with GT (donor site) and ends with AG (acceptor site)
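The GT...AG rule gives a crude way to list candidate introns: pair every GT with every downstream AG. The sketch below does exactly that and nothing more; `candidate_introns` and its length limits are hypothetical, and in real genomes the vast majority of such pairs are not true splice sites, which is why the position-specific statistics on the following slides are needed.

```python
import re


def candidate_introns(dna, min_len=4, max_len=None):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and half-open, so dna[start:end] is the
    candidate intron including both dinucleotides.
    """
    dna = dna.upper()
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a - d
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((d, a))
    return spans


print(candidate_introns("CCGTAAACAGCC"))  # [(2, 10)]
```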

Splice site detection
[Figure: nucleotide percentages by position around the donor site, drawn 5’ to 3’]

Consensus splice sites
Donor: 7.9 bits
Acceptor: 9.4 bits

Promoters
• Promoters are DNA segments upstream of transcripts that initiate transcription
• The promoter attracts RNA polymerase to the transcription start site
[Figure: the promoter at the 5’ end, upstream of the transcript]

Two Approaches to Eukaryotic Gene Prediction
• Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).
• Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

Ribosomal Binding Site

Donor and Acceptor Sites: Motif Logos
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996) (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
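The "bits" figures measure the information content of the aligned sites, which is also what the letter heights in a sequence logo show. A minimal sketch of the calculation follows: for each aligned position, 2 bits minus the Shannon entropy of the base frequencies, summed over positions. It omits the small-sample correction used in published logos, and the three-column matrix is a made-up example, not real splice-site data.

```python
import math


def information_content(columns):
    """Total information (bits) of a DNA motif given per-position frequencies.

    Each column is a dict of base -> frequency summing to 1. A perfectly
    conserved position contributes 2 bits, a uniform one contributes 0.
    """
    total = 0.0
    for column in columns:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total


# Hypothetical motif: invariant G, invariant T, then an uninformative position.
toy_motif = [
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(information_content(toy_motif))  # 4.0
```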

Similarity-based gene finding
• Alignment of
  – Genomic sequence and (assembled) EST sequences
  – Genomic sequence and known (similar) protein sequences
  – Two or more similar genomic sequences

Expressed Sequence Tags
1. Start from a cell or tissue
2. Isolate mRNA and reverse transcribe it into cDNA
3. Clone the cDNA into a vector to make a cDNA library
4. Pick a clone and sequence the 5’ and 3’ ends of the cDNA insert (the 5’ EST and the 3’ EST)
5. Submit to dbEST

Central Dogma and Splicing
[Figure: a gene with alternating exons and introns (labels: intron 1, exon 2, intron 2, exon 3) passes through transcription, splicing and translation]
exon = coding, intron = non-coding

Splicing Sequence Alignment
[Figure: potential splicing sites]

Using Similarities to Find the Exon Structure
• A human EST (mRNA) sequence is aligned to different locations in the human genome
• Find the “best” path to reveal the exon structure of the human gene
[Figure: an EST sequence aligned against the human genome]
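As a rough illustration of how an EST reveals exon structure, the toy sketch below maps an EST onto a genome as a chain of exact-match blocks, extending each block greedily. This is a deliberately simplified stand-in and not the "best path" alignment the slide refers to: it tolerates no mismatches, it can misplace a boundary when an intron happens to continue the match, and all names and sequences are invented.

```python
def greedy_spliced_map(est, genome, min_block=4):
    """Map an EST onto a genome as ordered exact-match blocks.

    Returns a list of (start, end) genome intervals, or None if some part
    of the EST cannot be placed. Gaps between blocks are putative introns.
    """
    est, genome = est.upper(), genome.upper()
    blocks = []
    e = g = 0  # current position in the EST and in the genome
    while e < len(est):
        length = min(min_block, len(est) - e)
        best = None
        # Extend the block one base at a time while it still occurs
        # somewhere downstream of the previous block.
        while e + length <= len(est):
            pos = genome.find(est[e:e + length], g)
            if pos < 0:
                break
            best = (pos, pos + length)
            length += 1
        if best is None:
            return None
        blocks.append(best)
        e += best[1] - best[0]
        g = best[1]
    return blocks


exon1, intron, exon2 = "ATGGCCAAA", "GTAAGTCCCCTTTTTCAG", "TTCGATTGA"
genome = "CC" + exon1 + intron + exon2 + "GG"
est = exon1 + exon2
print(greedy_spliced_map(est, genome))  # [(2, 11), (29, 38)]
```

In this example the gap between the two blocks starts with GT and ends with AG, which is the kind of supporting signal a real spliced-alignment method would score when choosing among candidate paths.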

An annotated gene in human genome
