Gene Structure and Identification Eukaryotic Genes and Genomes

Gene Structure and Identification
• Eukaryotic Genes and Genomes
• Gene Finding
Previous reading: 1.3, 9.1-9.6
Reading: 10.2, 10.4, 10.6-8
BIO 520 Bioinformatics
Jim Lund

Complex Genome DNA
• ~10% highly repetitive (300 Mbp)
  – NOT GENES
• ~25% moderate repetitive (750 Mbp)
  – Some genes
• ~10% exons and introns (355 Mbp)
• 45% = ?
  – Regulatory regions
  – Intergenic regions

Eukaryotic Gene Expression
[Figure: gene expression diagram. Labels: enhancer, promoter, transcribed region (Exon 1, Intron 1, Exon 2, etc.), terminator; transcription by RNA Polymerase II gives the primary transcript (5' to 3'); cap (7mG), splice, cleave/polyadenylate (An); transport; translation into polypeptide (N to C).]

Yeast
• ORFs = genes!
• Small ORFs (RNA genes)
• Regulatory sequences

Eukaryotes, cont’d
• Fungi
  – introns common, short relative to exons
  – promoter/enhancer
  – genome dense
• “Large” eukaryotes
  – introns common, LONGER than exons
  – promoter/enhancer
  – genome sparse

Intron Prevalence
[Figure: plot; x-axis: introns, y-axis: % of genes]

Intron Size
[Figure: plot; x-axis: introns, y-axis: % of genes]

Exon Size
[Figure: plot; x-axis: exon size (bps), y-axis: % of genes]

Fungi
Sew together exons
  – ORF regions
  – consensus sequences
  – domain/polypeptide matches

Exon/Intron Structure
Unspliced: CCACATTgtn(30-10,000)an(5-20)agCAGAA...
Spliced: CCACATTCAGAA...
Translation: Pro His Ser Glu ...

Alternative Splice
Unspliced: CCACATTgtn(30-10,000)an(5-20)agcagAA...
Spliced: CCACATTAA...
Translation: Pro His STOP
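
The effect of choosing a different acceptor site can be checked with a small sketch. It assumes a made-up intron (filler bases of arbitrary length) between the slide's exon sequences, and the helper names `splice` and `translate` are illustrative only. Moving the acceptor three bases downstream (from `ag` to `agcag`) turns Pro-His-Ser-Glu into Pro-His-STOP.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(pre_mrna, donor, acceptor):
    """Remove the intron pre_mrna[donor:acceptor]; it must start with GT and end with AG."""
    intron = pre_mrna[donor:acceptor].upper()
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("intron does not follow the GT...AG rule")
    return (pre_mrna[:donor] + pre_mrna[acceptor:]).upper()

def translate(seq):
    """Translate frame 1 into one-letter amino acids, stopping at the first stop codon (*)."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# Hypothetical intron: gt + filler + branch-point a + filler + ag.
exon1 = "CCACATT"
intron = "gt" + "c" * 30 + "a" + "c" * 10 + "ag"
pre = exon1 + intron + "CAGAA"

donor = len(exon1)
acceptor1 = donor + len(intron)   # splice right after the first "ag"
acceptor2 = acceptor1 + 3         # alternative acceptor: after "agcag"

mrna1 = splice(pre, donor, acceptor1)
mrna2 = splice(pre, donor, acceptor2)
print(mrna1, translate(mrna1))    # CCACATTCAGAA PHSE
print(mrna2, translate(mrna2))    # CCACATTAA PH*
```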

Gene prediction targets
• Internal exons (donor-acceptor)
• Initial exons (5’-donor)
• Terminal exons (acceptor-3’)
• Single exon genes (5’-3’)

Gene prediction
• Sequence based
  – Consensus sites
  – Signal sequences
• Homology
  – Confirm prediction is a protein
• Known coding sequences
  – cDNAs, SAGE
• Comparative analysis
  – Identify exons, promoter/enhancer elements

Codon Bias/Nucleotide Frequency: useful?
• High bias = high confidence
• Low bias = low confidence

Finding Functional Sequences
• Known Consensus Sequences
• Consensus Sequence Generation
• Functional Tests

Describing consensus sequences
• Position Weight Matrices
  – Sequence Logos
• Hidden Markov Models
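
As a rough illustration of a position weight matrix, the sketch below builds log-odds scores from a handful of aligned sites and slides the matrix along a sequence. The TATA-like sites, the pseudocount of 1, and the uniform 0.25 background are all assumptions chosen for the example, not values from the lecture.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per aligned column."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for a window as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

# Hypothetical aligned binding sites (toy data).
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG", "TTTAAA"]
pwm = build_pwm(sites)

seq = "GCGCGCTATAAAGCGC"
start = best_hit(pwm, seq)
print(start, seq[start:start + len(pwm)])
```

A sequence logo displays the same per-column frequencies graphically, with letter height scaled by information content.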

Translation Initiation Sites

Splicing Consensus
• Vertebrate (numbers = % occurrence of that base): A(64) G(73) GT A(62) A(68) G(84) T(63) … Y(80) N Y(80) Y(87) R(75) A Y(95) … C(65) AG NN
• Fungi: GTRNGT(N){30-1000}CTRAC(N){5-15}YAG
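
The fungal pattern maps almost directly onto a regular expression. The sketch below is one possible encoding, assuming IUPAC codes R = A/G, Y = C/T, N = any base; the test gene is invented, and the lazy quantifiers (which pick the nearest branch point and acceptor) are a design choice for the example rather than a biological rule.

```python
import re

# GTRNGT (N){30-1000} CTRAC (N){5-15} YAG, written with explicit base classes.
FUNGAL_INTRON = re.compile(
    r"GT[AG][ACGT]GT"      # 5' splice site: GTRNGT
    r"[ACGT]{30,1000}?"    # intron body, shortest match first
    r"CT[AG]AC"            # branch point: CTRAC
    r"[ACGT]{5,15}?"       # spacer between branch point and acceptor
    r"[CT]AG"              # 3' splice site: YAG
)

def find_candidate_introns(seq):
    """Return (start, end) spans of non-overlapping pattern matches."""
    return [(m.start(), m.end()) for m in FUNGAL_INTRON.finditer(seq.upper())]

# Invented two-exon gene for illustration.
exon1 = "ATGGCC"
intron = "GTATGT" + "T" * 40 + "CTAAC" + "T" * 8 + "TAG"
exon2 = "GGCTAA"
gene = exon1 + intron + exon2

hits = find_candidate_introns(gene)
start, end = hits[0]
spliced = gene[:start] + gene[end:]
print(hits, spliced)
```

A pattern match only nominates a candidate; real gene finders weigh such hits against ORF and homology evidence.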

Linguistic approach to combining gene features
• Non-repetitive DNA!!
• Long ORF
  – similar to known protein
• ORF extended by “reasonable” splices
• ORF begins with “good” ATG
• Promoter/terminator flanks

DATABASE SEARCH
• BLASTN
  – DNA:DNA comparison (ALWAYS!)
  – Not sensitive (DNA conservation low)
• BLASTX/TBLASTX
  – BLASTX: 6-frame translation of the query vs. a polypeptide database
  – TBLASTX: 6 frames of the query vs. 6 frames of a DNA database
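
The six-frame translation that BLASTX and TBLASTX apply to a DNA query can be sketched as follows. This is only an illustration of the idea using the standard genetic code; the function names are made up and it is not how BLAST itself is implemented.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate every complete codon; stop codons appear as '*'."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames

frames = six_frames("ATGGCCTAA")
for name, peptide in frames.items():
    print(name, peptide)
```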

Protein Database Matches
• Very helpful for the “known”
• What about the unknown???

Transcript Initiation
• Basal Promoters
• Enhancers/Silencers/Regulatory Sites
  – Boundary elements?
• Transcription Initiation
  – Prokaryotes vs. eukaryotes
  – Organism-to-organism

Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
• TATA-box: -25 to -30, TBP
• CCAAT-box: -212 to -57, CTF/NF1
• GC-box: -164 to +1, SP1
• Cap signal (KCWKYYYY): +1 to +5
[Figure: promoter diagram with GC, CAAT, and TATA elements upstream of +1]

Basal Promoter Analysis
Cao and Moi, Ped Res 51: 415-421 (2002)

mRNA processing
• Exon/Intron
  – Alternate splicing
• Polyadenylation/Cleavage
• Stability

PolyA sites
• Metazoans
  – AATAAA, ATTAAA
    • 15-20 bps 5’ of polyA addition site.
  – YGTGTTYY (diffuse GT-rich sequence)
  – 100-700 bps 3’ UTR typical.
• Yeast -> different
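
A simple scan for the two metazoan hexamers might look like the sketch below. It assumes the 15-20 bp spacing is measured from the end of the hexamer to the polyA addition site, which the slide does not specify, and the test sequence is invented.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")

def find_polya_signals(seq):
    """Return (start, hexamer, (site_min, site_max)) for each signal.

    site_min/site_max bracket the predicted polyA addition site,
    assuming it lies 15-20 bp downstream of the hexamer's end.
    """
    hits = []
    for match in POLYA_SIGNAL.finditer(seq.upper()):
        start = match.start()
        end = start + 6
        hits.append((start, match.group(1), (end + 15, end + 20)))
    return hits

# Invented 3' UTR fragment.
utr = "GGGG" + "AATAAA" + "C" * 30
signals = find_polya_signals(utr)
print(signals)
```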

Translation
• Initiation site
  – 1st AUG used 95% of the time.
• Translational regulatory elements
  – translational enhancers
  – upstream ORFs

Tools-WWW
• Genscan
• Genie
• GRAIL II: integrated gene parsing
• GenLang
• HMMGene (lock ESTs, etc.)
• GENEMARK

Hidden Markov Models
• Probabilistic Models
  – Applicable to linear sequences
  – P(all states) = 1; infer the probabilities of the hidden states from the observed sequence (the states themselves are not observed)
  – Work best when local correlations unimportant
• Genefinding, phylogeny, secondary structure, genetic mapping
• Parameters are set using a “Training Set” of gene annotations
• Quantitative probabilities
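
To make the idea of inferring hidden states concrete, here is a minimal Viterbi decoder for a toy two-state model. The states (N = noncoding, C = coding) and every probability below are invented for illustration; a real gene finder would estimate them from a training set and use many more states.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq (log-space Viterbi)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(seq)):
        scores.append({})
        back.append({})
        for s in states:
            # Best previous state for arriving in state s at position t.
            prev = max(states, key=lambda p: scores[t - 1][p] + math.log(trans_p[p][s]))
            scores[t][s] = (scores[t - 1][prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][seq[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))

# Hypothetical parameters: "coding" is GC-rich, "noncoding" is AT-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

seq = "AAAAAA" + "GCGCGCGC" + "AAAAAA"
path = viterbi(seq, states, start_p, trans_p, emit_p)
print(seq)
print(path)
```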

Accuracy Assessment
• PP = predicted coding (= TP + FP)
• AP = “real” positive (= TP + FN)
• TP = number correct positive
• TN = number correct negative
• FP = number false positive
• FN = number false negative
• Sensitivity = Sn = TP/AP
• Specificity = Sp = TP/PP
• Approximate Correlation (AC) = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
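
These measures are easy to compute from the four counts. The sketch below assumes the four-term form of the approximate correlation, which includes TN/(TN+FP) alongside the other three ratios; the example counts are invented.

```python
def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, AC) from nucleotide- or exon-level counts.

    Sn = TP / (TP + FN)   (fraction of real coding that was predicted)
    Sp = TP / (TP + FP)   (fraction of predicted coding that is real)
    AC = average of four conditional probabilities, rescaled to [-1, 1].
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    acp = (tp / (tp + fn) + tp / (tp + fp)
           + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = (acp - 0.5) * 2
    return sn, sp, ac

# Invented counts for a 1000-bp test region.
sn, sp, ac = accuracy(tp=80, fp=10, tn=890, fn=20)
print(f"Sn={sn:.3f} Sp={sp:.3f} AC={ac:.3f}")
```

Note that Sp here is the gene-finding convention TP/PP, not the TN-based specificity used in other fields.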

Accuracy Levels
[Figure: prediction accuracy at the bp level and the exon level]

NEXT
• Regulatory Sequences
  – Known Consensus Sequences
  – Consensus Sequence Generation
  – Functional (Lab) Data
• A few examples