Gene Structure and Identification Eukaryotic Genes and Genomes

Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding 9/17/2020 Chuck Staben 1

Complex Genome DNA
• ~10% highly repetitive ( Mbp)
  – NOT GENES
• ~25% moderately repetitive ( Mbp)
  – Some genes
• ~25% exons and introns ( Mbp)
• 40% = ? ( Mbp)
  – Regulatory regions
  – Intergenic regions

Eukaryotic Gene Expression
Engraved on your Brain!!!!

Yeast ORFs = genes!
What don’t you find this way?
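To make the "ORF = gene" idea concrete, here is a minimal sketch of a forward-strand ORF scan. The function name `find_orfs`, the `min_codons` threshold, and the toy sequence are illustrative assumptions, not part of the lecture. The scan treats every uninterrupted ATG-to-stop stretch as a candidate gene, so anything that does not look like one long uninterrupted ORF is missed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; min_codons counts the ATG
    and the stop codon. The default of 100 is an arbitrary illustration.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Made-up toy sequence: one 7-codon ORF in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

A real scan would also cover the reverse complement; that is left out here to keep the sketch short.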

Eukaryotes, cont’d
• Fungi
• “large” Eukaryotes
  – introns
  – promoter/enhancer
• Where?
  – genome dense or sparse?
  – Intron average=
  – Exon average=
  – Promoter/enhancer
• Where/how arranged
  – genome sparse

Intron Prevalence

Intron Size

Exon Size

Fungi
Sew together exons
  – ORF regions
  – consensus sequences
  – domain/polypeptide matches

Exon/Intron Structure
CCACATTgtn(30-10,000)an(5-20)agCAGAA
...ProHisSerGlu...

Alternative Splice
CCACATTgtn(30-10,000)an(5-20)agcagAA
...CCACATTAA...
...ProHis_____
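The two splice slides can be worked through in code. The sketch below builds a genomic fragment matching the slide's pattern, removes the intron at either acceptor, and translates the result. The filler bases and their lengths, the function names `splice` and `translate`, and the partial codon table are assumptions made for illustration only.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODONS = {"CCA": "Pro", "CAT": "His", "TCA": "Ser", "GAA": "Glu", "TAA": "Stop"}

def splice(genomic, donor, acceptor):
    """Remove the intron genomic[donor:acceptor] and join the flanking exons."""
    intron = genomic[donor:acceptor].lower()
    if not (intron.startswith("gt") and intron.endswith("ag")):
        raise ValueError("intron must start with gt and end with ag")
    return (genomic[:donor] + genomic[acceptor:]).upper()

def translate(mrna):
    """Translate complete codons from position 0 using the partial table."""
    usable = len(mrna) - len(mrna) % 3
    return [CODONS[mrna[i:i + 3]] for i in range(0, usable, 3)]

# Filler lengths (30 and 10) are arbitrary picks inside the slide's ranges.
intron = "gt" + "c" * 30 + "a" + "t" * 10 + "ag"
genomic = "CCACATT" + intron + "CAGAA"
donor = len("CCACATT")
acceptor = donor + len(intron)

first = translate(splice(genomic, donor, acceptor))            # intron ends at the first ag
alternative = translate(splice(genomic, donor, acceptor + 3))  # intron ends at the ag of cag
print(first)
print(alternative)
```

Moving the acceptor three bases downstream changes the reading of the codon that straddles the junction, which is the point of the second slide.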

Codon Bias/Nucleotide Frequency: useful?
• Bias=0.97 means______
• Bias=0.03 means______

Consensus Sequences
• Promoter sites
• Intron/Exon
• Transcription Termination/PolyA
• Translation initiation

Finding Functional Sequences
• Known Consensus Sequences
• Consensus Sequence Generation
• Functional Tests

Consensus Inference
• Position Weight Matrices
• Sequence Logos
• ProfileScan
• Hidden Markov Models
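As a rough illustration of the first item, the sketch below builds a log-odds position weight matrix from a few aligned sites and uses it to score windows of a longer sequence. The three sites are the high-activity variants listed on the Functional Assay slide; treating them as an aligned training set is an assumption. The pseudocount, the uniform background, and the function names are also assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds for a window as long as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

sites = ["CCATGG", "CTATGG", "CCATGA"]
pwm = build_pwm(sites)
print(best_hit(pwm, "TTTTCCATGGTTTT"))
print(score(pwm, "CCATGG") > score(pwm, "CCCTGG"))
```

A matrix like this scores each position independently, so it cannot represent correlated positions.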

Translation Initiation Sites

Functional Assay
CCATGG  100
CCCTGG    0
CCTTGG    5
CCATAG    0
CTATGG   90
CCATGA   85
• Conservation
• Correlated Positions

Splicing Consensus
Vert: A64 G73 G T A62 A68 G84 T63 … Y80 N Y80 Y87 R75 A Y95 … C65 A G N N
Fungi: GTRNGT(N){30-1000}CTRAC(N){5-15}YAG
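The fungal pattern is close to a regular expression already. As a sketch, the IUPAC codes can be expanded into character classes and the result handed to Python's `re` module. The helper name `iupac_to_regex`, the toy sequence, and writing the slide's `{30-1000}` as `{30,1000}` are choices made here for illustration.

```python
import re

# IUPAC codes used on the slide: R = purine, Y = pyrimidine, N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand IUPAC letters; leave regex punctuation and digits unchanged."""
    return "".join(IUPAC.get(ch, ch) for ch in pattern)

FUNGAL_INTRON = re.compile(
    iupac_to_regex("GTRNGT(N){30,1000}CTRAC(N){5,15}YAG"))

# Made-up sequence: short exon, one intron matching the pattern, short exon.
toy = "ATGAAA" + "GTAAGT" + "C" * 40 + "CTAAC" + "T" * 8 + "CAG" + "GGG"
match = FUNGAL_INTRON.search(toy)
print(match.start(), match.end())
```

An exact consensus like this misses real introns that deviate at any position, which is one motivation for the weighted approaches covered elsewhere in the deck.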

Linguistic Approach
• Non-repetitive DNA!!
• Long ORF
  – similar to known protein
• ORF extended by “reasonable” splices
• ORF begins with “good” ATG
• Promoter/terminator flanks

DATABASE SEARCH
• BLASTN
  – What?
  – Limitations?
• BLASTX/TBLASTX
  – BLASTX does?
  – TBLASTX?

Protein Database Matches
Great for the “known”
What about the unknown???

Transcript Initiation
• Basal Promoters
• Enhancers/Silencers/Regulatory Sites
  – Boundary elements?
• Transcription Initiation
  – Prokaryotes vs Eukaryotes
  – Organism-to-Organism

Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
• TATA: ATATAA, -30, TBP
• CAAT: GGCCAATC, -75, CTF/NF1
• GC: GCCACACCC, -90, SP1
• +1
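A simple way to look for these elements is an exact-match scan that reports each hit relative to the transcription start. The sketch below assumes the +1 position is already known and uses a made-up sequence with the three motifs placed at the slide's positions. The names `ELEMENTS` and `scan_promoter` are illustrative.

```python
# Consensus strings taken from the slide; exact matching is a simplification.
ELEMENTS = {"TATA": "ATATAA", "CAAT": "GGCCAATC", "GC": "GCCACACCC"}

def scan_promoter(seq, tss):
    """Return (element, position relative to tss) for every exact match."""
    hits = []
    for name, motif in ELEMENTS.items():
        start = seq.find(motif)
        while start != -1:
            hits.append((name, start - tss))
            start = seq.find(motif, start + 1)
    return sorted(hits, key=lambda hit: hit[1])

# Made-up 110-base fragment with the start site at index 100.
toy = ("T" * 10 + "GCCACACCC" + "T" * 6 + "GGCCAATC" +
       "T" * 37 + "ATATAA" + "T" * 24 + "G" * 10)
print(scan_promoter(toy, tss=100))
```

Real promoter elements vary in sequence and spacing, so a scan like this is only a first pass.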

mRNA processing
• Exon/Intron
  – Alternate splicing
• Polyadenylation/Cleavage
• Stability

Poly A sites
• Metazoans
  – AATAAA
• Yeast: different

Translation
• Initiation site
• (Frameshifting)
• Translational regulatory elements
  – upstream ORFs
  – translational enhancers

Translation Sites
• Initiate at 5’-ATG
  – upstream ORF…regulatory
• (Frameshifting)
• Translation enhancers….

Integrated Genefinding
• Linguistic approach (our discussion)
• Probabilistic approaches
  – Discriminant analyses
  – MARKOV MODELS

Tools-WWW
• GRAIL II: integrated gene parsing
• GenLang
• GENIE
• HMMGene (lock ESTs, etc.)
• GENSCAN
• GENEMARK

Hidden Markov Models
• Probabilistic Models
  – Applicable to linear sequences
  – P(all states)=1, infer probabilities of all states from observed (hidden states unobserved)
  – Work best when local correlations unimportant
• Genefinding, phylogeny, secondary structure, genetic mapping
• Work best with “Training Set”
• Quantitative probabilities
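To show what "infer hidden states from observed" can look like in practice, here is a toy two-state model decoded with the Viterbi algorithm. Every number in it (start, transition, and emission probabilities) is invented for illustration; a real gene finder would estimate them from a training set and use many more states.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities below are made-up toy values, not trained parameters.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Most probable hidden-state path for seq, computed in log space."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("AT" * 5 + "GC" * 5)
print(path)
```

With these toy values the decoder labels the AT-rich half as noncoding and the GC-rich half as coding, switching state once at the boundary.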

Accuracy Assessment
PP = predicted coding
PN = predicted non-coding
AP = “real” positives
AN = “real” negatives
TP = number correct positive
TN = number correct negative
FP = number false positive
FN = number false negative
Sn = TP/AP
Sp = TP/PP
AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
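These measures are easy to compute once the four counts are known. The sketch below uses the slide's definitions of Sn and Sp and assumes that AC is the four-term approximate correlation (the mean of four conditional probabilities, rescaled to run from -1 to 1). The function name and the example counts are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) from nucleotide-level prediction counts."""
    sn = tp / (tp + fn)   # Sn = TP/AP, since AP = TP + FN
    sp = tp / (tp + fp)   # Sp = TP/PP, since PP = TP + FP
    # Average conditional probability over the four terms, assumed here
    # to be the intended basis for AC.
    acp = (tp / (tp + fn) + tp / (tp + fp)
           + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = (acp - 0.5) * 2
    return sn, sp, ac

# Made-up counts for a 1,020-base test region.
sn, sp, ac = accuracy(tp=80, tn=900, fp=20, fn=20)
print(round(sn, 3), round(sp, 3), round(ac, 3))
```

Note that Sp as defined on the slide is the fraction of predicted coding bases that are truly coding, which differs from the TN-based specificity used in other fields.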

Accuracy Levels
DNA Sequence Error Rate!??

NEXT
• Regulatory Sequences
  – Known Consensus Sequences
  – Consensus Sequence Generation
  – Functional (Lab) Data
• Real examples

Gene Regulatory Sequences
• Functional sites
  – Consensus
  – Experimental tests
• Inferred sites
  – Transcriptome analysis

Regulatory Sites
• Transcript initiation
• mRNA processing
• Translation sites

Regulatory Factors
• lacI, trpR, CAP, araC….
• GAL4, NDT80…

EUKARYOTES
• More complex signals
• More genes
• More dispersed signals
• Combinatoric regulation common

Enhancer Elements
• Octamer: OCT1, OCT2
• Name some…

Consensus Sequence Databases
• WWW-based
  – TFD (transcription factor database)
  – BCM Search Launcher

Transcriptome Analyses
• Microarray transcription analysis
• MEME analysis of clusters

Practical Gene Finding
• Use ALL tools
  – Comparative
    • BLASTN, BLASTX
  – Predictive: Stitch together a consensus
    • HMM, GRAIL…
    • ORF finders
    • Findpatterns (and WWW pattern searches)
• cDNA OR protein OR genetic evidence

FRAMES-aldolase gene

If aldolase is so tough, how do you really do it?
Combine DNA sequence with other data!

Genome-cDNA
Diagram labels: P; DNA sequencing; Align (GAP) cDNA; Infer Promoter, Enhancer; Test in cis

Comparative Genomics
• Conservation of coding regions
• Identification of transcription signals
  – “words” in common