CS 173 Lecture 3: Protein coding genes

MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu
http://cs173.stanford.edu [Bejerano. Winter 12/13]

Announcements
• http://cs173.stanford.edu/ is up
  – Course guidelines, lecture slides, etc.
• Communications via Piazza
  – Private Q: post to “instructors” not “class”
  – Auditors sign up too
  – Office hours TBA before HW1
• Project groups: TBD after “shopping season”
• Tutorials: first three Wednesdays
  – Recommended to bring your laptop to UCSC tutorial 1/16
• We will be recruiting for our lab from class
  – Many other labs on campus would love to have you too!

ATATTGAATTTTCAAAAATTCTTACTTTTTGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATCAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC ATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT GCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA 
CGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA ATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA TCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC GGACTTGAAGCCCGTCGAAAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA CTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG GCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC TTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT TGAAATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT GCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT AATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT TCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT AATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA CTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT ACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT

Central Dogma of Biology

Genomes, Genes & Proteins
The most visible instructions in our genome are genes. Genes explain exactly HOW to synthesize any protein. Proteins are the workhorses of every living cell.
Figure labels: gene, genome (...ACGTACGACTAGCATCGACTAGCAC...), protein, cell.

Gene Structure

Gene Processing
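The slide itself is a figure, so the following is only a minimal Python sketch of two processing steps, transcription and splicing. The function names, the toy sequence, and the 0-based half-open exon coordinates are assumptions made for illustration. Capping, polyadenylation, and strand handling are not modeled.

```python
def transcribe(dna: str) -> str:
    """Return the RNA copy of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons) -> str:
    """Join exon intervals (0-based, half-open) and drop the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy gene (hypothetical): two exons separated by one intron, written in lowercase.
gene = "ATGGCC" + "gtaagtttcag" + "AAATAA"
exons = [(0, 6), (17, 23)]

mature_mrna = splice(transcribe(gene), exons)
print(mature_mrna)
```

Keeping the exon coordinates separate from the sequence means the same pre-mRNA can be spliced in more than one way by passing a different exon list.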

Translation: The Genetic Code
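As a rough illustration of the genetic code, the sketch below builds the standard codon table and translates a coding sequence codon by codon. The names `CODON_TABLE` and `translate` and the example sequence are assumptions made for this sketch. It reads only the given frame and stops at the first stop codon.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order (standard genetic code); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a DNA coding sequence, stopping at the first in-frame stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # codons: ATG GCC AAA TAA
```

The table has 64 entries for 20 amino acids plus stop, so several codons map to the same amino acid.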

The gene centric genome
“The Genetic code”: a gene-centric term for a gene-centric world.
But fashions change, controlled by mass media, technology, money, and a bit of scientific truth.

Visualizing Gene Structure

Genes in the Human Genome
There are ~25,000 protein coding genes in the human genome. (Even halfway through sequencing the human genome, researchers thought there would be well over 100,000 genes.)

Everything in Genomics is a Moving Target
Why ~25,000?
• The genomes (i.e., assemblies)
• Their annotations
• Our understanding of Biology
• The portals
Conclusion: write code that can be run... and rerun

Gene Finding I: ab initio
Challenge: “Find the genes, the whole genes, and nothing but the genes”
• Understand Biology
• Write discovery tools
(Our) answer depends on our understanding, data & tools
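To make the ab initio idea concrete, here is a deliberately naive sketch: scan the forward strand for open reading frames, meaning an ATG followed by an in-frame stop codon. This is an assumption-laden toy and not how real gene finders work. It ignores introns, the reverse strand, and every statistical signal. The function name and the `min_codons` parameter are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return sorted (start, end) pairs of forward-strand ORFs; end is exclusive
    and includes the stop codon. min_codons counts codons before the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Even on this toy input the answer depends on the chosen rules (minimum length, which strand, which start codon), which is the point of the slide: the result reflects our understanding, data, and tools.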

Gene (Protein really) Functions
The most visible instructions in our genome are genes. Genes explain exactly HOW to synthesize any protein. Proteins are the workhorses of every living cell.
Just look at the cell. Lots and lots of different functions to perform. (“Only 20,000 genes”...)
Figure labels: gene, genome (...ACGTACGACTAGCATCGACTAGCAC...), protein, cell.

First full draft of the Human Genome
Consortium (HGC) and Celera, 2001

Biological Functions of the Human Gene Set
Focus on the X axis. [HGC, 2001]

Molecular Functions of the Human Gene Set [Celera, 2001]

Biological vs. Molecular Function: Pathways
Proteins with very different molecular functions participate to manifest a single biological function, for example a pathway.

“Special” Function: Gene Regulation
2,000 different proteins can bind specific DNA sequences.
Proteins that regulate the transcription of other genes are called transcription factors.
Figure labels: proteins, DNA, protein binding site, gene.

The Importance of Gene Regulation
The looks & capabilities of different cells are determined by the subset of genes they express. Different cell types express very different gene repertoires (from the same genome). To change its behavior, a cell can change its transcriptional program. Think of it as a giant state machine…
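The state machine analogy can be sketched in a few lines of Python. Here a cell's state is simply the set of genes it currently expresses, and transcription factors switch other genes on or off at each step. All gene names and rules below are hypothetical. Real regulation is quantitative and combinatorial, which this toy ignores.

```python
# Hypothetical regulatory rules: each transcription factor activates or represses targets.
ACTIVATES = {"TF_A": {"GENE_X", "TF_B"}}
REPRESSES = {"TF_B": {"TF_A"}}


def step(expressed):
    """Compute the next expression state from the set of currently expressed genes."""
    turned_on = set().union(*(ACTIVATES.get(g, set()) for g in expressed))
    turned_off = set().union(*(REPRESSES.get(g, set()) for g in expressed))
    return frozenset((set(expressed) | turned_on) - turned_off)


state = frozenset({"TF_A"})
for _ in range(3):
    print(sorted(state))
    state = step(state)
```

With these rules the cell starts by expressing only TF_A, passes through a transient state, and settles into a stable state in which TF_B keeps TF_A switched off.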

“Special” Function: Cell Signaling
Cells also talk with each other. They send and receive messages, and change their behavior according to messages they receive.

Signal Transduction
Now it's an even bigger state machine of individual state machines (=cells) talking with each other, orchestrating their individual activities.

Back to Genes & Their Functions
Gene (DNA) sequence determines protein (AA) sequence, which determines protein (3D) structure, which determines the protein’s function.

Protein Folding
Protein folding is the challenge of deducing protein structure from protein sequence. It’s a tough one…

Gene Families, Gene Names
Genes (proteins) come in families. Genes of the same family have similar sequences, which is why they fold into similar structures and perform similar functions.
Genes of the same family will typically have a “family name” followed by a (sequential) number or “first name”.

Alternative Splicing

Genes in the Human Genome
When you only show one transcript per gene locus:
If you ask the GUI to show you all well established gene variants:
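The difference between the two browser views can be mimicked on a small table of transcripts. The sketch below assumes hypothetical annotation rows of the form (gene id, transcript id, length). It is not the browser's actual data model, and picking the longest transcript as the representative is an arbitrary choice made for illustration.

```python
from collections import defaultdict

# Hypothetical annotation rows: (gene_id, transcript_id, transcript_length).
TRANSCRIPTS = [
    ("geneA", "geneA.1", 2100),
    ("geneA", "geneA.2", 3400),
    ("geneB", "geneB.1", 900),
]


def group_by_gene(rows):
    """Map each gene locus to all of its annotated transcripts."""
    by_gene = defaultdict(list)
    for gene_id, transcript_id, length in rows:
        by_gene[gene_id].append((transcript_id, length))
    return dict(by_gene)


def one_per_gene(rows):
    """Pick one representative transcript per locus (here: the longest one)."""
    return {gene: max(ts, key=lambda t: t[1])[0] for gene, ts in group_by_gene(rows).items()}


print(one_per_gene(TRANSCRIPTS))   # one transcript per gene locus
print(group_by_gene(TRANSCRIPTS))  # all annotated variants
```

Counting keys of `group_by_gene` gives the number of gene loci, while counting rows gives the number of transcripts, so the two views answer different questions.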

Protein Domains
SKSHSEAGSAFIQTQQLHAAMADTFLEHMCRLDIDSAPITARNTG
IICTIGPASRSVETLKEMIKSGMNVARMNFSHGTHEYHAETIKNV
RTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKK
GATLKITLDNAYMAACDENILWLDYKNICKVVEVGSKVYVDDGLI
SLQVKQKGPDFLVTEVENGGFLGSKKGVNLPGAAVDLPAVSEKDI
QDLKFGVDEDVDMVFASFIRKAADVHEVRKILGEKGKNIKIISKI
ENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMIIGR
CNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIML
SGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELARSSSHST
DLMEAMAMGSVEASYKCLAAALIVLTESGRSAHQVARYRPRAPII
AVTRNHQTARQAHLYRGIFPVVCKDPVQEAWAEDVDLRVNLAMNV
GKAAGFFKKGDVVIVLTGWRPGSGFTNTMRVVPVP
A protein domain is a subsequence of the protein that folds independently of the other portions of the sequence, and often confers one or more specific functions on the protein.

Alt. Splicing and Protein Repertoire
Alternative splicing often produces protein variants that have a different domain composition, and thus perform different functions.
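A toy model makes the link between exon choice and domain composition explicit. The sketch assumes, purely for illustration, that each exon encodes at most one domain. The exon numbers and domain names are hypothetical, and real domains often span several exons.

```python
# Hypothetical gene: each exon is assumed to encode at most one protein domain.
EXON_DOMAINS = {
    1: "signal peptide",
    2: "kinase domain",
    3: "DNA-binding domain",
    4: None,  # this exon encodes no recognizable domain
}


def domain_composition(exons_used):
    """List the domains encoded by the exons kept in one splice variant, in exon order."""
    return [EXON_DOMAINS[e] for e in sorted(exons_used) if EXON_DOMAINS[e] is not None]


full_length = domain_composition([1, 2, 3, 4])
exon3_skipped = domain_composition([1, 2, 4])
print(full_length)
print(exon3_skipped)
```

Skipping exon 3 yields a variant that lacks the DNA-binding domain, so the two proteins from the same gene would be expected to behave differently.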

Retroposed Genes and Pseudogenes
Pseudogenes (“dead genes”): genomic sequences that resemble (originated from) genes but no longer make proteins.
Retrogenes (“retrotranscribed”): protein coding RNA that was reverse transcribed and inserted back into the genome. The RNA can be grabbed at any stage (partial/full transcript, before/during/after all introns are spliced).

Gene Ontologies
1. Make a controlled vocabulary of gene functions.
2. Annotate all genes using this vocabulary.
Map: genes, papers, biological functions (plenty of room for Natural Language Processing).
Used to catalog human gene functions, and also which genes are expressed where, what defects have been found when certain genes are mutated, etc.
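The two steps above can be sketched as a small data structure: a fixed vocabulary of terms, plus a gene-to-terms map that only accepts terms from that vocabulary. The term ids, gene names, and function names here are hypothetical and are not taken from any real ontology.

```python
# Step 1 (hypothetical controlled vocabulary): term id -> human-readable function.
VOCABULARY = {
    "T1": "DNA binding",
    "T2": "kinase activity",
    "T3": "signal transduction",
}

# Step 2: gene -> set of vocabulary term ids.
annotations = {}


def annotate(gene, term_id):
    """Attach a vocabulary term to a gene; terms outside the vocabulary are rejected."""
    if term_id not in VOCABULARY:
        raise KeyError(f"unknown term: {term_id}")
    annotations.setdefault(gene, set()).add(term_id)


def genes_with(term_id):
    """Return every gene annotated with the given term."""
    return {gene for gene, terms in annotations.items() if term_id in terms}


annotate("geneA", "T1")
annotate("geneB", "T1")
annotate("geneB", "T2")
print(sorted(genes_with("T1")))
```

Rejecting unknown terms is what makes the vocabulary "controlled": two annotators cannot describe the same function with two different free-text labels.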

Review Lecture 3
• Central dogma recap
  – Focus on protein coding genes
• Gene structure
  – exon, intron, 3’/5’ UTR, CDS recap
  – The genetic code
  – UCSC genome browser sneak peek
  – human genome stats
  – Gene finding I: ab initio
• Gene (protein) function
  – Cell structure, chemical reactions etc.
  – Pathways (vs. function)
  – information processing roles
    • TFs
    • signaling: ligands, receptors, kinases
• Gene families
  – similar sequence -> structure -> function
  – protein domains
  – splice variants, alt promoters
• Special cases
  – Pseudogenes
  – Retroposed genes (and the distinction between the two)
• Gene ontologies