Lecture PowerPoint to accompany
Molecular Biology, Fourth Edition
Robert F. Weaver
Chapter 24: Genomics, Proteomics, and Bioinformatics
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

24.1 Positional Cloning
• Positional cloning is a method for discovering the genes involved in genetic traits
• Positional cloning was very difficult in the absence of genomic information
• It begins with mapping studies that pin down the location of the gene of interest to a relatively small region of DNA

Classical Tools of Positional Cloning
• Mapping depends on a set of landmarks to which gene position can be related
• Restriction fragment length polymorphisms (RFLPs) are landmarks: the lengths of the restriction fragments produced by a specific enzyme vary from one individual to another
• Exon traps use a special vector to help clone exons only
• CpG islands are DNA regions containing unmethylated CpG sequences

Detecting RFLPs

Exon Trapping

Identifying the Gene Mutated in a Human Disease
• Using RFLPs, geneticists mapped the Huntington disease gene (HD) to a region near the end of chromosome 4
• They then used an exon trap to identify the gene itself
• The mutation causing the disease is an expansion of a CAG repeat from the normal range of 11-34 copies to an abnormal range of at least 38 copies
• The extra repeats cause extra glutamines (Gln) to be inserted into huntingtin, the product of the HD gene
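
The repeat-count criterion above can be illustrated with a short Python sketch. It counts the longest uninterrupted run of CAG triplets in a sequence and compares it with the ranges quoted on the slide. The function names and the example sequence are made up for illustration and are not the real HD gene sequence.

```python
import re

def longest_cag_run(seq: str) -> int:
    """Return the largest number of consecutive CAG triplets found in seq."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify_repeat_count(n: int) -> str:
    # Thresholds taken from the slide: 11-34 copies normal, at least 38 abnormal.
    if n >= 38:
        return "expanded"
    if 11 <= n <= 34:
        return "normal"
    return "outside both ranges"

# Made-up test sequence with 40 CAG copies (not the real HD gene).
example = "ATG" + "CAG" * 40 + "CCGCCA"
count = longest_cag_run(example)
print(count, classify_repeat_count(count))
```

Counts of 35-37 fall between the two ranges given on the slide, so the sketch reports them separately rather than guessing.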

24.2 Sequencing Genomes
• What information can be gleaned from a genome sequence?
  – Location of the exact coding regions for all the genes
  – Spatial relationships among all the genes and exact distances between them
• How is a coding region recognized?
  – It contains an ORF long enough to code for a phage protein
  – The ORF must
    • Start with an ATG triplet
    • End with a stop codon
  – A phage or bacterial ORF is the same as a gene’s coding region
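
The ORF rule above (start with ATG, end with a stop codon, be long enough) can be sketched in Python. This is a minimal illustration, not a gene finder: it scans only the forward strand, assumes the three standard stop codons, and uses an arbitrary length cutoff (`min_codons`) chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)

def find_orfs(seq: str, min_codons: int = 30):
    """Return (start, end, frame) for forward-strand ORFs of the form ATG ... stop.

    min_codons is an illustrative cutoff, counted from the ATG up to
    (but not including) the stop codon. end is exclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs

# Toy sequence: ATG, five GCT codons, then TAA, in reading frame 2.
print(find_orfs("CCATGGCTGCTGCTGCTGCTTAAGG", min_codons=3))
```

A realistic search would also scan the reverse complement and, for eukaryotes, would have to cope with introns, which is why the slide limits the simple rule to phage and bacterial genes.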

Phage φX174 Genome
• The first genome sequenced was a very simple one, that of phage φX174
  – Completed by Sanger in 1977
  – 5375 nt in total
• Note that some of these phage genes overlap

Genome Results
• Base sequences have been obtained for viruses and organisms, including:
  – Phages
  – Bacteria
  – Animals
  – Plants
• A rough draft and finished versions of the human genome have also been obtained
• Comparison of the genomes of closely related and more distantly related organisms can shed light on the evolution of these species

Sequencing Milestones

The Human Genome Project
• In 1990, geneticists started to map and ultimately sequence the entire human genome
• The original plan was systematic and conservative
  – Prepare genetic and physical maps of the genome, with markers that allow DNA sequences to be pieced together in the proper order
  – Most sequencing would be done only after mapping was complete

1998: Human Genome Project
• Celera, a private, for-profit company, shocked the genomics community by announcing that it would complete a rough draft of the human genome by 2000
• The method would be shotgun sequencing: the whole human genome would be chopped up and cloned
  – Clones would be sequenced randomly
  – Sequences would be pieced together using computer programs

Vectors for Large-Scale Genome Projects
• Two high-capacity vectors have been used extensively in the Human Genome Project
  – Mapping was done mostly with the yeast artificial chromosome (YAC), which accepts inserts on the order of a million base pairs
  – Sequencing was done with bacterial artificial chromosomes (BACs), which accept about 300,000 bp
• BACs are more stable and easier to work with than YACs

Clone-by-Clone Strategy
• Mapping the human genome requires a set of landmarks to which we can relate the positions of genes
• Some of these markers are genes, but many more are nameless stretches of DNA
  – RFLPs
  – VNTRs (variable number tandem repeats)
  – STSs (sequence-tagged sites), including expressed sequence tags (ESTs) and microsatellites

Variable Number Tandem Repeats
• VNTRs derive from minisatellites, stretches of DNA that contain a short core sequence repeated over and over in tandem (head to tail)
• The number of repeats of the core sequence in a VNTR is likely to differ from one individual to another
  – So VNTRs are highly polymorphic
  – This makes them relatively easy to map
  – Their disadvantage as genetic markers is that they tend to bunch together at chromosome ends

Sequence-Tagged Sites
• STSs are short sequences
  – 60-1000 bp long
  – Detectable by PCR
• Short primers can be designed that
  – Hybridize a few hundred bp apart
  – Amplify a predictable length of DNA
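
The idea that a primer pair amplifies a predictable length of DNA can be shown with a small in-silico PCR sketch in Python. It assumes exact primer matches on a single template string; the function names, primers, and template are hypothetical examples.

```python
def reverse_complement(seq: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def sts_product_length(template: str, fwd_primer: str, rev_primer: str):
    """Return the expected PCR product length, or None if the STS is not found."""
    template = template.upper()
    start = template.find(fwd_primer.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the given strand
    # we look for its reverse complement downstream of the forward primer.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return end + len(rev_site) - start

# Toy template: 6-nt forward site, 20-nt spacer, 6-nt reverse site.
template = "TT" + "ACGTAC" + "G" * 20 + "CCTGAA" + "AT"
print(sts_product_length(template, "ACGTAC", "TTCAGG"))
```

A clone that yields a product of the expected length contains the STS, which is what makes STSs usable as physical map landmarks.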

Sequence-Tagged Sites Mapping

Microsatellites
• STSs are very useful in physical mapping, or locating specific sequences in the genome
  – They are worthless as markers in traditional genetic mapping unless they are polymorphic
• Microsatellites are a class of STSs that are highly polymorphic
  – Similar to minisatellites
  – Consist of a core sequence repeated over and over many times in a row
  – The core here is much shorter, only 2-4 bp long

Contig
• A set of clones used by geneticists in physically mapping or sequencing a given region is called a contig
  – Contains contiguous (or overlapping) DNAs spanning long distances
  – Assembling it is like putting together a jigsaw puzzle
  – Easier to complete with bigger pieces
  – Helpful when the pieces overlap

Shotgun Sequencing
Massive sequencing projects can take two forms:
1. The map-then-sequence strategy
  – Produces a physical map of the genome, including STSs
  – Sequences the clones (mostly BACs) used in mapping
  – Places the sequences in order so they can be pieced together
2. The shotgun approach
  – Assembles libraries of clones with different size inserts
  – Sequences the inserts at random
  – Relies on a computer program to find areas of overlap among the sequences and piece them together
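
The last step of the shotgun approach, finding overlaps and piecing reads together, can be sketched as a greedy merge in Python. This is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats); it is not the program used by any genome project, and the function names and reads are invented.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

When no further overlaps are found, the function returns more than one string, which corresponds to contigs separated by gaps.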

Shotgun-Sequencing Method

Sequencing Standards
• A “working draft” may be:
  – Only 90% complete
  – Subject to an error rate of up to 1%
• A “final draft” (on which there is less consensus) should have:
  – An error rate of less than 0.01%
  – As few gaps as possible
• Some researchers hold that a sequence is not a “final draft” until every last gap is filled

Sequencing the Human Genome
• The first chromosome completed in the Human Genome Project was chromosome 22, in late 1999
• The second completed was chromosome 21
• These are the 2 smallest human autosomes, but their sequences provide very valuable information

Chromosome 22
• Only the long arm (22q) was sequenced
• The short arm (22p) is composed of pure heterochromatin and is likely devoid of genes
• 11 gaps remained in the sequence
  – 10 are gaps between contigs, likely due to “unclonable” DNA
  – The other is a 1.5-kb region of cloned DNA that resisted sequencing

Findings from Chromosome 22
1. We must learn to live with gaps in our sequence
2. 679 annotated genes, categorized as:
  – 247 known genes, previously identified
  – 150 related genes, homologous to known genes
  – 148 predicted genes, with sequence homology to ESTs
  – 134 pseudogenes, sequences that are homologous to known genes but contain defects that preclude proper expression

Contigs and Gaps

More From Chromosome 22
3. Coding regions of genes account for only a tiny fraction of the length of the chromosome
  • Annotated genes are 39% of the total length
  • Exons are only 3%
  • Repeat sequences (Alu, LINEs, etc.) are 41%
4. The rate of recombination varies across the chromosome
  • Long regions of low recombination are interspersed with short regions where it is relatively frequent

Repetitive DNA Content

More From Chromosome 22
5. There are local and long-range duplications
  • Immunoglobulin λ locus
  • 36 gene segments that can encode variable regions are clustered together
  • A 60-kb region is duplicated with greater than 90% fidelity almost 12 Mb away
  • Duplications found in only a few copies are low-copy repeats
6. Large chunks of human chromosome 22q are conserved in several different mouse chromosomes
  • 113 human genes with mouse orthologs have been mapped to mouse chromosomes

Homologs
• Orthologs are homologous genes in different species that evolved from a common ancestor
  – The mouse orthologs of human 22q genes lie in 8 regions on 7 mouse chromosomes
• Paralogs are homologous genes that evolved by gene duplication within a species
• Homologs are any kind of homologous genes, both orthologs and paralogs

Regions of Conservation

Chromosome 21
• Human chromosome 21q and some of 21p have been sequenced
• The remaining gaps are relatively few and short
• The sequence reveals a relative poverty of genes
  – 225 genes
  – 59 pseudogenes
• All 24 genes known to be shared between mouse chromosome 10 and human chromosome 21 are in the same order in both chromosomes

The X Chromosome
• The sequence of 151 Mb of the human X chromosome (99.3% of its euchromatin) revealed 1098 protein-encoding genes
  – 168 genes governing X-linked phenotypes
  – Genes for 173 noncoding RNAs
• The chromosome is rich in LINE1 elements
  – These may serve as way stations for the X inactivation mechanism in female cells

X Chromosome Orthologs
• Comparison of the X chromosome sequence with the whole chicken genome confirmed that X (and its partner Y) evolved from an ancestral pair of autosomes
• Comparison of 3 mammalian X chromosome sequences demonstrates a high degree of synteny among these chromosomes
• This synteny likely reflects a high degree of evolutionary pressure to keep the order of genes on the X chromosome relatively stable

Human Genome Project Status
• The working draft of the human genome reported by 2 groups allowed estimates that the genome contains fewer genes than anticipated
  – 25,000 to 40,000
• About half the genome has derived from the action of transposons
• Transposons themselves have contributed dozens of genes to the genome
• Bacteria also have donated dozens of genes
• The finished draft is much more accurate than the working draft, but there are still gaps
• The finished sequence also provides information about gene birth and death during human evolution

Other Vertebrate Genomes
• Comparing the human genome with those of other vertebrates has taught us much about similarities and differences among genomes
  – Comparison has also helped to identify many human genes
  – In the future, it will likely help identify defective genes involved in human genetic diseases
• Closely related species like the mouse can be used to find when and where genes are expressed, and so to predict when and where the corresponding human genes are likely expressed

The Minimal Genome
• It is possible to define the essential gene set of a simple organism
  – Mutate one gene at a time
  – See which genes are required for life
• In theory, it is also possible to define the minimal genome, the minimum set of genes required for life
  – The minimal genome is likely larger than the essential gene set
• In principle, it is possible to place a minimal genome into a cell lacking genes of its own and so create a new life form that can live and reproduce under lab conditions

The Barcode of Life
• A movement has begun to create a barcode that identifies any species of life on earth
• The first such barcode will consist of the sequence of a 648-bp piece of the mitochondrial COI gene from each organism
• This sequence is sufficient to identify uniquely almost any organism
• Other sequences will be worked out for plants and perhaps later for bacteria
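
How a barcode sequence might be used for identification can be sketched in Python: compare a query barcode with reference barcodes and report the closest one. The sketch assumes the sequences are already aligned and of equal length, and the species names and 10-nt sequences are invented stand-ins for real 648-bp COI barcodes.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def best_match(query: str, references: dict) -> str:
    """Return the name of the reference barcode most similar to the query."""
    return max(references, key=lambda name: percent_identity(query, references[name]))

# Hypothetical reference barcodes (shortened to 10 nt for readability).
references = {"species_1": "ACGTACGTAC", "species_2": "ACGTTTTTAC"}
print(best_match("ACGTACGTAA", references))
```

In practice a similarity threshold would also be needed, so that an organism missing from the reference set is not assigned to its nearest neighbor.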

24.3 Applications of Genomics: Functional Genomics
• Functional genomics refers to those areas that deal with the function or expression of genomes
• All the transcripts an organism makes at any given time constitute its transcriptome
• The use of genomic information to block expression systematically is called genomic functional profiling
• The study of the structures and functions of the protein products of genomes is proteomics

Transcriptomics
• Transcriptomics is the study of all the transcripts an organism makes at any given time
• DNA microarrays and microchips hold thousands of cDNAs or oligonucleotides
  – Labeled RNAs from cells are hybridized to these arrays or chips
  – The intensity of hybridization at each spot reveals the extent of expression of the corresponding gene
• A microarray permits canvassing the expression patterns of many genes at once
• Clustering of the expression of genes in time and space suggests that the products of these genes collaborate in some process
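
One simple way to look for genes whose expression clusters together is to correlate their intensity profiles across conditions or time points. The Python sketch below uses the Pearson correlation with an arbitrary cutoff; the gene names, intensities, and threshold are all made up, and real analyses typically normalize the data and use dedicated clustering methods.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical hybridization intensities at four time points.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [2.0, 4.1, 7.9, 16.0],
    "geneC": [8.0, 4.0, 2.0, 1.0],
}

def coexpressed_pairs(profiles, threshold=0.95):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if pearson(profiles[a], profiles[b]) >= threshold]

print(coexpressed_pairs(profiles))
```

Here geneA and geneB rise together and are reported as a pair, while geneC falls over the same interval and is not grouped with either.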

Oligonucleotides on a Glass Substrate

Serial Analysis of Gene Expression
• Serial analysis of gene expression (SAGE) allows us to determine:
  – Which genes are expressed in a given tissue
  – The extent of that expression
• Short tags, characteristic of particular genes, are generated from cDNAs and ligated together between linkers
• These ligated tags are then sequenced to determine which genes are expressed and how abundantly
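
Once the ligated tags have been sequenced and split apart, the analysis reduces to counting tags per gene. A minimal Python sketch follows; the tag sequences, the tag-to-gene lookup table, and the function name are hypothetical, and the steps of extracting tags from the concatenated sequence are omitted.

```python
from collections import Counter

# Hypothetical lookup from tag sequence to gene; real tags are matched
# against sequence databases.
TAG_TO_GENE = {"GATTACAGGC": "geneA", "CCTGAAGTCA": "geneB"}

def sage_profile(tags):
    """Return the fraction of all tags assigned to each gene."""
    counts = Counter(TAG_TO_GENE.get(tag, "unassigned") for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in counts.items()}

tags = ["GATTACAGGC"] * 3 + ["CCTGAAGTCA"]
print(sage_profile(tags))
```

The presence of a gene's tag shows that the gene is expressed, and its share of the total tag count estimates how abundantly.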

SAGE

Whole Chromosome Transcription Mapping
• High-density whole-chromosome transcriptional mapping studies have shown that a majority of sequences in cytoplasmic poly(A) RNAs derive from non-exon regions of human chromosomes
• Almost half of the transcription from these same chromosomes is nonpolyadenylated
• The results indicate that the great majority of stable nuclear and cytoplasmic transcripts from these chromosomes come from regions outside exons
• This helps to explain the great differences between species whose exons are almost identical

Transcription Maps

Genomic Functional Profiling
• Genomic functional profiling can be performed in several ways
  – One is a type of mutation analysis, deletion analysis, in which mutants are created by replacing genes one at a time with an antibiotic resistance gene flanked by oligomers that serve as a barcode for that mutant
  – A functional profile can then be obtained by growing the whole group of mutants together under various conditions to see which mutants disappear most rapidly
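
The pooled-growth experiment above amounts to comparing each barcode's share of the pool before and after growth. The Python sketch below ranks mutants by how much of their share they lost; the barcode names and counts are invented, and how barcode abundance is measured is left out.

```python
def relative_abundance(counts: dict) -> dict:
    """Convert raw barcode counts into fractions of the pool."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}

def rank_by_depletion(before: dict, after: dict) -> list:
    """Return barcodes ordered from most depleted to least depleted."""
    fb, fa = relative_abundance(before), relative_abundance(after)
    ratios = {bc: fa.get(bc, 0.0) / fb[bc] for bc in fb if fb[bc] > 0}
    return sorted(ratios, key=ratios.get)

# Hypothetical barcode counts before and after growth under one condition.
before = {"bc1": 100, "bc2": 100, "bc3": 100}
after = {"bc1": 150, "bc2": 10, "bc3": 140}
print(rank_by_depletion(before, after))
```

The mutant at the head of the list disappears most rapidly, suggesting that its deleted gene matters under that growth condition.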

RNAi Analysis
• Genomic functional analysis of complex organisms can also be done by inactivating genes via RNAi
• An application of this approach targeting the genes involved in early embryogenesis in C. elegans has identified:
  – 661 important genes
  – 326 of these are involved in embryogenesis

Tissue-Specific Functional Profiling
• Tissue-specific expression profiling can be done by examining the spectrum of mRNAs whose levels are decreased by an exogenous miRNA
• That spectrum is then compared with the spectrum of gene expression at the mRNA level in various tissues
• If the miRNA decreases the levels of mRNAs that are naturally low in cells expressing the miRNA
  – This suggests that the miRNA is at least a partial cause of those natural low levels
• This type of analysis has implicated
  – miR-124 in destabilizing mRNAs in brain tissue
  – miR-1 in destabilizing mRNAs in muscle tissue

Locating Target Sites for Transcription Factors
• Chromatin immunoprecipitation followed by DNA microarray analysis can be used to identify DNA-binding sites for activators and other proteins
• For organisms with small genomes, all of the intergenic regions can be included in the microarray
• If the genome is large, that is not practical
• To narrow the areas of interest, CpG islands can be used
  – These are associated with gene control regions
  – If the timing or conditions of an activator’s activity are known, the control regions of genes known to be activated at those times, or under those conditions, can be used

In Situ Expression Analysis
• The mouse can be used as a human surrogate in large-scale expression studies that would be ethically impossible to perform on humans
• Scientists have studied the expression of almost all the mouse orthologs of the genes on human chromosome 21
  – Expression was followed through various stages of embryonic development
  – The embryonic tissues in which these genes are expressed were catalogued

Single-Nucleotide Polymorphisms
• Single-nucleotide polymorphisms (SNPs) can probably account for many genetic conditions caused by single genes, and even some caused by multiple genes
• SNPs might also make it possible to predict a person’s response to a drug
• A haplotype map with over 1 million SNPs makes it easier to sort out important SNPs from those with no effect

Structural Variation
• Structural variation is a prominent source of variation in human genomes
  – Insertions
  – Deletions
  – Inversions
  – Rearrangements of DNA chunks
• Some structural variation can in principle predispose certain people to contract diseases
  – Some variation is presumably benign
  – Some is also demonstrably beneficial

24.4 Proteomics
• The sum of all proteins produced by an organism is its proteome
• The study of these proteins, even of smaller subsets, is called proteomics
• Such studies give a more accurate picture of gene expression than transcriptomics studies do

Protein Separations and Analysis
• Current research in proteomics requires first that proteins be resolved, sometimes on a massive scale
  – The best tool for separating many proteins at once is 2D gel electrophoresis
• After separation, proteins must be identified
  – The best method of identification involves digesting the proteins one by one with proteases
  – The resulting peptides are then identified by mass spectrometry
• In the future, microchips with antibodies attached may allow analysis of proteins in complex mixtures without separation
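
The protease digestion step can be imitated in silico. The Python sketch below assumes trypsin as the protease (the slide does not name one) and applies the commonly used rule of cleaving after K or R unless the next residue is P; the protein sequence is invented.

```python
def tryptic_digest(protein: str) -> list:
    """Split a protein after K or R, except when the next residue is P (assumed trypsin rule)."""
    protein = protein.upper()
    peptides, current = [], ""
    for i, aa in enumerate(protein):
        current += aa
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)  # C-terminal peptide with no cleavage site
    return peptides

# Made-up sequence: cleaves after K2 and K10, but not at R6 (followed by P).
print(tryptic_digest("MKTAYRPLLKGG"))
```

The predicted peptides would then be compared with the peptide masses measured by mass spectrometry in order to identify the protein.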

MALDI-TOF Mass Spectrometry

Detecting Protein-Protein Interactions

Protein Interactions
• Most proteins work with other proteins to perform their functions
• Several techniques are available to probe these interactions
• Yeast two-hybrid analysis has been used for some time, and other methods are now available
  – Protein microarrays
  – Immunoaffinity chromatography with mass spectrometry
  – Other combinations

24.5 Bioinformatics
• Bioinformatics involves the building and use of biological databases
  – Some of these databases contain the DNA sequences of genomes
  – They are essential for mining the massive amounts of biological data for meaningful knowledge about gene structure and expression

Finding Regulatory Motifs in Mammalian Genomes
Using computational biology techniques, Lander and Kellis have discovered highly conserved sequence motifs in 4 mammalian species, including humans:
  – In the promoter regions, these motifs probably represent binding sites for transcription factors
  – In the 3’-UTRs, these motifs probably represent binding sites for miRNAs

Using the Databases
• The National Center for Biotechnology Information (NCBI) website contains a vast store of biological information, including genomic and proteomic data
• Start with a sequence and discover the gene to which it belongs, then compare that sequence with those of similar genes
• Query the database with a topic for information
• View the structure of a protein in 3D by rotating it on your computer screen