Human Genome Sequence and Variability Gabor T Marth
Human Genome Sequence and Variability. Gabor T. Marth, D.Sc., Department of Biology, Boston College, marth@bc.edu. Medical Genomics Course – Debrecen, Hungary, May 2006
Lecture overview
1. Genome sequencing strategies, sequencing informatics
2. Genome annotation, functional and structural features in the human genome
3. Genome variability, DNA nucleotide, structural, and epigenetic variations
1. The Human genome sequence
The nuclear genome (chromosomes)
The genome sequence
• the primary template on which to outline functional features of our genetic code (genes, regulatory elements, secondary structure, tertiary structure, etc.)
Completed genomes: ~3,000 Mb, >100 Mb, ~1 Mb
Main genome sequencing strategies: clone-based shotgun sequencing (Human Genome Project); whole-genome shotgun sequencing (Celera Genomics, Inc.)
Hierarchical genome sequencing (Lander et al. Nature 2001): BAC library construction → clone mapping → shotgun subclone library construction → sequencing → sequence reconstruction (sequence assembly)
Clone mapping – “sequence ready” map
Hierarchical genome sequencing (Lander et al. Nature 2001): BAC library construction → clone mapping → shotgun subclone library construction → sequencing/read processing → sequence reconstruction (sequence assembly)
Shotgun subclone library construction (figure labels: BAC primary clone, cloning vector, subclone insert, sequencing vector)
Sequencing
Robotic automation Lander et al. Nature 2001
Base calling (PHRED): base = A, Q = 40
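To make the quality value on this slide concrete: a Phred-scaled quality Q relates to the estimated probability P that the base call is wrong by Q = -10 · log10(P). The sketch below only converts between the two; the function names are hypothetical and it does not reproduce the PHRED program itself.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Q = 40, as on the slide, corresponds to about 1 error in 10,000 calls.
print(phred_to_error_prob(40))
print(error_prob_to_phred(0.001))
```

Under this scale, every 10 points of Q is a tenfold drop in the estimated error rate.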
Vector clipping
Sequence assembly (PHRAP)
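The idea behind sequence reconstruction can be sketched with a toy greedy overlap merger. This is not PHRAP's algorithm: it assumes error-free reads from a single strand, ignores base qualities, and the function names and the minimum overlap of 3 bases are illustrative choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["AAACCCGG", "CCGGTTTA", "TTTAGGCA"]))
# prints ['AAACCCGGTTTAGGCA']
```

Because merging relies only on matching read ends, reads from different copies of a repeat can overlap equally well, which is why a merger this simple is easily misled.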
Repetitive DNA may confuse assembly
Sequence completion (finishing): gaps and regions of low sequence coverage and/or quality. Tools: CONSED, AUTOFINISH
2. Human genome annotation
Genome annotation – Goals: protein-coding genes, repetitive elements, GC content, RNA genes
The starting material AGCGTGGTAGCGCGAGTTTGCGAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTCTCTAGAGCCTCTCAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTCTCT AGAGCCTCTCAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
Coding genes – ab initio predictions. Sequence features: start codon, stop codon, polyA signal; Open Reading Frame = ORF. Example sequence: ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAA
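A minimal sketch of the ORF idea, applied to the example sequence on this slide. It assumes the standard start codon ATG and stop codons TAA, TAG and TGA, scans the forward strand only, and ignores introns; the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) half-open coordinates of forward-strand ORFs.

    In each reading frame an ORF runs from the first ATG after the previous
    stop codon through the next in-frame stop codon (stop included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

slide_seq = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAA"
print(find_orfs(slide_seq))  # prints [(0, 27)]
```

On the slide sequence this reports one ORF, ATG ... TAG, covering the first 27 bases.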
Ab initio predictions Gene structure
Ab initio predictions: splice donor site and splice acceptor site in …AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG…
Ab initio predictions: Genscan, Grail, Genie, GeneFinder, Glimmer, etc.; EST_genome, Sim4, Spidey, EXALIN
Homology-based predictions: known coding sequence from another organism (ACGGAAGTCT); expressed sequence (GGACTATAAA); ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAA; genes predicted by homology: Genomescan, Twinscan, etc.
Consolidation – gene prediction systems: Sim4, dbEST, Genewise, Grail, Genscan, FgenesH; Ensembl, Otto
ncRNA genes: prediction based on structure (e.g. tRNAs); for other novel ncRNAs, only homology-based predictions have been successful
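As a rough illustration of splice-site scanning, the sketch below looks for the canonical intron boundaries, GT at the donor end and AG at the acceptor end, in the fragment shown on this slide. It assumes exact dinucleotide matches are enough; real gene finders score sites with statistical models, and the function names and `min_len` value are hypothetical.

```python
def dinucleotide_positions(seq, motif):
    """0-based start positions of every occurrence of a 2-base motif."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]

def candidate_introns(seq, min_len=4):
    """Pair each GT (donor) with every downstream AG (acceptor).

    Returns half-open (start, end) intervals that begin with GT and end with AG.
    """
    seq = seq.upper()
    donors = dinucleotide_positions(seq, "GT")
    acceptors = dinucleotide_positions(seq, "AG")
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

slide_seq = "AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG"
print(candidate_introns(slide_seq))  # prints [(12, 28)]
```

On a genome-sized sequence this rule matches far too many positions, which is why predictors combine it with other evidence.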
Repeat annotations: repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library
The landscape of the human genome
Gene annotations – # of coding genes Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene length Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene function Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
GC content and coding potential Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
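GC content is simply the fraction of G and C bases in a stretch of sequence. A minimal sketch follows, assuming non-overlapping windows by default and skipping ambiguous bases such as N; the function names and the window size are illustrative, not taken from the cited analysis.

```python
def gc_content(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)

def windowed_gc(seq, window, step=None):
    """GC content per window as a list of (window_start, gc_fraction)."""
    step = step or window
    return [(s, gc_content(seq[s:s + window]))
            for s in range(0, len(seq) - window + 1, step)]

print(windowed_gc("GGCCAATT", window=4))  # prints [(0, 1.0), (4, 0.0)]
```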
ncRNAs Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Segmental duplications Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Repeat elements Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Genes and repeats
Physical vs. genetic map (Mb/cM): genetic distances 0.4 cM, 1.3 cM, 0.7 cM; physical distances 0.4 Mb, 0.7 Mb, 0.3 Mb
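The Mb/cM ratio can be worked through with the interval values on this slide. Pairing the three genetic distances with the three physical distances in the order listed is an assumption, since the figure that linked them is not reproduced here; the function name is hypothetical.

```python
# Interval values as listed on the slide; pairing them in order is an assumption.
genetic_cm = [0.4, 1.3, 0.7]
physical_mb = [0.4, 0.7, 0.3]

def mb_per_cm(physical, genetic):
    """Physical distance covered per unit of genetic distance, per interval."""
    return [mb / cm for mb, cm in zip(physical, genetic)]

for mb, cm, ratio in zip(physical_mb, genetic_cm, mb_per_cm(physical_mb, genetic_cm)):
    print(f"{mb} Mb / {cm} cM = {ratio:.2f} Mb/cM")
```

A lower Mb/cM value means more recombination per unit of physical distance in that interval.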
3. Human genome variability
DNA sequence variations
• any individual's genome is about 99.9% identical to the reference human genome sequence
• sequence variations make our genetic makeup unique
• the most abundant human variations are single-nucleotide polymorphisms (SNPs) – 10 million SNPs are currently known
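Two small calculations make these numbers concrete. The sketch assumes a genome of about 3,000 Mb and treats the 0.1% difference as single-base mismatches. The comparison function assumes two already-aligned sequences of equal length; all names are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3,000 Mb, approximate

def expected_differences(genome_size_bp, per_thousand=1):
    """Number of differing positions if 0.1% (1 in 1,000) of bases differ."""
    return genome_size_bp * per_thousand // 1000

def find_snps(reference, sample):
    """List (position, ref_base, sample_base) where two aligned sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

print(expected_differences(GENOME_SIZE_BP))  # prints 3000000
print(find_snps("ACGTACGT", "ACGAACGT"))     # prints [(3, 'T', 'A')]
```

So a 0.1% difference still amounts to roughly three million positions per genome.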
DNA sequence variations insertion-deletion (INDEL) polymorphisms
Structural variations Speicher & Carter, NRG 2005
Structural variations Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Detection of structural variants Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Epigenetic changes: chromatin structure Sproul, NRG 2005
Epigenetic changes: DNA methylation Laird, NRC 2003