Gencode Mar '10 Meeting: Pseudogene Project Update

  • Slides: 22

Gencode Mar '10 Meeting: Pseudogene Project Update
Illustration from Gerstein & Zheng (2006), Sci Am.
Do not reproduce without permission. Lectures.GersteinLab.org (c) '09 Mark Gerstein

Overall Flow: Pipeline Runs, Coherent Sets, Annotation, Transfer to Sanger

Overall Approach
1. Overall pipeline runs at Yale and UCSC, yielding raw pseudogenes
2. Extraction of coherent subsets for further analysis and annotation
3. Passing to Sanger for detailed manual analysis and curation
4. Incorporation into final GENCODE annotation
5. Pipeline modification

Chronology of Sets
1. Encode Pilot 1%
2. Ribosomal Protein pseudogenes
3. Unitary pseudogenes (Hard)
4. Glycolytic Pseudogenes
5. Polymorphic Pseudogenes
6. Pseudogenes Associated with SDs

Specific Pseudogene Assignments: Glycolytic Pseudogenes (completed)

Number of pseudogenes for each glycolytic enzyme [Liu et al. BMC Genomics ('09)]
Large numbers of processed GAPDH pseudogenes in mammals comprise one of the biggest families, but the numbers are not obviously correlated with mRNA abundance.
GAPDH (Processed/Duplicated): 60 Proc / 2 Dup

Distribution of human GAPDH pseudogenes
Large numbers of processed GAPDH pseudogenes in mammals comprise one of the biggest families, but the numbers are not obviously correlated with mRNA abundance. [Liu et al. BMC Genomics ('09, in press)]
60 Proc / 2 Dup

Approximate Age of GAPDH pseudogenes
Age calculated based on the Kimura two-parameter model of nucleotide substitution [Liu et al. BMC Genomics ('09)]
Burst of Retrotranspositional Activity
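The Kimura two-parameter distance named on this slide can be sketched as follows. This is a minimal illustration, not the code used by Liu et al.: the function name `k2p_distance` and the choice to skip gapped or ambiguous columns are assumptions made here. It takes two pre-aligned sequences and applies the standard formula K = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance (substitutions per site).

    Both inputs must already be aligned and of equal length.
    Columns containing anything other than A, C, G, T are skipped
    (an assumption of this sketch, not something stated on the slide).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    arg1 = 1.0 - 2.0 * p - q
    arg2 = 1.0 - 2.0 * q
    if arg1 <= 0 or arg2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)


# Toy aligned pair: one transition (A->G) in ten sites.
toy_distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

Turning such a distance into an age requires a substitution rate, which the slide does not state, so any conversion would need that rate supplied as an explicit assumption.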

Synteny of GAPDH pseudogenes
Synteny derived based on local gene orthology [Liu et al. BMC Genomics ('09)]

Specific Pseudogene Assignments: Unitary Pseudogenes (completed)

Pseudogenes
Pseudogenes: nongenic DNA segments with high sequence similarity to functional genes
• Duplicated pseudogenes: duplication
• Processed pseudogenes: transcription, then transposition
• Unitary pseudogenes: in situ pseudogenization
Unitary pseudogenes: unprocessed pseudogenes with no functional counterparts
zdz © mmix

Identification pipeline
• ~16 k human-mouse orthologs
• ~23 k mouse proteins
• ~6 k mouse proteins without human orthologs
• HG
• ~600 candidate human unitary pseudogene loci
• 76 human unitary pseudogenes
[Zhang et al. Genome Biology (in press, '10)]
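The first filtering step of the pipeline above, keeping mouse proteins that have no human ortholog, can be sketched as a set difference. This is only an illustration under assumptions: the function name, the `(human_id, mouse_id)` pair format and the toy identifiers are hypothetical, and the later steps that narrow the candidates down to 76 unitary pseudogenes are not described on the slide and are not modelled here.

```python
from typing import Iterable, List, Tuple


def mouse_proteins_without_human_ortholog(
    mouse_proteins: Iterable[str],
    human_mouse_orthologs: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return mouse protein IDs that appear in no (human_id, mouse_id) pair."""
    mouse_with_ortholog = {mouse_id for _human_id, mouse_id in human_mouse_orthologs}
    return sorted(set(mouse_proteins) - mouse_with_ortholog)


# Hypothetical toy identifiers, for illustration only.
toy_mouse = ["mP1", "mP2", "mP3", "mP4"]
toy_orthologs = [("hP1", "mP1"), ("hP3", "mP3")]
toy_candidates = mouse_proteins_without_human_ortholog(toy_mouse, toy_orthologs)
```

On the toy input, the proteins `mP2` and `mP4` remain as starting points for the search for candidate loci.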

Relativity of unitary pseudogenes [Zhang et al. Genome Biology (in press, '10)]

Unitary Pseudogene Families

Dating the pseudogenization events

Specific Pseudogene Assignments: Polymorphic Pseudogenes (in process)

11 Polymorphic Pseudogenes

Polymorphic pseudogenes (3 with allele frequency data)
3 SNPs not found to be under recent positive selection [Zhang et al. Genome Biology (in press, '10)]

Fst hierarchical clustering for rs4940595 in SERPINB11
... but population structure at rs4940595 (the difference in allelic frequencies between populations) could be the result of different selective regimes acting on the same allele in different population subdivisions.
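The Fst statistic behind this slide can be illustrated for a single biallelic SNP. The sketch below uses the textbook heterozygosity form Fst = (H_T - H_S) / H_T with equal weighting of populations; the estimator actually used for rs4940595 is not stated on the slide, and the function names, population labels and allele frequencies here are hypothetical. Pairwise values of this kind are what a hierarchical clustering would take as input.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def fst_biallelic(freqs: Sequence[float]) -> float:
    """Fst = (H_T - H_S) / H_T for one biallelic SNP.

    freqs holds the frequency of one allele in each population;
    populations are weighted equally (an assumption of this sketch).
    """
    if not freqs:
        raise ValueError("need at least one population")
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # allele fixed or absent everywhere: no structure to measure
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def pairwise_fst(pop_freqs: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Fst for every pair of populations, keyed by the sorted name pair."""
    return {
        (a, b): fst_biallelic([pop_freqs[a], pop_freqs[b]])
        for a, b in combinations(sorted(pop_freqs), 2)
    }


# Hypothetical allele frequencies, for illustration only.
toy_freqs = {"pop_A": 0.10, "pop_B": 0.15, "pop_C": 0.80}
toy_matrix = pairwise_fst(toy_freqs)
```

In the toy data, the pairs involving `pop_C` have much larger Fst than the `pop_A`/`pop_B` pair, which is the kind of pattern the slide describes as population structure.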

Specific Pseudogene Assignments: SD-associated Pseudogenes (in process)

Segmental duplications (SDs)
• Regions of the genome with ≥90% sequence identity and ≥1 kb in length
• Based on neutral divergence, correspond to the last ~40 million years of human evolution
• Comprise ~5-6% of the human genome
• Enriched with genes (~18%) and pseudogenes (duplicated ~45%, processed ~22%)
Can the study of ψgenes in SDs provide information not obvious from the individual datasets?
Bailey et al., Science, 2002
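The link between the 90% identity threshold and a time depth of roughly 40 million years can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only: it uses uncorrected divergence, assumes both copies accumulate substitutions independently (hence the factor of 2), and the rate of 1.25e-9 substitutions per site per year is a value chosen here for illustration, not one given on the slide or attributed to Bailey et al. The function names and default thresholds are likewise assumptions.

```python
def is_segmental_duplication(
    identity: float,
    length_bp: int,
    min_identity: float = 0.90,
    min_length_bp: int = 1000,
) -> bool:
    """Apply the identity and length thresholds listed on the slide."""
    return identity >= min_identity and length_bp >= min_length_bp


def divergence_time_years(identity: float, rate_per_site_per_year: float) -> float:
    """Rough time since duplication from pairwise identity.

    Uses t = d / (2 * rate) with d = 1 - identity (uncorrected divergence).
    The rate is supplied by the caller and is an assumption.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    if rate_per_site_per_year <= 0.0:
        raise ValueError("rate must be positive")
    divergence = 1.0 - identity
    return divergence / (2.0 * rate_per_site_per_year)


# Assumed neutral rate, for illustration only.
ASSUMED_RATE = 1.25e-9
oldest_sd_age = divergence_time_years(0.90, ASSUMED_RATE)
```

With these assumptions, a pair of copies at the 90% identity threshold comes out at about 4e7 years; a different assumed rate would shift the figure proportionally.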

Nucleotide substitutions in ψgenes and SDs containing them
[Figure labels: Parent gene; Duplicated ψgene]
K2m: nucleotide substitutions per site, computed using Kimura's two-parameter model
Most ψgenes show the same number of substitutions as the larger SD region containing them:
• Duplication accompanied by disablement
• Followed by neutral rate of evolution

Acknowledgements
Z Zhang, E Khurana, YJ Liu, YK Lam, S Balasubramanian, G Fang, N Carriero, R Robilotto, P Cayting, M Wilson, A Frankish, M Diekhans, R Harte, T Hubbard, J Harrow
Pseudogene.org