VectorBase annotation metrics
Daniel Lawson
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
VectorBase BRC 4 2006
Topics
• Annotation metrics
  – Numbers (gene numbers & xrefs)
  – Data types (availability & integration)
• Annotation SOPs
  – Genome specific
  – Gene build profile & prediction confidence
| Metric | AaegL1.1 | AgamP3.3 | Yeast | Worm | Fly | Human |
|---|---|---|---|---|---|---|
| Gene count | 16,691 | 13,765 | 7,098 | 21,105 | 14,752 | 31,206 |
| Gene count: protein-coding | 15,419 (92.4%) | 13,277 (96.5%) | 6,680 | 20,060 | 14,086 | 23,245 |
| Gene count: other | 1,272 (7.6%) | 488 (3.5%) | 418 | 1,045 | 666 | 7,961 |
| Transcript count | 18,061 | 14,127 | - | - | - | - |
| Transcript count: protein-coding | 16,789 (93.0%) | 13,639 (96.5%) | - | - | - | - |
| Transcript count: other | 1,272 (7.0%) | 488 (3.5%) | - | - | - | - |
| Manual effort: manually reviewed | 0 (0.0%) | 261 (1.9%) | 6,680 | 20,060 | 14,086 | 6,995 |
| Manual effort: community input | 0 (0.0%) | 667 (4.9%) | 4,684 | 7,228 | 9,945 | 16,887 |
| Orthologs: combined | 11,487 (74.5%) | 9,782 (73.7%) | - | - | - | - |
| Orthologs: A. aegypti | n/a | 8,907 (67.1%) | 2,202 | 4,416 | 7,991 | 6,590 |
| Orthologs: A. gambiae | 9,923 (54.9%) | n/a | 2,228 | 4,444 | 7,702 | 6,612 |
| Orthologs: C. elegans | 4,923 (29.5%) | 4,442 (33.4%) | 2,185 | n/a | 4,598 | 6,121 |
| Orthologs: D. melanogaster | 9,078 (50.3%) | 7,649 (57.6%) | 2,228 | 4,543 | n/a | 6,654 |
| Orthologs: H. sapiens | 5,510 (33.0%) | 5,046 (38.0%) | 2,326 | 4,473 | 5,109 | n/a |
| Orthologs: S. cerevisiae | 2,520 (15.1%) | 2,350 (17.7%) | n/a | 2,349 | 2,470 | 3,265 |
| Functional annotation: GO terms | 9,335 (51.7%) | 7,601 (55.7%) | 4,176 | 11,334 | 10,226 | 17,000 |
| Functional annotation: EC numbers | 2,950 (16.3%) | 2,230 (16.4%) | 4,103* | 5,240* | 4,009* | 13,245* |
| Functional annotation: InterPro | 11,536 (74.8%) | 9,869 (72.4%) | 4,611 | 14,730 | 10,475 | 18,199 |
| Expression evidence: combined | 12,350 (80.0%) | 7,557 (55.4%) | - | - | - | - |
| Expression evidence: cDNA/EST | 9,270 (60.1%) | 7,557 (55.4%) | - | - | - | - |
| Expression evidence: microarray | 9,143 (59.2%)† | 0 (0.0%)‡ | - | - | - | - |
| Expression evidence: MPSS | 3,984 (25.8%)† | n/a | - | - | - | - |
Considerations
• Importance of calculating all metrics using similar methodology from the same data set
• Metrics calculated from Ensembl using BioMart & raw SQL queries
• GO terms: many ways of calculating (InterPro2GO, projection from Drosophila orthologs)
• No VectorBase capability to automatically assign EC numbers
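As a rough illustration of the raw SQL approach mentioned above, the sketch below counts genes by type with a single grouped query. It runs against an in-memory SQLite table rather than a real Ensembl database. The table and column names (`gene`, `biotype`) and the rows are invented for this example and may not match the schema used for the slides.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory table standing in for a gene table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT)")
    # Invented rows: five protein-coding genes and two others.
    rows = [("protein_coding",)] * 5 + [("other",)] * 2
    conn.executemany("INSERT INTO gene (biotype) VALUES (?)", rows)
    return conn


def gene_counts_by_biotype(conn):
    """Return {biotype: count} from one GROUP BY query."""
    cursor = conn.execute("SELECT biotype, COUNT(*) FROM gene GROUP BY biotype")
    return dict(cursor.fetchall())


counts = gene_counts_by_biotype(build_toy_db())
total = sum(counts.values())
for biotype, n in sorted(counts.items()):
    # Percentages use the total gene count from the same query as denominator.
    print(f"{biotype}: {n} ({100 * n / total:.1f}%)")
```

Deriving the counts and the denominator from the same query is one simple way to keep every percentage tied to a single data set.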
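Data types for AaegL1.1 and AgamP3.3 (availability & integration)

| Data type | Available | Integration |
|---|---|---|
| Sequence | Yes | Download, search, visualization |
| Polymorphisms | AaegL1.1: No; AgamP3.3: Yes | AaegL1.1: n/a; AgamP3.3: Search, visualization |
| Genetic maps | AaegL1.1: Yes; AgamP3.3: Yes | AaegL1.1: Not integrated; AgamP3.3: Visualization |
| Syntenic alignment | Yes | Visualization |
| cDNAs & ESTs | Yes | Download, search, visualization |
| SAGE tags | No | n/a |
| Microarrays | Yes | Visualization |
| MPSS | AaegL1.1: Yes; AgamP3.3: No | AaegL1.1: Not integrated; AgamP3.3: n/a |
| Proteomics | No | n/a |
| Structures | No | n/a |
| Interactome data | No | n/a |
| Pathways | No | n/a |
| Orthology profiles | Yes | Visualization |
| Essentiality data | No | n/a |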
VectorBase gene prediction pipeline (SOP): VB:SOP010
• Blessed predictions (manual annotations, community submissions): VB:SOP007
• Similarity predictions: VB:SOP002 & SOP003
• Species-specific predictions: VB:SOP001
• ncRNA predictions: VB:SOP008
• Transcript based predictions: VB:SOP004
• Protein family HMMs: VB:SOP009
• Ab initio gene predictions: VB:SOP005
• Output: canonical gene set
Assignment of SOPs to VectorBase genes: AgamP3.3

| SOP | Description | No. genes |
|---|---|---|
| VB:SOP001 | Confirmed | 674 |
| VB:SOP002 | Protein-based with transcript support | 3765 |
| VB:SOP003 | Protein-based | 4830 |
| VB:SOP004 | Transcript-based | 2857 |
| VB:SOP005 | Supported ab initio | 585 |
| VB:SOP006 | ab initio | 0 |
| VB:SOP007 | Manual annotation | 928 |
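The per-SOP counts above can be summed and turned into shares with a few lines of Python. This is only a sketch: the variable names are invented here, and the slide itself reports counts, not percentages.

```python
# Counts copied from the AgamP3.3 SOP assignment slide.
sop_gene_counts = {
    "VB:SOP001": 674,   # Confirmed
    "VB:SOP002": 3765,  # Protein-based with transcript support
    "VB:SOP003": 4830,  # Protein-based
    "VB:SOP004": 2857,  # Transcript-based
    "VB:SOP005": 585,   # Supported ab initio
    "VB:SOP006": 0,     # ab initio
    "VB:SOP007": 928,   # Manual annotation
}

total = sum(sop_gene_counts.values())

# Share of each SOP, as a percentage of the total, rounded to one decimal place.
shares = {sop: round(100 * n / total, 1) for sop, n in sop_gene_counts.items()}

print(f"total: {total}")
for sop, share in shares.items():
    print(f"{sop}: {sop_gene_counts[sop]} ({share}%)")
```

The total is 13,639, the same figure given for AgamP3.3 protein-coding transcripts in the metrics table.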
Display of Metrics & SOPs
• Metrics
  – VectorBase wiki
  – Species-page containing the three tables available from the VectorBase species homepage
  – Expansion of documents relating to genomic resources (citations, links to primary data where possible)
  – Single collated table for BRC as separate download
• SOPs
  – VectorBase wiki
  – ‘Documents’ section of main site
Manual annotation progress

| Genome | Protein-coding gene No. | | VectorBase manual | Community submission |
|---|---|---|---|---|
| Anopheles gambiae AgamP3.3 | 13,277 | current | 261 (2.0%) | 667 (5.0%) |
| | | | 2474 (18.6%) | 667* (5.0%) |
| Aedes aegypti AaegL1.1 | 15,419 | current | 0 (0.0%) | 341 (2.2%) |
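The AgamP3.3 counts 261 and 667 appear here as 2.0% and 5.0%, while the metrics table lists them as 1.9% and 4.9%. The sketch below shows how the choice of denominator changes the result. The slides do not state which denominator each table used, so the three candidates are assumptions inferred from the arithmetic.

```python
def pct(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Candidate denominators for AgamP3.3, all taken from the slides.
denominators = {
    "gene count": 13765,
    "protein-coding genes": 13277,
    "protein-coding transcripts": 13639,
}

for label, total in denominators.items():
    manual = pct(261, total)     # VectorBase manual annotations
    community = pct(667, total)  # community submissions
    print(f"{label}: manual {manual}%, community {community}%")
```

Stating the denominator alongside each percentage is one way to follow the "same data set" consideration raised earlier in the talk.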
Merging gene sets
• Input: Gene set #1, Gene set #2
• Reduce to single predictions per locus
• Compare exon/intron structures
  – Identical structures
  – Compatible structures
  – Different structures
  – Merge/Split structures
  – Complex
  – No Map
• Add isoform predictions based on EST/Peptide data
• Output: canonical gene set
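The comparison step can be sketched in Python. The slide names the categories but does not define them, so the definitions below are assumptions made for illustration:
• "compatible" means the two predictions agree on every intron inside their shared span
• "merge/split" means a clean one-to-many or many-to-one overlap
• "complex" means any other multi-way overlap

Each prediction is a sorted list of (start, end) exons, one per locus. Strand and the later isoform step are ignored, and all names and coordinates are invented.

```python
def overlaps(a, b):
    """True if the genomic spans of two exon lists overlap."""
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]


def introns(exons):
    """Set of (exon_end, next_exon_start) pairs."""
    return {(left[1], right[0]) for left, right in zip(exons, exons[1:])}


def compare_structures(a, b):
    """Classify a one-to-one pair (assumed definitions, see lead-in)."""
    if a == b:
        return "identical"
    lo, hi = max(a[0][0], b[0][0]), min(a[-1][1], b[-1][1])

    def shared(exons):
        # Keep only introns lying fully inside the span both predictions cover.
        return {(s, e) for s, e in introns(exons) if lo <= s and e <= hi}

    return "compatible" if shared(a) == shared(b) else "different"


def classify_loci(set1, set2):
    """Label every gene in set1 by how it maps onto set2."""
    hits1 = {g: [h for h in set2 if overlaps(set1[g], set2[h])] for g in set1}
    hits2 = {h: [g for g in set1 if overlaps(set1[g], set2[h])] for h in set2}
    labels = {}
    for g, partners in hits1.items():
        if not partners:
            labels[g] = "no map"
        elif len(partners) == 1:
            back = hits2[partners[0]]
            if back == [g]:
                labels[g] = compare_structures(set1[g], set2[partners[0]])
            elif all(len(hits1[other]) == 1 for other in back):
                labels[g] = "merge/split"  # several set-1 genes, one set-2 gene
            else:
                labels[g] = "complex"
        elif all(hits2[h] == [g] for h in partners):
            labels[g] = "merge/split"  # one set-1 gene, several set-2 genes
        else:
            labels[g] = "complex"
    return labels


gene_set_1 = {
    "A1": [(100, 200), (300, 400)],
    "A2": [(1000, 1100), (1200, 1300)],
    "A3": [(2000, 2500)],
    "A4": [(3000, 3900)],
    "A5": [(5000, 5100)],
}
gene_set_2 = {
    "B1": [(100, 200), (300, 400)],
    "B2": [(1000, 1100), (1200, 1300), (1400, 1500)],
    "B3": [(2000, 2200), (2300, 2500)],
    "B4a": [(3000, 3400)],
    "B4b": [(3500, 3900)],
}

for gene, label in classify_loci(gene_set_1, gene_set_2).items():
    print(gene, label)
```

Only genes from the first set are labelled here. A fuller version would also report set-2 genes with no partner and would treat each strand separately.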