VectorBase genome annotation
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
VectorBase SWG 2006

Overview of current annotation system
• Assembled genome
• Sequencing centre gene predictions
• VectorBase gene predictions
• Merge into canonical set
• Protein analysis
• Display on genome browser
• Release to GenBank/EMBL/DDBJ

Merging gene sets
• Gene set #1, Gene set #2
• Reduce to single predictions per locus
• Compare exon/intron structures
  – Identical structures
  – Compatible structures
  – Different structures
  – Merge/Split structures
  – Complex
  – No Map
• Add isoform predictions based on EST/Peptide data
• Canonical gene set

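The exon/intron comparison step can be illustrated with a minimal sketch. The category names come from the slide, but the criteria used here (identical exon coordinates, a shared intron chain, any overlap of genomic span) are assumptions made for illustration, and the function names are hypothetical. Merge/split and complex cases need a one-to-many comparison per locus and are not handled.

```python
from typing import List, Set, Tuple

# An exon is (start, end), 1-based inclusive, with start <= end.
Exon = Tuple[int, int]


def introns(exons: List[Exon]) -> Set[Tuple[int, int]]:
    """Return the set of introns implied by consecutive exons."""
    ex = sorted(exons)
    return {(ex[i][1] + 1, ex[i + 1][0] - 1) for i in range(len(ex) - 1)}


def span(exons: List[Exon]) -> Tuple[int, int]:
    """Return the genomic span (leftmost start, rightmost end)."""
    return min(s for s, _ in exons), max(e for _, e in exons)


def classify_pair(a: List[Exon], b: List[Exon]) -> str:
    """Classify two predictions on the same strand and sequence.

    Assumed criteria (not taken from the slide):
      no map     - the genomic spans do not overlap
      identical  - same exon coordinates
      compatible - same intron chain, differing only at the outer ends
      different  - overlapping spans but conflicting introns
    """
    a_start, a_end = span(a)
    b_start, b_end = span(b)
    if a_end < b_start or b_end < a_start:
        return "no map"
    if sorted(a) == sorted(b):
        return "identical"
    if introns(a) == introns(b):
        return "compatible"
    return "different"
```

For example, `classify_pair([(100, 200), (300, 400)], [(90, 200), (300, 420)])` yields `"compatible"` under these assumed rules, because both predictions share the single intron 201 to 299.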
Data types used for gene prediction/validation
• Protein sequences
  – ‘Self’ (i.e. species to be predicted)
  – Taxonomic splits of UniProtKB
• Transcript sequences
  – mRNAs
  – ESTs
• Evidence of expression
  – Microarray
  – SAGE tags
  – Ditags
  – MPSS
  – Proteomics data
• Sequence statistics
  – Coding potential
  – Splice site prediction

VectorBase gene prediction pipeline
• Blessed predictions
  – Manual annotations (Apollo)
  – Community submissions (Genewise, Exonerate, Apollo)
• Species-specific predictions (Genewise)
• ncRNA predictions (Rfam)
• Transcript-based predictions (Exonerate)
• Similarity predictions (Genewise)
• Protein family HMMs (Genewise)
• Ab initio gene predictions (SNAP)
• Canonical predictions

VectorBase curation database pipeline for manual/community annotation
• Manual annotation (Harvard)
• Apollo
• ChadoXML
• Community annotation (in collaboration with Harvard)
• Community annotation (Community representatives)
• Curation warehouse db
• Chado
• ChadoXML
• Apollo
• GFF3
• Ensembl
• Gene build db

Overview of current re-annotation system
• Full gene build
• Partial gene build
• Blessed genes
• New gene build
• Compare
• Species-specific gene prediction
• Current gene set
• Updated gene set
• Merge

Comparing new gene builds with the old one
• Use of manual annotation for validation of automated gene build improvements
• Simple statistics (CDS length, intron size, CDS matching TEs)
• BRC annotation metrics
  – Supporting evidence for a gene prediction (citation, expression, orthology)
  – Attachment of Standard Operating Procedures (SOPs)

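The simple statistics named on this slide could be computed along the following lines. This is a sketch only: the coordinate convention (1-based, inclusive), the function names, and the way TE overlap is measured (overlapping bases) are assumptions, not details given in the slides.

```python
from typing import List, Tuple

# A segment is (start, end), 1-based inclusive, with start <= end.
Segment = Tuple[int, int]


def cds_length(cds: List[Segment]) -> int:
    """Total number of coding bases across all CDS segments."""
    return sum(end - start + 1 for start, end in cds)


def intron_sizes(cds: List[Segment]) -> List[int]:
    """Sizes of the gaps between consecutive CDS segments."""
    segs = sorted(cds)
    return [segs[i + 1][0] - segs[i][1] - 1 for i in range(len(segs) - 1)]


def te_overlap(cds: List[Segment], tes: List[Segment]) -> int:
    """Number of CDS bases that fall inside transposable element intervals.

    Assumes the TE intervals do not overlap each other; overlapping TE
    intervals would be counted twice.
    """
    total = 0
    for start, end in cds:
        for te_start, te_end in tes:
            total += max(0, min(end, te_end) - max(start, te_start) + 1)
    return total
```

Comparing the distributions of these values between an old and a new gene build is one way to spot regressions, such as a sudden rise in very short CDSs or in CDS bases matching TEs.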
Gene build schedules
• Full gene build: 4 months
• Partial gene build: 1 month
Triggers for re-annotation
• Temporal
• Data
• New EST data for species
• New genomes
• Re-annotated genomes

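A check for the temporal and data triggers listed on this slide might look like the sketch below. The threshold values and all names are hypothetical; the slides do not state how old a build must be, or how much new EST data is needed, before a re-annotation is started.

```python
from datetime import date
from typing import List


def rebuild_triggers(
    last_build: date,
    today: date,
    new_est_count: int,
    max_age_days: int = 365,    # hypothetical temporal threshold
    min_new_ests: int = 10000,  # hypothetical data threshold
) -> List[str]:
    """Return the names of the triggers that currently fire."""
    reasons = []
    if (today - last_build).days >= max_age_days:
        reasons.append("temporal")
    if new_est_count >= min_new_ests:
        reasons.append("data")
    return reasons
```

An empty list means no trigger has fired; any other result names the reason for scheduling a build.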
VectorBase annotation capacity with increased number of genomes
Gene builds per year per genome

Genomes      2 full, 2 partial   1 full, 3 partial   1 full, 2 partial   1 full, 1 partial
2 genomes    Yes                 Yes
3 genomes    Yes                 Yes
4 genomes    No                  Yes                 Yes
5 genomes    No                  Yes                 Yes
6 genomes    No                  No                  Yes
7 genomes    No                  No                  No                  Yes
8 genomes    No                  No

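The workload behind such a capacity table can be worked out with simple arithmetic. The sketch below assumes that the 4-month and 1-month figures on the gene build schedules slide are the effort needed for one full and one partial build, and that effort simply adds up across genomes. The slides do not state the available capacity, so the code reports build-months only and derives no Yes/No answers. All names are hypothetical.

```python
from typing import Dict, List, Tuple

FULL_BUILD_MONTHS = 4     # from the gene build schedules slide
PARTIAL_BUILD_MONTHS = 1  # from the gene build schedules slide

# (full builds, partial builds) per year per genome, as in the table columns
SCHEDULES: List[Tuple[int, int]] = [(2, 2), (1, 3), (1, 2), (1, 1)]


def months_per_year(genomes: int, full: int, partial: int) -> int:
    """Total build-months per year for a number of genomes on one schedule."""
    per_genome = full * FULL_BUILD_MONTHS + partial * PARTIAL_BUILD_MONTHS
    return genomes * per_genome


def workload_table(min_genomes: int = 2, max_genomes: int = 8) -> Dict[int, List[int]]:
    """Build-months per year for each genome count and each schedule."""
    return {
        n: [months_per_year(n, full, partial) for full, partial in SCHEDULES]
        for n in range(min_genomes, max_genomes + 1)
    }
```

Under these assumptions, four genomes on the "2 full, 2 partial" schedule need 40 build-months per year, while seven genomes on the "1 full, 1 partial" schedule need 35.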
Re-annotation questions
• Triggers for re-annotation
  – Strict temporal triggers
    • Always do a full gene build every year?
  – Data triggers
    • How much new data is enough?
    • Knock-on effects of related species (re)annotation?
• Encouraging community submissions
  – How can we get more community annotation input?
    • Outreach at conferences (Roadshow)

- Slides: 11