CDD a conserved domain database Aron MarchlerBauer NCBI

CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH DIMACS Workshop on Protein Domains: Identification, Classification and Evolution February 27 -28, 2003

CDD: a collection of domain multiple alignments linked to protein 3 D structure • imported alignment models mirrored ‘as-is’, sources are Pfam, Smart, COGs (close to 10, 000) • curated alignment models (about 300) • part of NCBI’s Entrez query/retrieval system • RPS-Blast to search PSSMs derived from alignment models

Entrez with CDD … MEDLINE Abstracts Term Frequency Statistics BLAST Sequence Similarity Protein Sequences 3 D Structures VAST Structure Similarity Domain Architecture Similarity CD-protein Links Protein Conserved Sequences Domains Related Conserved Domains

Conserved Domains as part of Entrez to …. . annotate three-dimensional structures

Conserved Domains as part of Entrez to …. . annotate protein sequences

Conserved Domains as part of Entrez to …. . neighbor proteins by domain architecture Currently (CDD v 1. 60): ~5 Mio protein-CD links

RPS-Blast (Reverse Position-Specific Blast) (Psi)-Blast RPS-Blast Search query (and its PSSM) against Search query protein sequence database against a database of PSSMs Lookup table holds possible word matches to query, database sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments. Lookup table holds possible word matches to database PSSMs, query sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments. How does it compare?

The effect of the search heuristics can be measured directly against IMPALA, a similar program using the rigorous Smith-Waterman algorithm. Test set: Smart v 3. 3, 569 Domain Families / Alignments / PSSMs 23736 protein sequences used in alignments 14100 protein sequences from the initial Drosophila genome set.

The effect of the search heuristics and the differences in alignment model encoding can be measured against HMMer Test set: Smart v 3. 3, 569 Domain Families / Alignments / PSSMs 23736 protein sequences used in alignments 14100 protein sequences from the initial Drosophila genome set.

RPS-Blast vs. IMPALA, Speed vs. Sensitivity

Self-recognition: Fraction of sequence fragments used to build up the alignment model, which yield significant scores when compared with the search model. Information content: sum Sp • log(p/q) across aligned columns • The average alignment information content for 568 models used in the test is 240 bits. • for 26 families – about 5% - self-recognition works better with IMPALA (detectable heuristics effects). • the average alignment information content for these 26 models is 100 bits. • for 542 models – about 95% - we did not detect heuristics effects in a self-recognition test.

• In 65 families (~11%) more than 5% difference in self recognition between HMMer and RPS-Blast • Their mean information content is 65 bits • In 503 families (89%) less than 5% difference in self recognition.

Conclusions: • differences, maybe not too surprising • affecting a fairly small subset of the models at the lower end of the ‘informativeness’ spectrum • can optimize PSSM calculation, but might see diminishing returns • it may be more effective to deal with scope of models Need to do something about the model collection … curation of alignment models

Recording conserved features in CDD …

Conserved Features in CDs: • catalytic, binding, interaction- and regulatory sites • explain observed patterns of sequence conservation • annotate if applicable to all aligned members • annotate if evidence is available (3 D structure, citation)

Collection has become redundant: Search results for 2 SRC (Tyrosine Kinase) Right now: about 9400 CD-CD links in Entrez

Collection has become redundant: Search results for 1 G 291 (Malk) Many ATP-ase domains are sequence-similar to each other, and possibly related by descent from a common ancestor How to explain this redundancy?

Curation: • literature check • examination of the conserved domain extent • examination of the multiple alignment, identification of a core substructure, establishment of a block-based alignment in agreement with 3 D-structure data • Feature annotation and recording of evidence • Investigation of ‘related’ domains and their apparent relationships, resolving and recording the family hierarchy • Update of CD alignment models with new sequences and 3 D-structure data

Curation needs to deal with: • noise from sequence data (gene models, annotation) • noise from alignments / alignment methods

Block alignment model and family hierarchies: Parent alignment Children: • Membership consistency • Alignment consistency

Rizzi and Schindelin, Curr. Opin. Struct. Biol. 2002, 12: 709 -720

. . sequences used in the alignment hit a variety of models in CD-Search:

… examine domain architectures as recorded in CDART:

… validate sequences, validate alignment block structure, and examine sequence tree:

Pfam PF 0994 Mo. CF_biosynth Moe. A_N Moe. A_C Moa. B COGs Cin. A Moe. A cd 00758 CDD cd 00758_a (Moa. B) cd 00758_c (Cin. A) cd 00758_b (Moe. A)

Concept borrowed from COGs – pattern of phylogenetic distribution as evidence for functional divergence after gene duplication events

Principles for establishing CD-Hierarchies: • Economy – too many families slow down search system • Search performance – flat alignment models must be split • Domain age – we’re primarily interested in sets of ancient conserved domains • Domain architectures • Subgroup-specific features 3. 5 bio 2. 6 bio 1. 7 bio Plants Animals Archaea Alpha-proteobact. Gram+

Future directions: ability to describe complex hierarchies, which will allow modeling of fusion events ABC_1 DEFG ABC_2 DEFG_1 DEFG_2

Credits: Steve Bryant Lewis Geer Siqian He David Hurwitz Christopher Lanczycki Charlie Liu Tom Madej Anna Panchenko Ben Shoemaker Vahan Simonyan Paul Thiessen Yanli Wang John Anderson Natalie Fedorova John Jackson Aviva Jacobs Cynthia Liebert Gabriele Marchler Raja Mazumder B. Sridhar Rao Carol De. Weese-Scott James Song Sona Vasudevan Roxanne Yamashita Jodie Yin PFAM SMART COGs BLAST team Entrez team Taxonomy team NCBI Help-Desk