NIGMS Protein Structure Initiative: Target Selection Workshop
ADDA and remote homologue detection
Liisa Holm, Institute of Biotechnology, University of Helsinki

Definitions
• Nrdbxx = nrdb where no two sequences are more than xx% identical; redundant sequences are mapped to a representative
  – UniProt + GenPept + PIR + PDB + …
  – Nrdb100
  – Nrdb90
  – …
  – Nrdb40
  – Nrdb30 = “modelling family”
• PairsDB = database of all-against-all comparisons
  – Blast in nrdb90, PSI-Blast in nrdb40
• BIG = family detected by profile comparison
  – Profile needs a seed set (alignment); automatic iterative profile construction has poor convergence
  – Profiles: partially overlapping neighbour sets
  – Need to cluster sequences
  – Clustering artefacts when the true cluster shape is non-spherical
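The Nrdbxx definition can be illustrated with a minimal sketch. The slides do not say how the representatives are chosen, so the greedy, longest-first selection below is an assumption, and `percent_identity` is a toy ungapped measure standing in for a real alignment. The names `build_nrdb` and `toy` are hypothetical.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy measure (assumption): percent of matching positions, compared
    ungapped from the first residue, over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def build_nrdb(
    seqs: Dict[str, str],
    xx: float,
    identity: Callable[[str, str], float] = percent_identity,
) -> Dict[str, str]:
    """Map every sequence id to a representative id such that no two
    representatives are more than xx % identical (greedy selection)."""
    reps: List[str] = []
    mapping: Dict[str, str] = {}
    # Longest sequences first (ties broken by id) so representatives are long.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in reps:
            if identity(seqs[sid], seqs[rep]) > xx:
                mapping[sid] = rep  # redundant: map to the representative
                break
        else:
            reps.append(sid)        # new representative
            mapping[sid] = sid
    return mapping


toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQR",  # exact duplicate of a
    "c": "MKTAYIAKLL",  # 80 % identical to a
    "d": "MKTAYLLLLL",  # 50 % identical to a, 70 % to c
    "e": "GGGGGGGGGG",  # unrelated
}
nrdb90 = build_nrdb(toy, 90)
nrdb40 = build_nrdb(toy, 40)
print(sorted(set(nrdb90.values())), sorted(set(nrdb40.values())))
```

Lowering xx merges more sequences into each representative, which is why the number of families shrinks from nrdb100 towards nrdb30.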

(graph) covering ≠ clustering ≠ classification
• Incomplete detection of homologous set by profile models
• Example: Urease et al. superfamily
[Figure: two panels, IDEAL and REAL]

ADDA: clustering of domains into families
• ADDA = Automatic Domain Definition Algorithm
  – Heger & Holm (2003) J Mol Biol 328, 749-767.
  – Heger et al. (2005) Nucleic Acids Res. 33, Database Issue, D188-D191.
• Principles of ADDA
  – Blast all-against-all comparison in nrdb90
  – Domains are optimally covered by alignments
    • Complete domain coverage; every residue belongs to a domain
  – Minimum/maximum spanning tree of domains
  – Remove links where profile-profile score is below threshold
  – Connected components are domain families
• Quality assessment
  – Most ADDA families are pure, containing one PFAM family or SCOP superfamily (plus previously unclassified members)
  – Occasionally members from different PFAM families are merged in one ADDA family (contamination or PFAM misclassification)
  – Domain size distribution is reasonable
    • For example, much less over-fragmentation than by the ProDom algorithm
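The last three principles (spanning tree, link removal, connected components) can be sketched as follows. This is only an illustration of the idea under stated assumptions, not the published ADDA implementation: the edge weights, the `profile_score` callback, the threshold value and all names are hypothetical, and the tree is built with Kruskal's algorithm as one possible choice.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, float]  # (domain, domain, alignment-based weight)


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self, items: List[str]) -> None:
        self.parent: Dict[str, str] = {x: x for x in items}

    def find(self, x: str) -> str:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True


def maximum_spanning_tree(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Kruskal's algorithm, taking the heaviest edges first."""
    ds = DisjointSet(nodes)
    return [e for e in sorted(edges, key=lambda e: -e[2]) if ds.union(e[0], e[1])]


def domain_families(
    nodes: List[str],
    edges: List[Edge],
    profile_score: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Keep tree links whose profile-profile score reaches the threshold;
    the connected components that remain are the families."""
    ds = DisjointSet(nodes)
    for u, v, _ in maximum_spanning_tree(nodes, edges):
        if profile_score(u, v) >= threshold:
            ds.union(u, v)
    groups: Dict[str, List[str]] = {}
    for n in nodes:
        groups.setdefault(ds.find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy data (hypothetical): five domains, pairwise alignment weights.
nodes = ["d1", "d2", "d3", "d4", "d5"]
edges = [("d1", "d2", 50), ("d2", "d3", 40), ("d1", "d3", 10),
         ("d3", "d4", 30), ("d4", "d5", 60)]
scores = {frozenset(("d1", "d2")): 0.9, frozenset(("d2", "d3")): 0.8,
          frozenset(("d3", "d4")): 0.1, frozenset(("d4", "d5")): 0.95}
families = domain_families(nodes, edges,
                           lambda u, v: scores[frozenset((u, v))], 0.5)
print(families)
```

Restricting the profile-profile checks to tree links means only one comparison per domain is needed rather than one per aligned pair, which is one plausible reason for building the tree first.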

ADDA purity and domain size
[Figure: two panels, PFAM and SCOP]
Accuracy of domain boundaries
– Red: best possible in domain tree
– Black: actually selected

3D coverage of model proteomes
• PDB entries from May 2006
  – Required greater than 80% overlap between PDB sequence and ADDA domain to call a family structurally covered
• ADDA domain families
  – BIG families
    • 28429 families have more than ten members in nrdb100
      – 2383 structurally covered BIG families
    • 8820 families have more than ten members in nrdb40
      – 1869 structurally covered BIG families
• NCBI genome sets
  – H. sapiens, C. elegans, D. melanogaster, A. thaliana, E. coli, B. anthracis, T. maritima
  – Mapped to ADDA families
    • 6770 BIG(nrdb40) families occur in model genome set
      – 1705 structurally covered
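A small sketch of the coverage criterion and of the fractions implied by the counts above. The slide does not define how overlap is measured; the version here (shared residues divided by the longer of the two segments) is an assumption, as are the function names and the example intervals. The family counts are the ones quoted on the slide.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # (first residue, last residue), 1-based, inclusive


def overlap_fraction(domain: Interval, pdb: Interval) -> float:
    """Shared residues divided by the longer segment (assumed definition)."""
    shared = min(domain[1], pdb[1]) - max(domain[0], pdb[0]) + 1
    if shared <= 0:
        return 0.0
    longer = max(domain[1] - domain[0] + 1, pdb[1] - pdb[0] + 1)
    return shared / longer


def is_structurally_covered(domain: Interval, pdb: Interval,
                            min_overlap: float = 0.8) -> bool:
    """Apply the 'greater than 80 % overlap' rule from the slide."""
    return overlap_fraction(domain, pdb) > min_overlap


# (structurally covered, total) as given on the slide.
counts: Dict[str, Tuple[int, int]] = {
    "BIG families, nrdb100": (2383, 28429),
    "BIG families, nrdb40": (1869, 8820),
    "BIG(nrdb40) families in model genomes": (1705, 6770),
}
covered_percent = {name: round(100 * c / t, 1) for name, (c, t) in counts.items()}

print(is_structurally_covered((10, 109), (20, 115)))  # 90 shared of 100
print(is_structurally_covered((10, 109), (60, 159)))  # 50 shared of 100
for name, pct in covered_percent.items():
    print(f"{name}: {pct} % structurally covered")
```

On these counts roughly a quarter of the BIG(nrdb40) families seen in the seven model genomes already have a structure, against under a tenth of all BIG families in nrdb100.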

Model genome coverage – 28429 BIG families in nrdb100
T. maritima would be covered by 1000 BIG families and is two-thirds done

6770 BIG families in nrdb40
2.0-2.4 domains per eukaryotic gene; 1.3 domains per prokaryotic gene
Multigene families in eukaryotes

Seven model genomes
Human BIG target families are almost exclusively eukaryote-specific
Universal BIG families are almost covered
[Figure: two three-set diagrams over Human; Worm, fly, plant; and Prokaryotes (E. coli, B. anthracis, T. maritima). One shows the 5065 white BIG target families, the other the 1705 structurally covered BIG families. Region counts in extraction order, assignment to regions not recoverable: 638, 802, 2412, 57, 535, 187, 161, 29, 88, 588, 18, 155, 836, 264]

Covering all modelling families will have astronomical cost
• Nrdbxx updates; Nrdb30 = “modelling family”

Fine-grained coverage
• MF: Structural core shrinks rapidly below 30% sequence identity
  – Need less naïve modelling software capable of building ab initio those parts which are not covered by the template
  – Misalignment is a major source of error
• Transitive alignment covers more of the structurally equivalent core

Error        Rmsd/Å
Template     32
Misaligned   16
Loops        8
Backbone     4
Rotamers     2

Average coverage of structural core (152 pairs in 11 superfamilies):
Transitive                      51%
Global alignment (HMMer)        43%
Local alignment (PSI-Blast)     34%
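The idea behind transitive alignment can be sketched by composing two pairwise alignments through an intermediate sequence and measuring how much of a reference structural core is recovered. The representation (residue-index maps), the function names and the toy data are assumptions for illustration; the method evaluated on the slide is not specified in this detail.

```python
from typing import Dict, Set, Tuple

Alignment = Dict[int, int]  # residue index in first sequence -> index in second


def compose(ab: Alignment, bc: Alignment) -> Alignment:
    """Chain A->B and B->C into A->C; residues whose partner in B
    is not aligned to C are dropped."""
    return {i: bc[j] for i, j in ab.items() if j in bc}


def core_coverage(predicted: Alignment, core: Set[Tuple[int, int]]) -> float:
    """Fraction of structurally equivalent residue pairs that the
    predicted alignment reproduces exactly."""
    if not core:
        return 0.0
    hits = sum(1 for i, k in core if predicted.get(i) == k)
    return hits / len(core)


# Toy example (hypothetical): structural core of four equivalent pairs A~C.
core = {(1, 1), (2, 2), (3, 3), (4, 4)}
direct_ac = {1: 1, 2: 2}                   # direct alignment finds half the core
a_to_b = {1: 11, 2: 12, 3: 13, 4: 15}      # A aligned to intermediate B
b_to_c = {11: 1, 12: 2, 13: 3, 14: 4}      # B aligned to C
via_b = compose(a_to_b, b_to_c)

print(core_coverage(direct_ac, core), core_coverage(via_b, core))
```

In the toy data the intermediate sequence recovers one extra core position that the direct alignment misses, while the fourth position is lost because the two alignments disagree about which residue of B is involved.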

Coarse-grained coverage
• BIG/BIGGER: Homology detection
  – Difficulty of aligning remote homologues
    • Shared sequence motifs suggest conserved biochemical mechanism
    • Functional classification
  – Sequence comparison only detects half of remote homologue pairs
    • Structure comparison reveals missing links
    • Transitive search for conserved motifs detects more remote homologues than profile-profile comparison

Clustering PFAM families
• Comparison of ADDA to PFAM-A resulted in extension but no discovery of completely new large families
• PFAM-A v. 19: 7340 families, 2451 covered according to PFAM’s assignments, 1396 families in 205 clans
• Our method achieved 30% coverage of clan relationships at 5% error rate, compared to 23% coverage at 5% error rate by profile-profile comparison
  – 1083 unclassified PFAMs linked to 205 known clans
    • 1219 white PFAMs linked to known structure in 155 clusters
  – 1256 PFAMs clustered in 470 predicted clans
    • 336 white PFAMs linked to known structure in 222 clusters
  – 3610 PFAMs remained singletons
    • 2352 white PFAMs
• 2451 covered, ~1555 fold assignments (1219 + 336), ~3334 targets (7340 − 2451 − 1555)
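A figure such as "30% coverage at 5% error rate" can be computed from a ranked list of predicted relationships. The sketch below assumes that the error rate is the fraction of false pairs among all pairs accepted above a score cutoff, and that coverage is the fraction of all true pairs accepted; the slides do not spell out either definition, and the function name and toy data are hypothetical. Tied scores are not treated specially.

```python
from typing import List, Tuple


def coverage_at_error(scored_pairs: List[Tuple[float, bool]],
                      max_error: float = 0.05) -> float:
    """Walk down the ranking and return the coverage at the deepest
    cutoff whose cumulative error rate does not exceed max_error."""
    total_true = sum(1 for _, is_true in scored_pairs if is_true)
    if total_true == 0:
        return 0.0
    tp = fp = best_tp = 0
    for _, is_true in sorted(scored_pairs, key=lambda p: -p[0]):
        if is_true:
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= max_error:
            best_tp = tp
    return best_tp / total_true


# Toy ranking (hypothetical): labels listed from best to worst score.
labels = [True] * 30 + [False] + [True] * 5 + [False] * 10 + [True] * 15
ranked = [(float(len(labels) - i), lab) for i, lab in enumerate(labels)]
print(coverage_at_error(ranked))
```

With these labels the last acceptable cutoff falls after 35 true and 1 false pair, so 35 of the 50 true relationships are recovered at an error rate below 5%.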

Conclusions
• ADDA ~3000 human target families
  – ~40% reduction in number of PFAM target families by fold assignment (based on sequence only)
• Coarse-grained coverage yields information out of reach to sequence comparison
  – Need to improve measures of sequence similarity to infer homology
    • Sequence motif-based functional classification
  – Need to increase the radius of convergence in template-based structure prediction
• Protein complexes: hypothesis-driven research
  – Large conformational changes
  – Multigene receptor-ligand pair discrimination involves rotations in docking orientation

Acknowledgements
• Andreas Heger, Oxford University
• Swapan Mallick, Ashwin Sivakumar, Chris Wilton, Institute of Biotechnology
• Funding: Academy of Finland, Sigrid Juselius Foundation, EU