
Protein Structural Classification

• Structural Classification Databases – SCOP, CATH, FSSP
• Sequence pairwise comparison – Smith-Waterman, BLAST, PSI-BLAST, rank propagation, SAM-T98
• Discriminative classification – SVM pairwise, mismatch kernel, EMOTIF kernel, I-Site kernel, semi-supervised kernel

SCOP
[Figure: SCOP hierarchy (fold, superfamily, family), marking the positive and negative training and test sets]
• Family: sequence identity > 30%, or functions and structures are very similar
• Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
• Common fold: same major secondary structures in the same arrangement with the same topological connections

CATH
• Class
• Architecture
• Topology
• Homologous superfamily
• Sequence family

Local alignment: Smith-Waterman algorithm
• For two strings x and y, a local alignment with gaps is:
• The score is:
• Smith-Waterman score:
Thanks to Jean Philippe
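The Smith-Waterman score can be illustrated with a short dynamic program. This is a minimal sketch, not the formulation from the slide: it assumes a simple match/mismatch score and a linear gap penalty (the function name and default values are illustrative), whereas protein searches normally use a substitution matrix such as BLOSUM62 with affine gap costs.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings x and y.

    Assumed scoring: +match for identical residues, mismatch otherwise,
    and a linear penalty `gap` per gapped position.
    """
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align x[i-1] with y[j-1]
                H[i - 1][j] + gap,      # gap in y
                H[i][j - 1] + gap,      # gap in x
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("XXACGTXX", "YYACGTYY"))  # 8: the shared ACGT block
```

The reset to 0 is what makes the alignment local: a poorly scoring prefix is dropped instead of being carried into the rest of the alignment.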

BLAST: a heuristic algorithm for matching DNA/protein sequences
• Idea: true matches are likely to contain a short stretch of identity
• Build a list of ‘neighborhood words’ of the query sequence
• Search the database with the list; whenever there is a match, do a ‘hit extension’, stopping at the maximum-scoring extension
Altschul, Madden, Schäffer, Zhang et al., 1997
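The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration rather than BLAST itself: it assumes exact k-mer matches as seeds (real BLAST also accepts neighborhood words that score above a threshold under a substitution matrix), a +1/-1 score, and an ungapped extension that stops once the running score drops more than `x_drop` below the best seen. The function names and parameters are hypothetical.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-mer is shared."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_hit(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of the seed at (i, j).

    Returns (score, q_start, q_end) with q_end exclusive, keeping the
    maximum-scoring extension found in each direction.
    """
    def score(a, b):
        return match if a == b else mismatch

    total = sum(score(query[i + t], subject[j + t]) for t in range(k))

    # Extend to the right of the seed.
    best, run, end = 0, 0, i + k
    qi, sj = i + k, j + k
    while qi < len(query) and sj < len(subject):
        run += score(query[qi], subject[sj])
        if run > best:
            best, end = run, qi + 1
        if best - run > x_drop:
            break
        qi, sj = qi + 1, sj + 1
    total += best

    # Extend to the left of the seed.
    best, run, start = 0, 0, i
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        run += score(query[qi], subject[sj])
        if run > best:
            best, start = run, qi
        if best - run > x_drop:
            break
        qi, sj = qi - 1, sj - 1
    total += best

    return total, start, end


query, subject = "MKVLAAGIT", "QQKVLAAGQQ"
i, j = find_seeds(query, subject)[0]
score, start, end = extend_hit(query, subject, i, j)
print(score, query[start:end])  # 6 KVLAAG
```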

PSI-BLAST: Position-specific iterated BLAST
• Only extend those double hits within a certain range.
• A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions.
• PSI-BLAST can take a PSSM as input to search the database
Altschul, Madden, Schäffer, Zhang et al., 1997

Local and Global Consistency
• Affinity matrix
• D is a diagonal matrix
• Iterate
• F* is the limit of the sequence {F(t)}
Zhou, Bousquet, Lal, Weston, and Schölkopf, 2003
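The iteration can be sketched as follows, using the update as this method is usually stated: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with W_ii = 0, D_ii = sum_j W_ij, S = D^(-1/2) W D^(-1/2), and F(t+1) = alpha S F(t) + (1 - alpha) Y, where Y holds the known labels. The function name, the default values of `sigma` and `alpha`, and the fixed iteration count are assumptions made for illustration.

```python
import math


def local_global_consistency(points, labels, n_classes,
                             sigma=1.0, alpha=0.9, n_iter=200):
    """Label propagation sketch; labels[i] is a class index or None."""
    n = len(points)

    # Affinity matrix W with a Gaussian kernel and zero diagonal.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))

    # D is diagonal with D_ii = sum_j W_ij; S = D^(-1/2) W D^(-1/2).
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
         for i in range(n)]

    # Y encodes the initial labels; unlabeled rows are all zero.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]

    # Iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y.
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)]
             for i in range(n)]

    # Each point takes the class with the largest entry in its row of F.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


points = [(0, 0), (0.5, 0), (1, 0), (10, 0), (10.5, 0), (11, 0)]
labels = [0, None, None, 1, None, None]
print(local_global_consistency(points, labels, n_classes=2))
```

With one labeled point in each of two well-separated clusters, every unlabeled point should take the label of its own cluster. The sketch assumes every point has at least one neighbor with nonzero affinity, so no entry of D is zero.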

Rank propagation
• Protein similarity network:
– Graph nodes: protein sequences in the database
– Directed edges: an exponential function of the PSI-BLAST E-value (destination node as query)
– Activation value at each node: the similarity to the query sequence
• Exploit the structure of the protein similarity network
Weston, Elisseeff, Zhou, Leslie and Noble, 2004
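A loose sketch of activation spreading on such a network is given below. It is an illustration in the spirit of the bullets above, not the published algorithm: the edge weight exp(-E/sigma), the global rescaling of weights, the update rule, and all names and defaults are assumptions. `evalue[i][j]` stands for the (hypothetical) E-value of sequence j when sequence i is used as the query.

```python
import math


def rank_propagation(evalue, query, sigma=100.0, alpha=0.95, n_iter=20):
    """Spread activation from `query` over a protein similarity network."""
    n = len(evalue)

    # w[j][i]: weight of the directed edge j -> i, computed from the
    # E-value obtained with the destination node i as the query.
    w = [[0.0 if i == j else math.exp(-evalue[i][j] / sigma)
          for i in range(n)]
         for j in range(n)]

    # Assumed normalization: rescale globally so that incoming weights
    # sum to at most 1 at every node, which keeps the iteration bounded.
    scale = max(sum(w[j][i] for j in range(n)) for i in range(n))
    if scale > 0:
        w = [[v / scale for v in row] for row in w]

    # The query is clamped to activation 1; all other nodes start at 0.
    y = [0.0] * n
    y[query] = 1.0
    for _ in range(n_iter):
        new_y = y[:]
        for i in range(n):
            if i == query:
                continue
            spread = sum(w[j][i] * y[j] for j in range(n) if j != query)
            new_y[i] = w[query][i] + alpha * spread
        y = new_y
    return y


# Toy network: 0 ~ 1 and 1 ~ 2 are strong hits, 3 is unrelated to all.
e = [[1000.0] * 4 for _ in range(4)]
e[0][1] = e[1][0] = e[1][2] = e[2][1] = 1e-10
scores = rank_propagation(e, query=0)
print(sorted(range(4), key=lambda i: -scores[i]))
```

In the toy network, sequence 2 has no direct hit to the query but is reached through sequence 1, so it should rank above the unrelated sequence 3. This is the sense in which the network structure is exploited.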

SAM-T98
• First iteration: use the query sequence to search the NR database with WU-BLASTP and build an alignment of the homologs found.
• 2nd-4th iterations: use the alignment from the previous iteration to find more homologs with WU-BLASTP, and update the alignment with the new homologs.
• Build an HMM from the final alignment. The HMM of the query sequence is used to search the database, or the query sequence can be searched against an HMM database.
Karplus, Barrett and Hughey, 1999

To do it in a discriminative manner with SVM…

Fisher Kernel
• An HMM (or more than one) is built for each family
• Derive the kernel function from the Fisher scores of each sequence given an HMM H1:
Jaakkola, Diekhans and Haussler, 2000