Sequence Based Analysis Tutorial
NIH Proteomics Workshop
Lai-Su L. Yeh, Ph.D.
Protein Information Resource at Georgetown University Medical Center
Retrieval, Sequence Search & Classification Methods
- Retrieve protein info by text / UID
- Sequence Similarity Search
  - BLAST, FASTA, Dynamic Programming
- Family Classification
  - Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
- Integrated Search and Classification System
Sequence Similarity Search (I)
Based on Pair-Wise Comparisons
- Dynamic Programming Algorithms
  - Global Similarity: Needleman-Wunsch
  - Local Similarity: Smith-Waterman
- Heuristic Algorithms
  - FASTA: Based on K-Tuples (2-Amino Acid)
  - BLAST: Triples of Conserved Amino Acids
  - Gapped-BLAST: Allow Gaps in Segment Pairs
  - PHI-BLAST: Pattern-Hit Initiated Search
  - PSI-BLAST: Position-Specific Iterated Search
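The two dynamic programming recurrences named above differ only in how the score table is initialised and whether cell scores are clamped at zero. The following is a minimal sketch, not the implementation used by any PIR tool: the function name `align_score`, the match/mismatch values, and the linear gap penalty are all illustrative assumptions (real protein searches use a substitution matrix and usually affine gaps).

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False uses the global (Needleman-Wunsch) recurrence;
    local=True uses the local (Smith-Waterman) recurrence.
    The scoring values are assumptions chosen for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for leading gaps in either sequence.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                       table[i - 1][j] + gap,       # gap in b
                       table[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# A short peptide embedded in a longer sequence:
local_score = align_score("GGMKTAYIAKGG", "MKTAYIAK", local=True)
global_score = align_score("GGMKTAYIAKGG", "MKTAYIAK")
```

With these assumed scores the local score counts only the eight matching residues, while the global score is also charged for the four flanking gaps.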
Sequence Similarity Search (II)
- Similarity Search Parameters
  - Scoring Matrices: Based on Conserved Amino Acid Substitution
    - Dayhoff Mutation Matrix, e.g., PAM 250 (~20% Identity)
    - Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM 62
  - Gap Penalty
- Search Time Comparisons
  - Smith-Waterman: 10 Min
  - FASTA: 2 Min
  - BLAST: 20 Sec
Feature Representation
- Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
- Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues
Substitution Matrix
- Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
- Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
- Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
- High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
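To make the scoring idea concrete, here is a small sketch that sums substitution scores over an ungapped alignment. Only the Gly/Trp (-7) and Lys/Arg (+3) entries come from the slide; the identity scores in `TOY_MATRIX` are assumed values in the style of a PAM-type matrix, and the names `pair_score` and `ungapped_score` are hypothetical helpers.

```python
# Partial, illustrative matrix. Keys are unordered residue pairs.
TOY_MATRIX = {
    ("G", "W"): -7,   # from the slide: unlikely substitution
    ("K", "R"): 3,    # from the slide: conservative substitution
    ("W", "W"): 17,   # assumed: rare residue, high identity score
    ("C", "C"): 12,   # assumed: rare residue, high identity score
    ("G", "G"): 5,    # assumed
    ("K", "K"): 5,    # assumed
    ("R", "R"): 6,    # assumed
}


def pair_score(x, y, matrix=TOY_MATRIX):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # raises KeyError if the pair is not tabulated


def ungapped_score(s1, s2, matrix=TOY_MATRIX):
    """Sum substitution scores over two equal-length aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(x, y, matrix) for x, y in zip(s1, s2))


# G/W (-7) + K/R (+3) + W/W (+17)
total = ungapped_score("GKW", "WRW")
```

A full matrix would tabulate all 210 unordered residue pairs; the lookup logic stays the same.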
BLAST
BLAST (Basic Local Alignment Search Tool)
- Extremely fast
- Robust
- Most frequently used
It finds very short segment pairs ("seeds") between the query and the database sequence. These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached.
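The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself: it seeds on exact word matches only (no neighbourhood words or score threshold), uses assumed match/mismatch scores instead of a substitution matrix, and extends to the end of the sequences while remembering the best score rather than applying a drop-off cutoff. The function names are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, qi, sj, step, match, mismatch):
    """Walk from (qi, sj) in direction step; return (best gain, length at best)."""
    best = score = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-2):
    """Extend a seed without gaps in both directions.

    Returns (score, (start, end)) where the span indexes the query.
    """
    right, right_len = extend_one_way(query, subject, i + w, j + w, 1, match, mismatch)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1, match, mismatch)
    return w * match + left + right, (i - left_len, i + w + right_len)


query, subject = "AAKLMNPQ", "GGKLMNPW"
seeds = find_seeds(query, subject)
score, (start, end) = extend_seed(query, subject, *seeds[0])
segment = query[start:end]
```

Here all three seeds lie on the same diagonal, so extending any of them recovers the same segment pair.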
BLAST Search
- From BLAST Search Interface
- Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment
Screenshot callouts:
- Links to iProClass and UniProtKB reports
- Link to NCBI taxonomy
- Link to PIRSF report
- Click to see SSearch alignment
- Click to see alignment
BLAST Result & Pairwise Alignment
- BLAST Alignment
Classification
- What is classification?
- Why do we need protein classification?
- Different levels of classification
- Basis for functional protein classification
- How to classify a protein of unknown function?
Classification Databases
- Protein motif: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H (the 2 C's and the 2 H's are zinc ligands)
- Protein domain: Group proteins according to the presence of a common domain
- 3-D structure: Group proteins according to common 3D structure
- Whole-protein: Group proteins according to common domain architecture and length
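A PROSITE-style pattern such as the zinc-ligand motif above maps almost directly onto a regular expression. The sketch below handles the common elements (single residues, `x` wildcards, `[...]` alternatives, `{...}` exclusions, repeat counts, and `<`/`>` anchors); the function name `prosite_to_regex` is a hypothetical helper and the test sequence is synthetic, built only to satisfy the pattern.

```python
import re

ELEMENT = re.compile(
    r"(<?)([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, low, high, end = m.groups()
        if core in ("x", "X"):
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(pieces)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")

# Synthetic sequence: flanking residues plus one stretch that fits the motif.
motif = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
hit = re.search(regex, "GG" + motif + "GG")
```

Scanning one pattern against a sequence database is then a loop of `re.search` (or `re.finditer`) over the database entries.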
Family Classification Methods
- Based on Other Classification Information
- Multiple Sequence Alignment (ClustalW)
- ProSite Pattern Search
- Profile Search
- Hidden Markov Models (HMMs)
  - Domain (Pfam); Whole protein (PIRSF)
- Neural Networks
How do you build a tree?
- Pick sequences to align
- Align them
- Verify the alignment
- Keep the parts that are aligned correctly
- Build and evaluate a phylogenetic tree
- Integrated Analysis
Multiple Sequence Alignment
- ClustalW Progressive Pairwise Approach
  - Based on Exhaustive Pairwise Alignments
- Neighbor Joining
  - Joining Order Corresponding to a Tree
- Alignment Varies
  - Dependent on Joining Order
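The joining order that drives a progressive alignment can be derived from a pairwise distance matrix with the neighbor-joining criterion. The sketch below shows only that step, under stated assumptions: the five-taxon distance matrix is an invented illustration (it is not data from the tutorial), the function name `neighbor_joining_order` is hypothetical, and branch lengths are not computed.

```python
from itertools import combinations


def neighbor_joining_order(names, dist):
    """Return the sequence of pairs joined by neighbor joining.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)]
                   - total[p[0]] - total[p[1]])
        merged = "(" + a + "," + b + ")"
        for k in nodes:
            if k not in (a, b):
                # Distance from the new internal node to every other node.
                d[frozenset((merged, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))]
                    - d[frozenset((a, b))])
        nodes = [k for k in nodes if k not in (a, b)] + [merged]
        joins.append((a, b))
    joins.append(tuple(nodes))  # the last two nodes are joined directly
    return joins


# Illustrative distances only (assumed, e.g. derived from pairwise alignments).
names = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
dist = {frozenset((names[i], names[j])): matrix[i][j]
        for i in range(len(names)) for j in range(i + 1, len(names))}
joins = neighbor_joining_order(names, dist)
```

Ties in the criterion are broken here by iteration order, which is one concrete reason the resulting alignment can vary with the joining order.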
Multiple Alignment and Tree
- From Text/Sequence Search Result or ClustalW Alignment Interface
Motif Patterns (Regular Expressions)
- Signature Patterns for Functional Motifs
- ProClass Motif Alignments
PIR Pattern Search
- From Text/Sequence Search Result or Pattern Search Interface
- One Query Sequence Against PROSITE Pattern Database
- One Query Pattern (PROSITE or User-Defined) Against Sequence DB
Pattern Search Result (I)
- One Query Sequence Against PROSITE Pattern Database
Pattern Search Result (II)
- One Query Pattern Against Sequence Database
Screenshot callouts:
- Display the query pattern
- Sorting arrows
- Links to iProClass and UniProtKB reports
- Link to NCBI taxonomy
- Link to PIRSF report
Profile Method
- Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  - Num of Rows = Num of Aligned Positions
  - Each row contains a score for the alignment with each possible residue.
- Profile Searching
  - Summation of Scores for Each Amino Acid Residue along Query Sequence
  - Higher Match Values at Conserved Positions
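The profile described above can be sketched as a list of per-position score rows built from an ungapped alignment, with a query scored by summing one entry per row. Everything specific here is an assumption for illustration: the log-odds scoring against a uniform background, the pseudocount of 1, the toy three-sequence alignment, and the helper names.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build one row of log-odds scores per aligned position."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    profile = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        total = len(column) + pseudocount * len(ALPHABET)
        row = {aa: math.log2(((column.count(aa) + pseudocount) / total)
                             / background)
               for aa in ALPHABET}
        profile.append(row)
    return profile


def score_window(profile, segment):
    """Sum the profile scores of the residues in an equal-length segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))


def best_match(profile, query):
    """Slide the profile along the query; return (best score, start index)."""
    width = len(profile)
    return max((score_window(profile, query[i:i + width]), i)
               for i in range(len(query) - width + 1))


# Toy alignment: positions 1, 3 and 4 are conserved, position 2 varies.
profile = build_profile(["CAHC", "CGHC", "CSHC"])
best_score, best_start = best_match(profile, "MKCAHCLL")
```

Conserved columns produce larger scores for their consensus residue than variable columns do, which is what gives the higher match values at conserved positions.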
PIRSF scan
- Search One Query Protein Against all the Full-length and Domain HMMs for the fully curated PIRSFs by HMMER
- The matched regions and statistics will be displayed.
Screenshot callouts:
- Shows PIRSF that the query belongs to
- Statistical data for all domains
- Statistical data per domain
- Alignment with consensus sequence
Secondary Structure Features
- α Helix: Patterns of Hydrophobic Residue Conservation Showing i, i+3, i+4, i+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
- β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions i, i+2, i+4, i+6
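The two periodicities above can be checked mechanically. The sketch below is a simplification: it looks at a single sequence rather than at conservation across an alignment, and the membership of `HYDROPHOBIC` is an assumed residue grouping rather than one given in the tutorial. The name `pattern_starts` is a hypothetical helper.

```python
HYDROPHOBIC = set("AVILMFWC")   # assumed grouping, for illustration only
HELIX_OFFSETS = (0, 3, 4, 7)    # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)   # i, i+2, i+4, i+6


def pattern_starts(seq, offsets):
    """Return every index i where all residues at i + offset are hydrophobic."""
    span = max(offsets)
    return [i for i in range(len(seq) - span)
            if all(seq[i + k] in HYDROPHOBIC for k in offsets)]


helix_hits = pattern_starts("LKELLKKL", HELIX_OFFSETS)
strand_hits = pattern_starts("VKVEVKV", STRAND_OFFSETS)
```

A hit is only suggestive; in practice the signal is much stronger when the same positions are hydrophobic across many aligned family members.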
3D Structure
- Proteins share the same fold, suggesting homology
- Figure labels: Gamma Crystallin C; Beta B1 Crystallin
Creation and Curation of PIRSFs
Integrated Bioinformatics System for Function and Pathway Discovery
- Data Integration
- Associative Analysis
Family Classification & Functional Analysis: Analytical Pipeline (diagram labels)
- Query Sequence; UniProt
- BLAST Search, HMM Domain Search: Top-Matched Superfamilies/Domains
- HMM Motif Search, Pattern Search, SignalP/TMHMM: Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
- SSEARCH, CLUSTALW: Superfamily/Domain/Motif Alignments
- Family Relationships & Functional Features
Integrated Bioinformatics System
- Global Bioinformatics Analysis of 1000s of Genes and Proteins
- Pathway Discovery, Target Identification
Lab Section
Text Search
Text Search Result (I)
- Extend your search or start over
- Choose columns to be displayed
- Expand view
- Pre-computed BLAST Results
- Links to iProClass and UniProtKB reports
- Link to NCBI taxonomy
- Link to PIRSF report
Text Search Result (III)
- Number of Related Seq. at 3 different E-value cut-offs
Text Search Result (II)
- Extend your search or start over
- Choose columns to be displayed
- Link to PIRSF report
- Curated domain architecture with links to Pfam database
- Extent of family curation
Peptide Search
Peptide Search & Results
- Sorting arrows
- Links to iProClass and UniProtKB reports
- Link to NCBI taxonomy
- Link to PIRSF report
- Matching peptide highlighted in the sequence
Batch Retrieval Results (I)
- Retrieve more sequences
- Choose columns to be displayed
- Links to iProClass and UniProtKB reports
Batch Retrieval Results (II)
- Retrieve more families
- Choose columns to be displayed
- Links to PIRSF reports
- Curated domain architecture (N- to C-termini) with links to Pfam database
BLAST Similarity Search
BLAST / Related Sequences Results
BLAST Result & Pairwise Alignment
- BLAST Alignment
Pairwise Alignment
Multiple Alignment
- Interactive Phylogenetic Tree and Alignment
Phylogenetic Tree and Alignment View
Pattern Search (I)
Pattern Search (II)
- Display the query pattern
- Sorting arrows
- Links to iProClass and UniProtKB reports
- Link to NCBI taxonomy
- Link to PIRSF report
PIRSF scan
PIRSF Report
PIRSF Family Hierarchy
Taxonomic Distribution & Phylogenetic Pattern
Rabbit Alpha Crystallin A Chain
- An iProClass View of the entry
- Pre-computed BLAST results
- See protein synonyms
- See IDs from different databases
alpha-Crystallin and Related Proteins