Exploring Protein Sequences
You want to learn everything possible about your own protein sequence. Multiple sequence alignments of related sequences can be used to build consensus sequences of known families, domains, motifs or sites. Combining these predictions with primary biochemical data can provide valuable insights into protein structure and function.
Let's make a quick tour through:
- Patterns and Motifs
- Domains and domain databases
Celia van Gelder, CMBI, Radboud University, June 2006
©CMBI 2005
Exploring Protein Sequences
Part 1:
- Patterns and Motifs
- Profiles
- Hydropathy Plots
- Transmembrane helices
- (Antigenic Prediction)
- Signal Peptides
- Repeats
- (Coiled Coils)
Part 2:
- Protein Domains
- Domain databases
Patterns and Motifs (1)
• In a multiple sequence alignment (MSA), islands of conservation emerge
• These conserved regions (motifs, segments, blocks, features) are typically around 10-20 aa in length
• They tend to correspond to the core structural or functional elements of the protein
• Their conserved nature allows them to be used to diagnose family membership
Patterns and Motifs (2)
• A motif (or pattern or signature) is a regular expression that specifies which residues can be present at any given position.
• Motifs can contain
  - alternative residues
  - flexible regions
Example:
  C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
    CXXXCXGXPXXXXXC
  FGCAKLCAGFPLRRLPCFYG
Syntax: A-[BC]-X-D(2,5)-{EFG}-H
Means: A, then B or C, then anything, then 2-5 D's, then not E, F or G, then H
Patterns and Motifs (3)
• Motifs cannot contain
  - mismatches: it is an exact match or no match at all
  - gaps
Example (a "?" marks a position where a gap would be needed, so the pattern does not match):
  C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
    CXXCXGXPXXXXX-C
    | ?| | |     ?|
  FGCA-CAGFPLRRLPKCFYG
J. Leunissen
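The pattern syntax maps almost directly onto ordinary regular expressions. The following is a minimal sketch in Python; the helper name `prosite_to_regex` is made up for illustration, and it covers only the syntax elements shown on these slides (anchors and other extensions are not handled). It is run against the two example sequences from the slides, with the gap character removed from the second one.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper: handles x, [..], {..}, single residues and
    repeat counts such as (2) or (2,5); nothing else.
    """
    regex = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        # Separate the residue part from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core.lower() == "x":          # any residue
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        # [..] and single letters are already valid regex syntax.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        regex.append(core)
    return "".join(regex)

regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
hit = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")   # first slide example
miss = re.search(regex, "FGCACAGFPLRRLPKCFYG")   # second example, gap removed
print(regex)
print(hit.group() if hit else None)
print(miss)
```

The second sequence fails because one flexible region is too short and the other too long for x(2,5); a regular expression has no way to "almost" match.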
PROSITE
• PROSITE - A Dictionary of Protein Sites and Patterns
• 1328 patterns and 577 profiles/matrices (Dec 2005)
• For every pattern or profile there is documentation (e.g. PDOC00975) with:
  - information on taxonomic occurrence
  - domain architecture
  - function
  - 3D structure
  - main characteristics of the sequence
  - some references
PROSITE Pattern
• PROSITE patterns consist of an exact regular expression
• Short patterns, such as post-translational modification sites, match frequently in proteins by chance; a match does not mean the site is actually present
  ID   ASN_GLYCOSYLATION; PATTERN.
  DE   N-glycosylation site.
  PA   N-{P}-[ST]-{P}.
• Notice also the number of false positives and false negatives listed in the PROSITE record
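To see how easily such a short pattern matches, the N-glycosylation pattern can be scanned over a sequence. This is a small sketch, not the PROSITE scanning software; the function name and the test sequence are invented for illustration.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression. The lookahead (?=...)
# lets overlapping candidate sites be reported as well.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(sequence)]

# Hypothetical test sequence, not a real protein.
sites = find_n_glycosylation_sites("MNGTANPSNKSA")
print(sites)
```

Every hit is only a candidate: whether the asparagine is really glycosylated has to be established experimentally.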
PROSITE Pattern (2)
Profiles
• If regular expressions fail to define the motif properly, we need a profile.
• Profiles are specific representations that incorporate the entire information of a multiple sequence alignment.
• A profile is a position-specific scoring scheme: for each position in the sequence it holds 20 scores for the 20 residue types, and sometimes also two values for gap opening and gap elongation.
• Profiles provide a sensitive means of detecting distant sequence relationships
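A position-specific scoring scheme can be sketched in a few lines. The example below builds log-odds scores from a toy alignment; the uniform background frequency, the pseudocount of 1, the function names and the alignment itself are all assumptions for illustration, and gap penalties are left out. Real profile methods use more careful weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of 20 log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Toy alignment of a hypothetical motif.
profile = build_profile(["CAGFP", "CSGYP", "CAGWP", "CTGFP"])
for candidate in ("CAGFP", "CKGLP", "WWWWW"):
    print(candidate, round(score(profile, candidate), 2))
```

Unlike a pattern, the profile still gives a graded score to a segment with an unseen residue at a variable position, which is what makes profiles more sensitive.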
Hydropathy plots are designed to display the distribution of polar and apolar residues along a protein sequence. On the Kyte-Doolittle scale, a positive value indicates local hydrophobicity and a negative value suggests a water-exposed region on the surface of a protein. Hydropathy plots are generally most useful for predicting transmembrane segments and N-terminal secretion signal sequences.
Hydrophobicity is the most important characteristic of amino acids. It is the hydrophobic effect that drives proteins towards folding. Actually, it is all done by water. Water does not like hydrophobic surfaces. When a protein folds, exposed hydrophobic side chains get buried, releasing water from its sad duty to sit against the hydrophobic surfaces of these side chains. Water is very happy in bulk water because there it has on average 3.6 H-bonds and about six degrees of freedom. So, whenever we discuss protein structure, folding, and stability, it is all about the entropy of water, and that is called the hydrophobic effect.
Hydropathy scales
Sliding Window Approach
Sum an amino acid property (e.g. hydrophobicity values) over a given window, divide by the window length, and plot the value at the middle of the window:
  I L I K E I R: 4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40 => 5.40/7 = 0.77
Move to the next position in the sequence:
  L I K E I R Q: 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60 => -2.60/7 = -0.37
J. Leunissen
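The sliding-window calculation is easy to reproduce. The sketch below uses the Kyte-Doolittle values and the window of 7 from the worked example; the function name is arbitrary, and positions are reported 1-based at the window centre.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Return (centre position, mean hydropathy) for every full window."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half + 1, mean))   # 1-based centre
    return profile

profile = hydropathy_profile("ILIKEIRQ", window=7)
for centre, mean in profile:
    print(centre, round(mean, 2))
```

Residues closer than half a window to either end get no value, which is why hydropathy plots start and stop a few positions inside the sequence.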
Hydropathy plot for rhodopsin
The window size can be changed. A small window produces "noisier" plots that more accurately reflect highly local hydrophobicity. A window of about 19 is generally optimal for recognizing the long hydrophobic stretches that typify transmembrane segments.
Transmembrane Helices
Transmembrane proteins are integral membrane proteins that interact extensively with the membrane lipids.
Nearly all known integral membrane proteins span the lipid bilayer.
Hydropathy analysis can be used to locate possible transmembrane segments.
The main signal is a stretch of hydrophobic and helix-loving amino acids.
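A crude transmembrane scan follows directly from the hydropathy idea: slide a window of 19 residues along the sequence and flag stretches where the mean Kyte-Doolittle value is high. The sketch below is not one of the prediction servers; the cutoff of 1.6 is a commonly quoted value for this scale and window and should be treated as a tunable assumption, and the test sequence is synthetic.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def candidate_tm_segments(sequence, window=19, threshold=1.6):
    """Return half-open (start, end) ranges of 0-based window centres
    whose mean hydropathy is at or above the threshold."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    half = window // 2
    n_windows = len(values) - window + 1
    segments, start = [], None
    for i in range(n_windows):
        centre = i + half
        hydrophobic = sum(values[i:i + window]) / window >= threshold
        if hydrophobic and start is None:
            start = centre                      # a stretch begins
        elif not hydrophobic and start is not None:
            segments.append((start, centre))    # the stretch has ended
            start = None
    if start is not None:                       # stretch runs to the last window
        segments.append((start, n_windows + half))
    return segments

# Synthetic sequence: polar flanks around a 21-residue hydrophobic core.
polar = "DEKR" * 5
sequence = polar + "LIVA" * 5 + "L" + polar
segments = candidate_tm_segments(sequence)
print(segments)
```

A hydrophobic stretch found this way is only a candidate; buried helices in globular proteins and signal peptides can give similar peaks.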
Transmembrane Helices (2)
In an α-helix the rotation is 100 degrees per amino acid.
The rise per amino acid is 1.5 Å.
To span a membrane of 30 Å, approx. 30/1.5 = 20 amino acids are needed.
Transmembrane Helix Prediction Servers
1. KDD
2. TMpred (database TMbase)
3. DAS
4. TopPred II
5. TMHMM 2.0
6. MEMSAT 2
7. SOSUI
8. HMMTOP 2.0
Signal Peptides
Proteins have intrinsic signals that govern their transport and localization in the cell (nucleus, ER, mitochondria, chloroplasts).
Specific amino acid sequences determine whether a protein will pass through a membrane into a particular organelle, become integrated into the membrane, or be exported out of the cell.
Signal Peptides (2)
The common structure of signal peptides from various proteins is described as:
• a positively charged (N-terminal) n-region
• followed by a hydrophobic h-region (which can adopt an α-helical conformation in a hydrophobic environment)
• and a neutral but polar c-region (cleavage region; the signal sequence is cleaved off here after delivering the protein at the right site).
The (-3, -1) rule states that the residues at positions -3 and -1 (relative to the cleavage site) must be small and neutral for cleavage to occur correctly.
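The (-3, -1) rule can be written down as a simple check on a proposed cleavage site. This is a sketch only: exactly which residues count as "small and neutral" is a modelling choice, the set used here is an illustrative assumption, and the sequence is invented. It is not how SignalP or the other servers work.

```python
# Assumption: an illustrative set of small, uncharged residues.
SMALL_NEUTRAL = set("AGSCT")

def satisfies_minus3_minus1_rule(sequence, cleavage_index):
    """Check the (-3, -1) rule for a proposed cleavage site.

    cleavage_index is the 0-based index of the first residue of the
    mature protein (position +1); cleavage happens just before it.
    """
    if cleavage_index < 3:
        return False   # not enough residues upstream of the site
    return (sequence[cleavage_index - 3] in SMALL_NEUTRAL
            and sequence[cleavage_index - 1] in SMALL_NEUTRAL)

# Hypothetical precursor: n-region, h-region, c-region, mature protein.
precursor = "MKKLLLALALLASAQAADE"
print(satisfies_minus3_minus1_rule(precursor, 16))   # site after ...A-Q-A
print(satisfies_minus3_minus1_rule(precursor, 15))   # Q would sit at -1
```

The rule is necessary rather than sufficient: many positions in a protein satisfy it, so it is only meaningful next to a plausible n-region and h-region.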
Prediction of Signal Peptides
Prokaryotes and Eukaryotes:
- SignalP 3.0
- SPScan
- SigCleave
- PSORT
Eukaryotes:
- SIGFIND
- TargetP
Specific localization signals:
- PredictNLS - Nuclear Localization Signals
- ChloroP - Chloroplast transit peptides
- NetNES - Nuclear Export Signals
Repeats in proteins
• Although they are usually found in non-coding genomic regions, repeating sequences are also found within genes.
• They range from repeats of a single amino acid, through three-residue short tandem repeats (e.g. in collagen), to the repetition of homologous domains of 100 or more residues.
• Duplicated sequence segments occur in 14% of all proteins, but eukaryotic proteins are three times more likely to have internal repeats than prokaryotic proteins
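The two simplest kinds of repeat, runs of a single amino acid and collagen-like Gly-X-Y triplets, can be found with regular expressions. The sketch below is illustrative only; the function names, minimum lengths and sequences are assumptions, and detecting diverged domain-sized repeats needs alignment-based tools instead.

```python
import re

def find_homopolymer_runs(sequence, min_length=5):
    """Return (residue, 0-based start, length) for single-residue runs."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), len(m.group()))
            for m in pattern.finditer(sequence)]

def find_gxy_repeats(sequence, min_units=3):
    """Return (0-based start, matched text) for tandem Gly-X-Y triplets."""
    pattern = re.compile(r"(?:G..){%d,}" % min_units)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]

# Hypothetical sequences, not real proteins.
runs = find_homopolymer_runs("MAQQQQQQQQLP")
triplets = find_gxy_repeats("AAGPPGAPGEKGSPTT")
print(runs)
print(triplets)
```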
Repeats, example 2
Prediction of Repeats
• Repsim (a database of simple repeats)
• Rep (searches a protein sequence for repeats)
• RADAR (Rapid Automatic Detection and Alignment of Repeats in protein sequences)
• REPRO (de novo repeat detection in protein sequences)
• Other?
Definition of protein domains
• A group of residues with high contact density: the number of contacts within domains is higher than the number of contacts between domains.
• A stable unit of protein structure that can fold autonomously
• A rigid body linked to other domains by flexible linkers
• A portion of the protein that can be active on its own if you remove it from the rest of the protein.
Protein Domains
• Domains can be 25 to 500 residues long; most are less than 200 residues
• The average protein contains 2 or 3 domains
• The total number of different types of domains is ~1000-3000
• The same or similar domains are found in different proteins. “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977). “Nature is smart but lazy”
• Usually, each domain plays a specific role in the function of the protein.
Linkers
Domain linkers link the protein domains together and have been found to contain an amino acid signature that is distinct from that of the structurally compact domains.
Average linker size: 8-9 amino acids.
Linkers are flexible and susceptible to protease attack.
Protein Domain Databases
Even though the structure of a domain is not always known, it is still possible to define the domain boundaries from sequence alone.
Many of the common domains have already been defined in domain databases.
Advantages:
• Pre-annotated domains
• Easy interpretation of domain structure
Problem:
• Not trivial to define domain boundaries unambiguously
Protein Domains
http://ip30.eti.uva.nl/ember-demo/ch3
Domain databases (2)
Database          Generation   #entries
Pfam-A            manual       7503 families
Pfam-B            automatic    >140,000 families
PRINTS            manual       11,170 motifs
PROSITE profiles  manual       577 profiles
Blocks            automatic    28,337 blocks, 5733 groups
SMART             manual       667 HMMs
ProDom            automatic    501,917 domain families
PRINTS database
• Most protein families are characterised not by one, but by several conserved motifs
• Fingerprints are groups of conserved motifs excised from sequence alignments
• Taken together, they provide diagnostic family signatures. They are the basis of the PRINTS database, and are stored in the form of aligned motifs
• Input about protein families is done manually
• True members match all elements of the fingerprint in order; subfamily members may match only part of the fingerprint
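The "all motifs, in order" idea can be illustrated with a strongly simplified sketch. PRINTS stores aligned motifs and scores them, whereas the code below treats each motif as a plain regular expression; the function name, motifs and sequence are all invented for illustration.

```python
import re

def match_fingerprint(sequence, motifs):
    """Search for the motifs left to right, each after the previous hit.

    Returns one entry per motif: the 0-based start of its hit, or None
    if the motif was not found downstream of the previous one.
    """
    positions, offset = [], 0
    for motif in motifs:
        m = re.compile(motif).search(sequence, offset)
        if m:
            positions.append(m.start())
            offset = m.end()          # the next motif must come after this hit
        else:
            positions.append(None)    # partial match: skip this element
    return positions

# Hypothetical sequence and fingerprints.
sequence = "AAHCGGSLIAADSGGPLVC"
full = match_fingerprint(sequence, ["HC", "DSGG", "PLV"])
partial = match_fingerprint(sequence, ["HC", "WWW", "PLV"])
print(full)
print(partial)
```

A result without any None corresponds to a true family member in the sense used above; a result with some None entries corresponds to a partial, subfamily-like match.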
PRINTS database
http://ip30.eti.uva.nl/ember-demo/ch3
PRINTS
ProDom: The Protein Domain Database
• ProDom is a comprehensive set of automatically generated protein domain families
• Each entry provides a multiple sequence alignment of homologous domains and a family consensus sequence.
• Current ProDom release: ProDom 2004.1, June 2004, 501,917 domain families
Pfam (Protein families) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
For each family in Pfam you can:
• Look at multiple alignments
• View the domain organisation of proteins
• Examine species distribution
• Follow links to other databases
• View known protein structures
Pfam
Two distinct parts:
- Pfam-A: manually curated entries, 7503 families
- Pfam-B: automatically generated clusters, >140,000 (not covered by Pfam-A)
New: iPfam is a resource that describes domain-domain interactions that are observed in known structures
SMART - Simple Modular Architecture Research Tool
Domain families found in: 1) signalling 2) nuclear 3) extracellular 4) other
Current version 5.0. Number of SMART HMMs: 669
You can use SMART in two different modes: normal or genomic.
Bacteriorhodopsin; Human serine protease
Limitations of domain databases
• Patterns are not present for all families of proteins
• The multiple sequence alignment used to define patterns could be inaccurate because it was made automatically
• A low number of sequences from different species could result in inaccurate patterns
Integrating Pattern databases
InterPro - Integrated Documentation Resource of Protein Families, Domains and Functional Sites.
InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
The aim is to provide a one-stop shop for protein family diagnostics.
InterPro Member Databases
• PROSITE (regular expressions and profiles)
• Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY (hidden Markov models, HMMs)
• PRINTS (groups of aligned, un-weighted motifs)
• ProDom (uses cluster analysis to group sequences)
Release 12.0 contains 12,542 entries.
Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site
Summary
• Many different protein signature databases exist (from small patterns to alignments to complex HMMs)
• The databases have different strengths and weaknesses. Some databases can be better for your sequence than others
• Therefore: best to combine methods, preferably in an integrated database
• The quality of a database/server is best tested with a sequence you know very well
• Always do control experiments: never trust a server