Nucleotide and Protein sequence and structure databases

The different types of databases in bioinformatics
Organisation:
• Flat files
• Relational databases
• Object-oriented databases
Availability:
• Publicly available, no restriction
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Commercial
Curators:
• Large, public institution (EMBL, NCBI)
• Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR, …)
• Academic group or scientists
• Commercial company

Identifiers and Accession numbers
• Identifier: a string of letters and digits that names the sequence or structure
  • Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in Swiss-Prot
  • The identifier can change (at the curator's discretion)
• Accession number: a string of letters and digits that uniquely identifies an entry in its database
  • The accession number for TPIS_CHICK in Swiss-Prot is P00940
  • An accession number should not change
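The difference between the two kinds of label can be illustrated with a small sketch. It assumes UniProtKB-style accession numbers (the pattern below follows the regular expression given in the UniProt documentation) and a simplified, assumed pattern for Swiss-Prot identifiers of the form NAME_SPECIES; the function name classify_label is hypothetical.

```python
import re

# Accession pattern as documented by UniProt (valid for UniProtKB only).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumption: a simplified pattern for Swiss-Prot identifiers such as TPIS_CHICK
# (up to 5 characters, an underscore, up to 5 characters).
IDENTIFIER_RE = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def classify_label(label: str) -> str:
    """Return 'accession', 'identifier', or 'unknown' for a database label."""
    if ACCESSION_RE.fullmatch(label):
        return "accession"
    if IDENTIFIER_RE.fullmatch(label):
        return "identifier"
    return "unknown"
```

For example, classify_label("P00940") gives "accession", while classify_label("TPIS_CHICK") gives "identifier". Other databases use other formats, so the patterns would need to be adapted.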

Nucleotide sequence databases
EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases:
• EMBL: www.ebi.ac.uk/embl/
• GenBank: www.ncbi.nlm.nih.gov/Genbank/
• DDBJ: www.ddbj.nig.ac.jp

GenBank
• An annotated collection of all publicly available nucleotide and protein sequences
• Set up in 1979 at LANL (Los Alamos)
• Maintained since 1992 by NCBI (Bethesda)
• http://www.ncbi.nlm.nih.gov

EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available nucleotide and protein sequences
• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg
• Maintained since 1994 by EBI (Cambridge)
• http://www.ebi.ac.uk/embl.html

http://www3.ebi.ac.uk/Services/DBStats/

DDBJ: DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and protein sequences
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima
• Still maintained at this institute by a team led by Takashi Gojobori
• http://www.ddbj.nig.ac.jp

Sequence submission
• Data come mainly from direct submissions by the authors
• Submissions are made through the Internet:
  • Web forms
  • Email
• Sequences are shared/exchanged between the 3 centers on a daily basis:
  • The sequence content of the banks is identical

Derived databases
• CUTG: codon usage tabulated from GenBank
  http://www.kazusa.or.jp/codon/
• Genetic Codes: deviations from the standard genetic code in various organisms and organelles
  http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices: organism-specific databases of EST and gene sequences
  http://www.tigr.org/tdb/tgi.shtml
• UniGene: unified clusters of ESTs and full-length mRNA sequences
  http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP: alternative spliced isoforms
  http://www.bioinformatics.ucla.edu/ASAP
• Intronerator: introns and alternative splicing in C. elegans and C. briggsae
  http://www.cse.ucsc.edu/~kent/intronerator/

Sequence Retrieval Tools
• Various tools to get sequences of interest from databases:
  • Entrez at NCBI: http://www.ncbi.nlm.nih.gov/Entrez
  • SRS for EMBL and other DBs: http://srs.ebi.ac.uk
  • Fetch in the GCG package
  • Seqret in EMBOSS
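Retrieval through Entrez can also be scripted. The sketch below is an illustration only: it builds a request URL for the NCBI E-utilities EFetch service, which is not mentioned on the slide and is assumed here as the programmatic route into Entrez. The helper name build_efetch_url and its default arguments are hypothetical, and the accession is the one from the identifier example earlier in the slides.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities EFetch endpoint is used for retrieval.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "protein", rettype: str = "fasta") -> str:
    """Build an EFetch URL that requests one record as plain text."""
    params = {
        "db": db,             # Entrez database to query, e.g. "protein" or "nuccore"
        "id": accession,      # accession number of the record
        "rettype": rettype,   # record format, e.g. "fasta"
        "retmode": "text",    # plain-text response
    }
    return f"{EFETCH_URL}?{urlencode(params)}"
```

Calling build_efetch_url("P00940") returns a URL that can be opened in a browser or passed to any HTTP client. The function only constructs the URL, so it can be tried without network access.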

Protein Databases
• General Sequence databases
• Protein properties
• Protein localization and targeting
• Protein sequence motifs and active sites
• Protein domain databases; protein classification
• Databases of individual protein families

NCBI Protein database
http://www.ncbi.nlm.nih.gov/protein
• The NCBI Entrez Protein database
• Sequences from: Swiss-Prot, the Protein Information Resource, the Protein Research Foundation, the Protein Data Bank, and translations from annotated coding regions in the GenBank and RefSeq databases.
• Protein sequence records in Entrez have links to pre-computed protein BLAST alignments, protein structures, conserved protein domains, nucleotide sequences, genomes, and genes.

Swiss-Prot
• The Swiss-Prot protein knowledgebase is a curated protein sequence database established in 1986.
• It provides a high level of annotation (such as the description of protein function, domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.
• It is now part of the UniProt Knowledgebase (a part of UniProt), a "one-stop shop" that allows easy access to all publicly available protein sequence annotation.

UniProt (http://www.ebi.uniprot.org/index.shtml)
• The Swiss-Prot, TrEMBL, and PIR protein database activities have united to form the Universal Protein Resource (UniProt):
  • UniProt Knowledgebase (UniProtKB): curated sequence information and annotations, linked to other databases.
  • UniProt Reference Clusters (UniRef): removes sequence redundancy by merging sequences that are 100%, 90%, and 50% identical; no annotations; linked to Knowledgebase and UniParc records.
  • UniProt Archive (UniParc): history of sequences; no annotation; linked to source records.

Trivia
• The shortest sequence is GWA_SEPOF (P83570): 2 amino acids, a neuropeptide from cuttlefish.
• The longest sequence is TITIN_MOUSE (A2ASS6): 35213 amino acids; involved in the assembly and functioning of vertebrate striated muscles; defects cause myopathies.

http://www.expasy.org/sprot

• General Sequence databases

General protein sequence databases
Protein sequence database | Source | Properties worth mentioning | URL
EXProt | Proteins with experimentally verified function | Non-redundant | http://www.cmbi.kun.nl/EXProt/
MIPS | Proteins from genome sequencing projects | Manually curated | http://mips.gsf.de/
NCBI Protein database | Multiple sources | BLAST results, structures | http://www.ncbi.nlm.nih.gov/entrez
PA-GOSUB | Protein sequences from model organisms | GO assignment and subcellular localization | http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB/
PIR | Multiple sources | Annotated sequences; now merged with UniProt | (none given)
PRF | Sequences; sources include literature search | Includes sequences not found in EMBL, GenBank, and Swiss-Prot; also includes synthetic proteins and peptides | http://www.prf.or.jp/en
RefSeq | Multiple sources | Combination of manual and automated methods | http://www.ncbi.nlm.nih.gov/RefSeq/
Swiss-Prot | Multiple sources | High level of annotation and minimal level of redundancy | http://www.expasy.org/sprot

DBs based on Protein properties
• AAindex: a database of amino acid indices and amino acid mutation matrices
• Cybase: cyclic proteins
• dbPTM: protein post-translational modification (PTM) information
• iProLINK: Integrated Protein Literature, INformation and Knowledge
• PFD: Protein Folding Database
• PINT: Protein-protein Interactions Thermodynamic Database
• PPD: Protein pKa Database
• ProTherm: thermodynamic database for proteins and mutants
• REFOLD: data related to refolding experiments

Protein localization and targeting
• DBSubLoc - Database of protein Subcellular Localization
• LOCATE: manually curated, immunofluorescence-based assay data
• MitoNuc: database of nuclear encoded mitochondrial proteins in Metazoa
• NESbase: leucine-rich nuclear export signals
• NLSdb: database of nuclear localization signals
• NMPdb - Nuclear matrix associated proteins database
• NOPdb - Nucleolar Proteome Database
• NPD - Nuclear Protein Database: results from MS
• Nuclear Receptor Resource
• NUREBASE: nuclear hormone receptors
• NURSA: nuclear receptors
• OGRe - Organellar Genome Retrieval: mitochondrial genomes
• PSORTdb: protein subcellular localizations (ePSORT, cPSORT)
• Secreted Protein Database: human, mouse, and rat
• THGS - Transmembrane Helices in Genome Sequences

Protein sequence motifs and active sites
• ASC - Active Sequence Collection
• Blocks
• CoC
• COMe - Co-Ordination of Metals etc.
• CoPS
• CSA - Catalytic Site Atlas
• eBLOCKS
• eF-site - Electrostatic surface of Functional site
• eMOTIF
• InterPro
• Metalloprotein Site Database
• O-GLYCBASE
• PDBSite
• Phospho.ELM Base
• PRINTS
• PROMISE
• ProRule
• PROSITE
• ProTeus
• SitesBase

Protein domain databases; protein classification
• ADDA - Automatic Domain Decomposition Algorithm
• BAliBASE
• BIOZON
• CDD
• CluSTr - Clusters of Swiss-Prot and TrEMBL proteins
• COG - Clusters of Orthologous Groups of proteins
• FunShift
• FusionDB
• Hits
• HSSP
• InterDom
• InterPro: PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIR superfamily
• iProClass
• MulPSSM
• PALI
• PANDIT
• Pfam
• PIRSF
• ProDom: SP, TrEMBL
• ProtoMap
• ProtoNet
• SBASE
• SIMAP
• SMART - Simple Modular Architecture Research Tool
• SUPFAM
• TCDB
• TIGRFAMs: HMM, GO annotations, MSA
• ProtRepeatsDB

Databases of individual protein families
• AARSDB
• ABCdb
• ARAMEMNON
• BacTregulators (formerly AraC/XylS database)
• CSDBase - Cold Shock Domain database
• DCCP - Database of Copper-Chelating Proteins
• DExH/D Family Database
• DSD
• Endogenous GPCR List
• EROP-Moscow
• ESTHER
• FUNPEP
• GPCRDB
• gpDB - G-protein database
• Histone Database
• HIV RT and Protease Sequence Database
• Homeobox Page
• Homeodomain Resource
• InBase
• KinG - Kinases in Genomes
• Knottins
• LGICdb
• Lipase Engineering Database
• Lipid MAPS
• LOX-DB
• MEROPS
• Nuclear Receptor Resource
• NucleaRDB
• NUREBASE
• NURSA
• Olfactory Receptor Database
• Peptaibol
• PHYTOPROT
• PLANT-PIs
• PlantsP/PlantsT
• PLPMDB
• ProLysED - Prokaryotic Lysis Enzymes Database
• Prolysis
• Protein kinase resource
• REBASE
• Ribonuclease P Database
• RNRdb
• RPG - Ribosomal Protein Gene database
• RTKdb - Receptor Tyrosine Kinase database
• SDAP
• SENTRA
• SEVENS
• SRPDB
• TransportDB
• VKCDB - Voltage-gated K+ Channel Database
• Wnt

Protein Data Bank (PDB)
• Important in solving real problems in molecular biology
• Protein Data Bank (PDB):
  • Established in 1971 at Brookhaven National Laboratory (BNL)
  • Sole international repository of macromolecular structure data
  • Moved to the Research Collaboratory for Structural Bioinformatics (RCSB)
  • http://www.rcsb.org/

Effective use of PDB
• Queries are of three types:
  • PDB ID: as quoted in the paper
  • Search Lite: one or more keywords
  • Search Fields: a detailed query form
• Query results:
  • Structure Explorer: details of the structure
  • Query Result Browser: for multiple structures
• PDB Viewer

PDB: example
HEADER    LYASE(OXO-ACID)   01-OCT-91   12CA   2
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)   12CA   3
COMPND   2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A)   12CA   4
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN   12CA   5
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON   12CA   6
REVDAT   1   15-OCT-92 12CA    0   12CA   7
JRNL        AUTH   S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE   12CA   8
JRNL        TITL   ALTERING THE MOUTH OF A HYDROPHOBIC POCKET.   12CA   9
JRNL        TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE   12CA  10
JRNL        TITL 3 /II$ MUTANTS AT RESIDUE VAL-121   12CA  11
JRNL        REF    J.BIOL.CHEM.   V. 266 17320 1991   12CA  12
JRNL        REFN   ASTM JBCHA3  US ISSN 0021-9258   071   12CA  13
REMARK   1   12CA  14
REMARK   2   12CA  15
REMARK   2 RESOLUTION. 2.4 ANGSTROMS.   12CA  16
REMARK   3   12CA  17
REMARK   3 REFINEMENT.   12CA  18
REMARK   3   PROGRAM                    PROLSQ   12CA  19
REMARK   3   AUTHORS                    HENDRICKSON,KONNERT   12CA  20
REMARK   3   R VALUE                    0.170   12CA  21
REMARK   3   RMSD BOND DISTANCES        0.011 ANGSTROMS   12CA  22
REMARK   3   RMSD BOND ANGLES           1.3 DEGREES   12CA  23
REMARK   4   12CA  24
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL   12CA  25
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND,   12CA  26
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES.   12CA  27
…

PDB (cont.)
SHEET    3   S10 PHE    66  PHE    70 -1  O   ASN    67   N   LEU    60   12CA  68
SHEET    4   S10 TYR    88  TRP    97 -1  O   PHE    93   N   VAL    68   12CA  69
SHEET    5   S10 ALA   116  ASN   124 -1  O   HIS   119   N   HIS    94   12CA  70
SHEET    6   S10 LEU   141  VAL   150 -1  O   LEU   144   N   LEU   120   12CA  71
SHEET    7   S10 VAL   207  LEU   212  1  O   ILE   210   N   GLY   145   12CA  72
SHEET    8   S10 TYR   191  GLY   196 -1  O   TRP   192   N   VAL   211   12CA  73
SHEET    9   S10 LYS   257  ALA   258 -1  O   LYS   257   N   THR   193   12CA  74
SHEET   10   S10 LYS    39  TYR    40  1  O   LYS    39   N   ALA   258   12CA  75
TURN     1 T1  GLN    28  VAL    31     TYPE VIB (CIS-PRO 30)   12CA  76
TURN     2 T2  GLY    81  LEU    84     TYPE II(PRIME) (GLY 82)   12CA  77
TURN     3 T3  ALA   134  GLN   137     TYPE I (GLN 136)   12CA  78
TURN     4 T4  GLN   137  GLY   140     TYPE I (ASP 139)   12CA  79
TURN     5     THR   200  LEU   203     TYPE VIA (CIS-PRO 202)   12CA  80
TURN     6 T6  GLY   233  GLU   236     TYPE II (GLY 235)   12CA  81
CRYST1   42.700   41.700   73.000  90.00 104.60  90.00 P 21          2   12CA  82
ORIGX1      1.000000  0.000000  0.000000        0.00000   12CA  83
ORIGX2      0.000000  1.000000  0.000000        0.00000   12CA  84
ORIGX3      0.000000  0.000000  1.000000        0.00000   12CA  85
SCALE1      0.023419  0.000000  0.006100        0.00000   12CA  86
SCALE2      0.000000  0.023981  0.000000        0.00000   12CA  87
SCALE3      0.000000  0.000000  0.014156        0.00000   12CA  88
ATOM      1  N   TRP     5       8.519  -0.751  10.738  1.00 13.37      12CA  89
ATOM      2  CA  TRP     5       7.743  -1.668  11.585  1.00 13.42      12CA  90
ATOM      3  C   TRP     5       6.786  -2.502  10.667  1.00 13.47      12CA  91
ATOM      4  O   TRP     5       6.422  -2.085   9.607  1.00 13.57      12CA  92
ATOM      5  CB  TRP     5       6.997  -0.917  12.645  1.00 13.34      12CA  93
ATOM      6  CG  TRP     5       5.784  -0.209  12.221  1.00 13.40      12CA  94
ATOM      7  CD1 TRP     5       5.681   1.084  11.797  1.00 13.29      12CA  95
ATOM      8  CD2 TRP     5       4.417  -0.667  12.221  1.00 13.34      12CA  96
ATOM      9  NE1 TRP     5       4.388   1.418  11.515  1.00 13.30      12CA  97
ATOM     10  CE2 TRP     5       3.588   0.375  11.797  1.00 13.35      12CA  98
ATOM     11  CE3 TRP     5       3.837  -1.877  12.645  1.00 13.39      12CA  99
ATOM     12  CZ2 TRP     5       2.216   0.208  11.656  1.00 13.39      12CA 100
ATOM     13  CZ3 TRP     5       2.465  -2.043  12.504  1.00 13.33      12CA 101
ATOM     14  CH2 TRP     5       1.654  -1.001  12.009  1.00 13.34      12CA 102
…
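ATOM records such as those in the example use fixed column positions, so they are read by slicing each line, not by splitting on whitespace. The sketch below illustrates this with the column layout from the PDB file format specification. The function names format_atom and parse_atom are hypothetical, and only a subset of the fields is handled.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, b_factor):
    """Write one ATOM record with the standard fixed-width columns (1-66)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain:1}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b_factor:6.2f}"
    )


def parse_atom(line):
    """Read one ATOM record by column position (0-based slices)."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21].strip(),         # column 22 (blank in old entries)
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


# First atom of the example entry: backbone nitrogen of TRP 5.
record = format_atom(1, " N", "TRP", " ", 5, 8.519, -0.751, 10.738, 1.00, 13.37)
atom = parse_atom(record)
```

Slicing by column matters because adjacent fields in real files can run together without a separating space, for example when a coordinate fills all eight of its columns.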

Databases related to Proteomics
• Contain information obtained by 2D-PAGE: master images of the gels and descriptions of identified proteins
• Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
• Format: composed of image and text files
• Most 2D-PAGE databases are "federated" and use SWISS-PROT as a master index
• Mass Spectrometry (MS) database

Database searching tips
• Look for links to Help or Examples
• Always check for updates
• Check the level of curation
• Try Boolean searches
• Be careful with UK/US spelling differences:
  • leukaemia vs leukemia
  • haemoglobin vs hemoglobin
  • colour vs color
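The Boolean-search and spelling tips can be combined: a query that ORs both spellings of a term avoids missing records written in the other variety of English. The sketch below is an illustration under assumptions. The uppercase AND/OR operators and quoted terms are a common convention, but each database documents its own query syntax; the helper names and the small variant table are hypothetical.

```python
# Assumption: a tiny hand-made table of UK -> US spellings from the slide.
SPELLING_VARIANTS = {
    "leukaemia": "leukemia",
    "haemoglobin": "hemoglobin",
    "colour": "color",
}


def expand_term(term: str) -> str:
    """Return an OR-group covering both spellings, or the quoted term itself."""
    lower = term.lower()
    for uk, us in SPELLING_VARIANTS.items():
        if lower in (uk, us):
            return f'("{uk}" OR "{us}")'
    return f'"{lower}"'


def build_query(terms) -> str:
    """Join the expanded terms with AND to form one Boolean query string."""
    return " AND ".join(expand_term(term) for term in terms)
```

For instance, build_query(["leukemia", "human"]) produces ("leukaemia" OR "leukemia") AND "human", which can then be adapted to the syntax of the database being searched.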

Homework
• Go to UniProtKB
• Search for Lys49 phospholipase A2 from Agkistrodon contortrix laticinctus
• Describe, in paragraph form, the protein using functional information/characteristics from the site. Include a small picture of the protein structure.
• Limit of one page, font size 11, Arial. One-inch margin on all sides, short bond paper.
• Characteristics that can be described:
  • Protein name
  • Sequence
  • Source organism
  • Function
  • Variants
  • History of research
    • Scientists and institutes involved
  • And many more!