Introduction to Databases
INTRODUCTION
DATA
Data is raw, unorganized facts that need to be processed.
Example: Each student's test score is one piece of data.
INFORMATION
When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Example: The average score of a class or of the entire school is information that can be derived from the given data.
DATABASE
A database is an organized collection of data that can be accessed in various ways. Biological databases serve a critical purpose in the collection and organization of data related to biological systems. They provide computational support and a user-friendly interface that let a researcher analyse biological data meaningfully.
A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
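The idea of records, fields and retrieval by search criteria can be sketched in a few lines of Python. This is only an illustration: the `Record` class, the `search` helper and all sample values are hypothetical and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One database entry; each attribute is a field (hypothetical layout)."""
    name: str
    phone: str
    address: str
    date: str


# Made-up sample entries, for illustration only.
records = [
    Record("Ada Example", "555-0100", "1 Sample Street", "2020-01-15"),
    Record("Ben Example", "555-0101", "2 Sample Street", "2021-06-30"),
    Record("Cy Sample", "555-0102", "1 Sample Street", "2021-06-30"),
]


def search(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry for entry in entries
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]


# Retrieval by one criterion, then by a combination of two criteria.
by_address = search(records, address="1 Sample Street")
by_both = search(records, address="1 Sample Street", date="2021-06-30")
print([r.name for r in by_address])
print([r.name for r in by_both])
```

Real database systems add indexing, persistent storage and a query language on top of this basic record-and-field model.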
WHAT ARE BIOLOGICAL DATABASES?
Different classifications of databases
Type of data
• nucleotide sequences
• proteins
• sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways
Different classifications of databases (continued)
Primary or derived databases
• Primary databases: experimental results are entered directly into the database
• Secondary databases: results of analysis of primary databases
• Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Different classifications of databases (continued)
Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics
TYPES OF DATABASES
• Primary Databases
• Secondary Databases
PRIMARY DATABASES
• Contain bio-molecular data in its original form.
• Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never changed.
• Examples: GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT and PIR for protein sequences, and PDB for molecular structures.
GenBank
http://www.ncbi.nlm.nih.gov/genbank/
• Database from NCBI that includes sequences from publicly available resources.
NCBI and Entrez
http://www.ncbi.nlm.nih.gov/
• NCBI hosts one of the largest and most comprehensive collections of databases and belongs to the NIH (National Institutes of Health, USA)
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications and more
GenBank
• An annotated collection of all publicly available nucleotide sequences and their protein translations
• Set up in 1979 at LANL (Los Alamos National Laboratory)
• Maintained since 1992 by NCBI (Bethesda)
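Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not contact the network; the helper name `build_esearch_url` and the example search term are assumptions made for illustration, and the current E-utilities documentation should be checked for usage limits before sending real requests.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities ESearch endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=5):
    """Build (but do not send) an ESearch query URL. Hypothetical helper."""
    params = {
        "db": database,          # Entrez database to search, e.g. "protein"
        "term": term,            # the search expression
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return EUTILS_BASE + "?" + urlencode(params)


# Example term chosen for illustration only.
url = build_esearch_url("protein", "carbonic anhydrase AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL (for instance with `urllib.request.urlopen`) would return a list of matching record identifiers, which can then be used to retrieve the full entries.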
GenBank file format
EMBL
http://www.ebi.ac.uk/
• European Molecular Biology Laboratory
• Nucleic acid database from EBI (European Bioinformatics Institute)
• Produced in collaboration with DDBJ and GenBank
• Search engine: SRS (Sequence Retrieval System)
DDBJ
http://www.ddbj.nig.ac.jp/
• DNA Databank of Japan
• Started in 1986 in collaboration with GenBank
• Produced and maintained at the National Institute of Genetics (NIG)
SWISS-PROT
http://us.expasy.org/sprot-top.html
http://www.ebi.ac.uk/uniprot/
• Annotated sequence database established in 1986
• Consists of sequence entries made up of different line formats
• Similar format to EMBL
PIR
http://pir.georgetown.edu/
• Protein Information Resource
• A division of the National Biomedical Research Foundation (NBRF) in the U.S.
• One can search for entries or do a sequence similarity search at the PIR site.
TrEMBL
http://www.ebi.ac.uk/trembl/
• Translated European Molecular Biology Laboratory
• Computer-annotated supplement of SWISS-PROT.
• Contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
Protein Data Bank (PDB)
http://www.rcsb.org/
• Important in solving real problems in molecular biology
• Established in 1971 at Brookhaven National Laboratory (BNL)
• Sole international repository of macromolecular structure data
• Moved to the Research Collaboratory for Structural Bioinformatics (RCSB)
PDB: example
HEADER    LYASE(OXO-ACID)    01-OCT-91   12CA
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON
REVDAT   1   15-OCT-92 12CA    0
JRNL        AUTH   S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE
JRNL        TITL   ALTERING THE MOUTH OF A HYDROPHOBIC POCKET.
JRNL        TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE
JRNL        TITL 3 /II$ MUTANTS AT RESIDUE VAL-121
JRNL        REF    J.BIOL.CHEM.    V. 266 17320 1991
JRNL        REFN   ASTM JBCHA3  US ISSN 0021-9258    071
REMARK   1
REMARK   3   AUTHORS    HENDRICKSON,KONNERT
REMARK   3   R VALUE    0.170
REMARK   3   RMSD BOND DISTANCES    0.011 ANGSTROMS
REMARK   3   RMSD BOND ANGLES    1.3 DEGREES
REMARK   4
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND,
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES.
………
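A PDB entry is a flat text file in which every line begins with a record name such as HEADER, COMPND or REMARK. As a rough sketch of how such a file can be read, the Python below groups lines by record name and pulls the classification, date and ID code out of the HEADER line. The function names are hypothetical, the sample is a shortened form of entry 12CA, and the whitespace-based splitting is a simplifying assumption: the official format defines fixed column positions.

```python
from collections import defaultdict

# Shortened sample based on the 12CA header lines.
SAMPLE = """\
HEADER    LYASE(OXO-ACID)    01-OCT-91   12CA
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON
REMARK   3   R VALUE    0.170
REMARK   3   RMSD BOND ANGLES    1.3 DEGREES
"""


def group_records(text):
    """Group the lines of a PDB-style file by record name (first six columns)."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        records[line[:6].strip()].append(line[6:].strip())
    return dict(records)


def parse_header(header_text):
    """Split a HEADER line body into classification, date and ID code.

    Assumes the last two whitespace-separated tokens are the date and the ID.
    """
    classification, date, id_code = header_text.rsplit(None, 2)
    return {"classification": classification.strip(), "date": date, "id": id_code}


records = group_records(SAMPLE)
header = parse_header(records["HEADER"][0])
print(header)
print(records["AUTHOR"])
print(len(records["REMARK"]), "REMARK lines")
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-written splitting.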
COMPOSITE DATABASES
• Collection of sequences from various primary databases
• Renders sequence searching highly efficient, as a single search covers multiple resources
• Examples: NRDB (Non-Redundant Database), OWL, MIPSX, SWISS-PROT + TrEMBL
SECONDARY DATABASES
• Contain data derived from the results of analysing primary data
• Manually created or automatically generated
• Contain more relevant and useful information structured to specific requirements
• Examples: PROSITE, PRINTS, BLOCKS, Pfam
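One way to picture how a composite database is assembled is to merge entries from several sources and keep each distinct sequence only once. The sketch below does this with exact-match comparison, which is a simplification and not necessarily the method used by NRDB or OWL. The source names, accession numbers and sequences are all made up.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, entries) pairs, keeping one copy of each sequence.

    Returns a dict mapping each distinct sequence to the list of
    "source:accession" labels under which it was found.
    """
    merged = {}
    for source_name, entries in sources:
        for accession, sequence in entries:
            key = sequence.upper()  # treat upper and lower case as identical
            merged.setdefault(key, []).append(f"{source_name}:{accession}")
    return merged


# Hypothetical primary sources with one shared sequence.
source_a = ("DB_A", [("A001", "MKTAYIAK"), ("A002", "MSHHWGYG")])
source_b = ("DB_B", [("B101", "mshhwgyg"), ("B102", "MALWMRLL")])

composite = merge_non_redundant(source_a, source_b)
for sequence, origins in composite.items():
    print(sequence, origins)
```

Four input entries collapse to three distinct sequences, so a search over the composite set inspects each sequence once while still reporting every source in which it appears.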
PROSITE
http://ca.expasy.org/prosite/
• Database of protein families
• Can be searched using regular expressions, similar to those used in Unix commands
• Families exhibit these patterns, so we can search over families
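PROSITE patterns use their own notation, which maps closely onto ordinary regular expressions: `x` is any residue, `[ST]` is a set of allowed residues, `{P}` is a set of forbidden residues, and `(n)` or `(n,m)` gives a repeat count. The sketch below converts the basic elements of that notation into a Python regex. The function `prosite_to_regex` and the test sequence are hypothetical, and the less common parts of the syntax are not handled. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")  # '<' pins to the N-terminus
        anchored_end = element.endswith(">")      # '>' pins to the C-terminus
        element = element.strip("<>")

        repeat = ""
        if element.endswith(")"):                 # e.g. x(3) or x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"

        if element == "x":
            core = "."                            # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"     # any residue except these
        else:
            core = element                        # one residue or an [ABC] set

        regex_parts.append(
            ("^" if anchored_start else "") + core + repeat + ("$" if anchored_end else "")
        )
    return "".join(regex_parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNASTPNGSA"  # made-up sequence for illustration
matches = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex)
print(matches)
```

Note that `re.finditer` reports only non-overlapping matches, so overlapping sites would need a lookahead-based search.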
BLOCKS Motifs/blocks are created by automatically detecting the most conserved regions of each protein family.
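To give a feel for what "most conserved regions" means, the sketch below scans a small multiple alignment and reports runs of columns in which every sequence carries the same residue. This is a deliberately strict simplification and not the procedure actually used to build the BLOCKS database; the function name, the `min_length` threshold and the aligned sequences are all hypothetical.

```python
def conserved_blocks(alignment, min_length=3):
    """Find ungapped runs of fully identical columns in an alignment.

    alignment: list of equal-length aligned sequences.
    Returns a list of (start_index, block_sequence) tuples.
    """
    length = len(alignment[0])
    # A column is conserved if all sequences share one non-gap residue.
    conserved = [
        len({seq[i] for seq in alignment}) == 1 and alignment[0][i] != "-"
        for i in range(length)
    ]

    blocks = []
    start = None
    # The extra False at the end closes a run that reaches the last column.
    for i, is_conserved in enumerate(conserved + [False]):
        if is_conserved and start is None:
            start = i
        elif not is_conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, alignment[0][start:i]))
            start = None
    return blocks


# Made-up aligned sequences standing in for one protein family.
alignment = [
    "MKLHGDAWTY",
    "MRLHGDSWTY",
    "AKLHGDAWTY",
]
blocks = conserved_blocks(alignment)
print(blocks)
```

Practical methods tolerate some variation within a column by scoring substitutions, rather than requiring perfect identity as this sketch does.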
PRIMARY VS SECONDARY DATABASES