Biological databases: Collection, storage and maintenance
Biological database: a collection of data that is structured, searchable, updated periodically, and cross-referenced.
• Heterogeneous content, with complex data types (text-based sequences, blobs, images of cells and tissue, 3-D molecular structures, biochemical pathways, model data, scalar and vector fields)
• Hierarchical data organization
• Dynamic nature
• Accessibility
• Quality
The first database was of proteins
• Atlas of Protein Sequence and Structure (1965), edited by Margaret Dayhoff.
• It contained the protein sequences that had been published at that time (the foundation of PIR).
• A yeast tRNA of 77 bases was the first nucleotide sequence in a database.
• The first protein structure database, with 10 entries, was constructed in 1972.
• The first genome in a database, that of Haemophilus influenzae, was published in 1995.
~100 GB
162,886,727 loci, 150,141,354,858 bases, from 162,886,727 sequences as of 15th Feb 2013
Categories of Databases
1. Data Type (Data heterogeneity)
2. Maintainer Status
3. Technical Design
4. Data Source
5. Data Access
6. And/or other parameters
1. Categories of Databases: Data Type
• Taxonomy Database
• Genome Database
• Sequence Database
• Structure Database
• Proteomic Database
• Micro-array Database
• Enzyme Database
• Disease Database
• Pathway Database
• Literature Database
• … Many More
Nucleotide Databases
• dbEST
• dbGSS
• dbSNP
• dbSTS
• Nucleotide
• GenBank
• HomoloGene
• MGC
• PopSet
• Probe
• RefSeq
• TPA
• Trace Archive
• UniGene
• UniSTS
Protein Databases
• 3D Domains
• PROW
• Proteins
• RefSeq
• Protein Clusters
Structure Databases
• Conserved Domains
• Structure (MMDB)
• 3D Domains
Taxonomy Databases
• Taxonomy
Genome Databases
• Cancer Chromosomes
• Genome Project
• COGs
• Genomes
• Gene
Expression Databases
• GEO Profiles
• SAGE
• GEO Datasets
Chemical Databases
• PubChem BioAssay
• PubChem Compound
• PubChem Substance
2. Categories of Databases: Maintainer Status
• NCBI (federal government agency of the USA) (http://www.ncbi.nlm.nih.gov/)
• EBI/EMBL (non-profit academic organization) (http://www.ebi.ac.uk/)
• SIB (quasi-academic non-profit foundation) (http://www.isb-sib.ch)
http://www.ncbi.nlm.nih.gov/
3. Categories of Databases: Technical Design
• Flat file (information stored in text files)
• XML (Extensible Markup Language) (hierarchical, semi-structured model)
• Relational model (highly structured model): tables with rows (tuples or records) and columns (fields), supported by SQL-based RDBMSs such as Oracle and DB2
• Object-oriented database management system
• ASN.1 (Abstract Syntax Notation One)
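To make the difference between the first two designs concrete, here is a minimal Python sketch that reads the same toy entry once from a flat-file record and once from XML. The two-letter tags, the "//" record terminator, the XML element names and the record values are all hypothetical choices made for illustration; they are not the exact format of any particular database.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat-file record: one "TAG   value" pair per line,
# with "//" marking the end of the record.
FLAT_RECORD = """\
ID   123
OS   E. coli
DE   LexA protein
KW   SOS regulon; repressor
SQ   ATGCCGG
//
"""

# The same toy entry as hierarchical, semi-structured XML.
XML_RECORD = """\
<entry accession="123">
  <organism>E. coli</organism>
  <name>LexA protein</name>
  <keywords>
    <keyword>SOS regulon</keyword>
    <keyword>repressor</keyword>
  </keywords>
  <sequence>ATGCCGG</sequence>
</entry>
"""


def parse_flat(text):
    """Read one flat-file record into a dict keyed by its tag."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):  # end of record
            break
        tag, _, value = line.partition(" ")
        record[tag] = value.strip()
    return record


def parse_xml(text):
    """Read the XML entry; nested keywords come back as a real list."""
    root = ET.fromstring(text)
    return {
        "ID": root.get("accession"),
        "OS": root.findtext("organism"),
        "DE": root.findtext("name"),
        "KW": [k.text for k in root.iter("keyword")],
        "SQ": root.findtext("sequence"),
    }


from_flat = parse_flat(FLAT_RECORD)
from_xml = parse_xml(XML_RECORD)
```

In the flat file the multi-valued keyword field is just a string that the reader has to split by convention, whereas the XML hierarchy carries the repetition in its structure.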
• This information is organised in tabular form, as is usual in a relational DB. The number of columns (fields) in a real DB is much larger than in the table below.
• An index of these fields can be built, which allows very fast searching of the DB on one or a few fields simultaneously.
• The information in one DB can be cross-referenced to that in another DB. For instance, DNA, protein and reference DBs have all been cross-referenced, so that moving between them is readily accomplished.
Accession No   Organism     Reference   Name                      Keywords                     Sequence
123            E. coli      Medline 1   LexA protein              SOS regulon, repressor, …    ATGCCGG…
124            H. sapiens   Medline 2   glucocorticoid receptor   transcriptional regulator    CCGATAAC
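The three points above (a table, an index on a field, and a cross-reference to another DB) can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for this example, and the first sequence is stored only as far as the slide shows it.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reference (
    ref_id   INTEGER PRIMARY KEY,
    citation TEXT
);
CREATE TABLE sequence_entry (
    accession INTEGER PRIMARY KEY,
    organism  TEXT,
    name      TEXT,
    keywords  TEXT,
    sequence  TEXT,
    ref_id    INTEGER REFERENCES reference(ref_id)  -- cross-reference
);
-- Index on one field so searches on it avoid a full table scan.
CREATE INDEX idx_entry_organism ON sequence_entry(organism);
""")

conn.executemany(
    "INSERT INTO reference VALUES (?, ?)",
    [(1, "Medline 1"), (2, "Medline 2")],
)
conn.executemany(
    "INSERT INTO sequence_entry VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Sequence is truncated in the slide; only the visible part is stored.
        (123, "E. coli", "LexA protein", "SOS regulon, repressor", "ATGCCGG", 1),
        (124, "H. sapiens", "glucocorticoid receptor",
         "transcriptional regulator", "CCGATAAC", 2),
    ],
)


def find_by_organism(organism):
    """Search on the indexed field and follow the cross-reference."""
    return conn.execute(
        "SELECT e.accession, e.name, r.citation "
        "FROM sequence_entry AS e "
        "JOIN reference AS r ON r.ref_id = e.ref_id "
        "WHERE e.organism = ? "
        "ORDER BY e.accession",
        (organism,),
    ).fetchall()


hits = find_by_organism("E. coli")
```

Keeping the citations in their own table is what lets a sequence entry point at a literature record instead of repeating it, which is the cross-referencing idea from the last bullet.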
Example of object-oriented DB
Comparison
Flat file
• Advantages: fast data retrieval, simple structure, easy programming
• Disadvantages: difficult to process multiple values, adding new data requires reprogramming, slow without the key
Hierarchical
• Advantages: addition and deletion are easy, fast retrieval through higher-level records, multiple associations with like records
• Disadvantages: pointers require large computer storage, the pointer path restricts access, each association requires repetitive data
Relational
• Advantages: easy access, minimal training for users, flexible for unforeseen enquiries, easy modification, physical storage of data can be changed without affecting the relationships
• Disadvantages: sequential access is slow, prone to logical mistakes, the method of storage impacts processing time, new relations require considerable processing
Database Data format Data type GenBank OMIM DNA/RNA seq, Text file/ASN.1 phenotype, Text file/ASN.1 genotype Text, Numeric Text file GDB AceDB Genetic map Relational/MySQL Object oriented Text, Numeric Medline NCBI Literature Seq, str, literature ASN.1 Text, Numeric PDB BLAST ClustalW KEGG Microarray Structure Seq, Analysis Metabolic path Microarray data Oracle Fasta HTML text, binary RDBMS, Excel 3D Image Text, Numeric Images, text
4. Categories of Databases: Data Source
Type 1
• Primary (from experimental sources): nucleic acid sequence, protein structure
• Secondary (from already existing primary databases): genomic (TIGR human gene index), proteomic (Prosite, CATH)
Type 2
• Nucleic acids
• Literature (PubMed)
• Biomacromolecules
• Pathways
DNA Sequence Database
• National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov
• DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp
• European Molecular Biology Laboratory (EMBL) http://www.embl-heidelberg.de
Protein sequence Database
• European Bioinformatics Institute
• Swiss Institute of Bioinformatics
• Georgetown University
International Nucleotide Sequence Database Collaboration (INSDC)
• Data exchanged on an hourly basis
• Mirroring
• Data backup
Protein structure Database http://www.rcsb.org/pdb/index.html
PDB
Secondary database
http://rebase.neb.com/rebase.html
5. Categories of Databases: Data Access
• Publicly available
• Available with copyright
• Browsing but not downloadable
• Academic but not free
• Commercial access with payment
6. Categories of Databases: Others
• Completeness
• Curation (annotation)
ENTREZ
• Databases of different kinds are merged together and become global hubs of knowledge.
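One way to picture the hub idea is that a single query can be pointed at several databases just by changing one parameter. The sketch below only builds search URLs for NCBI's E-utilities `esearch.fcgi` endpoint and makes no network request. E-utilities is not mentioned in these slides; it is brought in here as an assumption about how Entrez is reached programmatically, and the search term and the `esearch_url` helper are illustrative.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of NCBI E-utilities (an addition to the slides, see lead-in).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# The same illustrative term aimed at three different Entrez databases.
TERM = "lexA Escherichia coli"
urls = {db: esearch_url(db, TERM) for db in ("nucleotide", "protein", "pubmed")}

# Round-trip check: the query string decodes back to the original parameters.
decoded = parse_qs(urlparse(urls["protein"]).query)
```

Only the `db` value differs between the three URLs, which is the sense in which one search interface spans sequence, protein and literature records.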
1. Nucleotide Sequence Databases
2. RNA Sequence Databases
3. Protein Sequence Databases
4. Structure Databases
5. Genomics Databases (non-human)
6. Metabolic Enzymes and Pathways; Signaling Pathways
7. Human and other Vertebrate Genomes
8. Human Genes and Diseases
9. Microarray Data and other Gene Expression Databases
10. Proteomics Resources
11. Other Molecular Biology Databases
For a detailed list and full coverage see http://nar.oxfordjournals.org/content/41/D1.toc
NCBI resources
• Databases
• Online analysis tools
Entrez @ http://www.ncbi.nlm.nih.gov/
Sequence Retrieval System (http://srs.ebi.ac.uk)