BIN 503 – BIOLOGICAL DATABASES AND DATA ANALYSIS TOOLS
Spring 2016, Tuesdays, 9:40-12:30
Course Overview
• Instructor: Nurcan Tuncbag, Ph.D. Email: ntuncbag@metu.edu.tr
• Assistant: Mine Yoldas, myoldas@metu.edu.tr
• Office hours: Wed 10-12
• 1 Midterm, 1 Final Project with Presentation
• In-class work and problem assignments
• 1 Journal Club with Student Presentations
#metuBIN503
https://www.facebook.com/groups/175193595896859/
iscbsc-rsg-turkey@googlegroups.com
Week I: The World of Data in Biology
Biological organization
Changes and interactions between molecules within the cell.
Copyright: RIKEN
A small portion of an Escherichia coli cell (David Goodsell, 2000).
© Jane Ades, NHGRI
In the alphabet of our genes there are four letters: A, C, G and T. These letters tell us how to live, how to grow, how to die.
TAGTTCCGTCGCAGCCGGGATTTGGGTCGCGGTTCTTGTGGATCGCTGTGATCGTCACTTGACAATGCAGATCTTCGTGAAGACTCTGACTGGTAAGACCATCACCCTCGA
GGTTGAGCCCAGTGACACCATCGAGAATGTCAAGGCAAAGATCCAAGATAAGGCATCCCTCCTGACCAGCAGAGGCTGATCTTTGCTGGAAAACAGCTGGAAGATGGGCGC
ACCCTGTCTGACTACAACATCCAGAAAGAGTCCACCCTGCACCTGGTGCTCCGTCTCAGAGGTGGGATGCAATCTTCGTGAAGACACTGGCAAGACCATCACCCTTGAGGT
GGAGCCCAGTGACACCATCGAGAACGTCAAAGATCCAGGACAAGGCATTCCTCCTGACCAGCAGAGGTTGATCTTTGCCGGAAAGCAGCTGGAAGATGGGCGCACC
CTGTCTGACTACAACATCCAGAAAGAGTCTACCCTGCACCTGGTGCTCCGTCTCAGAGGTGGGATGCAGATCTTCGTGAAGACCCTGACTGGTAAGACCATCACCCTCGAGGTGG
AGCCCAGTGACACCATCGAGAATGTCAAGGCAAAGATCCAAGATAAGGCATTCCTCCTGATCAGCAGAGGTTGATCTTTGCCGGAAAACAGCTGGAAGATGGTCGTACCCT
GTCTGACTACAACATCCAGAAAGAGTCCACCTTGCACCTGGTACTCCGTCTCAGAGGTGGGATGCAAATCTTCGTGAAGACACTGGCAAGACCATCACCCTTGAGGTCGAG
CCCAGTGACACTATCGAGAACGTCAAAGATCCAAGACAAGGCATTCCTCCTGACCAGCAGAGGTTGATCTTTGCCGGAAAGCAGCTGGAAGATGGGCGCACCCTGT
CTGACTACAACATCCAGAAAGAGTCTACCCTGCACCTGGTGCTCCGTCTCAGAGGTGGGATGCAGATCTTCGTGAAGAC
The 4-letter alphabet is translated to a 20-letter alphabet of proteins:
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
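The translation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first ATG in the sequence above. The names `CODON_TABLE` and `translate` are chosen for this example and are not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame, uppercase DNA string up to the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Coding fragment copied from the first line of the sequence above,
# starting at its first ATG.
fragment = "ATGCAGATCTTCGTGAAGACTCTGACTGGTAAGACCATCACCCTC"
print(translate(fragment))  # MQIFVKTLTGKTITL
```

The 15 residues printed here match the start of the protein sequence shown on the slide.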
OMICS Pyramid (top to bottom)
• Interactome
• Proteome
• Transcriptome
• Genome
Nodes = Proteins/Genes/other entities
Edges = Interactions
These and many other omic entities interact as a network called the interactome.
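One common way to hold such a network in memory is an adjacency map from each node to the set of its interaction partners. The sketch below assumes undirected interactions. The protein labels P1 to P4 and the function name `build_network` are hypothetical placeholders, not real data.

```python
from collections import defaultdict

# Hypothetical interaction pairs, used only to illustrate the data structure.
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]


def build_network(edges):
    """Return an undirected network as {node: set of neighbouring nodes}."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return network


network = build_network(interactions)

# Degree = number of interaction partners of each node.
degree = {node: len(neighbours) for node, neighbours in network.items()}
print(degree)
```

Storing neighbours in sets keeps duplicate edges from inflating the degree and makes "do X and Y interact?" a constant-time lookup.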
The first high-level programming language, FORTRAN (formula translation), was introduced by the International Business Machines (IBM) corporation in 1957.
Frederick Sanger determined the first complete amino acid sequence of a protein (insulin) in 1955. RNA and DNA sequencing followed later.
Margaret Dayhoff pioneered methods of sequence alignment and molecular evolution.
• One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965.
During this period, the three-dimensional structures of proteins were studied, and the well-known Protein Data Bank was developed as the first protein structure database, with only 10 entries in 1972.
The Human Genome Project • A milestone in data accumulation
George Church, Harvard University:
“My brain is only a gigabyte or so; it's worse than my cell phone.”
“DNA does not compute well, but it does store well.”
Francis deSouza, President of Illumina:
“A gram of DNA is over a petabyte of data.”
1 petabyte = 1,000 terabytes
The rapid accumulation of sequencing data
George Church, Harvard University:
“New genomic technologies will influence our lives at least as much as electronics.”
• The cost of a smartphone: ~$1,000.
• The cost of sequencing the full genome of an individual or patient: roughly $1,000.
Growth of the total number of protein structures in Protein Data Bank © E. Levy
Technology development: large-scale computer networks, immense databases, supercomputers
Experimental Datasets: Metabolites (mass spectrometry)
The field has been revolutionized from Reductionism to Systems Science
Reductionism
• Single molecule
• Additive
• Disease-driven
Systems Science
• Interrelationships
• Synergistic
• Personal treatment
Organisms are clearly much more than the sum of their parts.
Less Data -> More Data
Making sense of these data is a big challenge, and a big opportunity for computational scientists to step in and make major contributions to biology and medicine.
Overview
What is a Database?
• A collection of data that is
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db data
• Also includes the associated tools (software) necessary for db access/query, db updating, db information insertion, db information deletion…
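The properties listed above (structure, an index for searching, and tools for querying, inserting, updating and deleting) can be illustrated with a small in-memory SQLite database. This is a minimal sketch using Python's standard `sqlite3` module. The table layout, the accession numbers and the protein names are made up for the example and do not come from any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured: every record has the same typed fields.
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, db_release INTEGER)"
)
# Searchable: an index speeds up lookups by organism.
conn.execute("CREATE INDEX idx_organism ON protein (organism)")

# Insertion (hypothetical records).
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("X0001", "example protein 1", "Homo sapiens", 1),
        ("X0002", "example protein 2", "Homo sapiens", 1),
        ("X0003", "example protein 3", "Escherichia coli", 1),
    ],
)

# Query.
human_rows = conn.execute(
    "SELECT accession FROM protein WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human_rows)

# Update (a new release of one entry) and deletion.
conn.execute("UPDATE protein SET db_release = 2 WHERE accession = ?", ("X0001",))
conn.execute("DELETE FROM protein WHERE accession = ?", ("X0003",))

remaining = conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
print(remaining)
```

Cross-referencing would typically be modelled with an additional table that maps each accession to identifiers in other databases.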
What can databases do?
• Make biological data available…
– … to scientists.
– … in computer-readable form.
• Analysis (computer-based)
• Handle and share large volumes of data
– Interface for computer-based systems (algorithms, web interfaces)
– Store data
– Defined formats
• Automated storage and retrieval of experimental data
• Link knowledge with external resources
• The computer became the storage medium of choice as soon as it was accessible to ordinary scientists.
• Databases were distributed
– on tape,
– on various kinds of disks,
– on the World Wide Web (WWW, based on the Internet protocol HTTP) since the beginning of the 1990s.
1 Primary nucleotide sequence databases
2 Meta databases
3 Genome databases
4 Protein sequence databases
5 Proteomics databases
6 Protein structure databases
7 Protein model databases
8 RNA databases
9 Carbohydrate structure databases
10 Protein-protein and other molecular interactions
11 Signal transduction pathway databases
12 Metabolic pathway and protein function databases
13 Microarray databases
14 Exosomal databases
15 Mathematical model databases
16 PCR and quantitative PCR primer databases
17 Phenotype databases
18 Specialized databases
19 Taxonomic databases
20 Wiki-style databases
21 Metabolomic databases
• Nucleic Acids Research (NAR) publishes a database issue every year
• Database journals
– Database: The Journal of Biological Databases and Curation
• Web search
Reminder
• April 26: Journal Club on databases published in the NAR Database Issue.
• We will have a session called “Speed Science” in which each student has 10 minutes (strictly enforced) to describe the selected paper to the group.
• Submit a summary of what you have presented (max 250 words). Include only the important points.
• There will be questions in the Midterm on these submitted abstracts.
The growth in the number of database publications per year. Each bar shows the number of research articles with the keyword ‘database’ appearing in the article title in the given year.
http://www.biodbs.info/
Literature Databases
• PubMed comprises more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books.
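PubMed can also be searched programmatically through NCBI's E-utilities. The sketch below only builds an ESearch request URL and sends nothing over the network. The helper name `pubmed_search_url` and the search term are assumptions made for this example. NCBI's usage guidelines (rate limits, API keys) are not covered here and should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that asks PubMed for IDs of records matching `term`."""
    params = {
        "db": "pubmed",          # database to search
        "term": term,            # query string
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # response format
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("biological databases")
print(url)
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen(url)` to obtain the list of matching PubMed IDs.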
Examples of biological databases
Genome Databases
• Ensembl provides a bioinformatics framework to organise biology around the sequences of large genomes.
• UCSC Genome Browser contains the reference sequence and working draft assemblies for a large collection of genomes.
Nucleotide sequence databases
• DNA Data Bank of Japan (DDBJ)
• National Center for Biotechnology Information (NCBI), USA
• European Nucleotide Archive (ENA)
Nucleotide sequence databases: Human Numb
Genomic and Epigenomic Data Sources
GWAS
Protein sequence databases: UniProt
• Its mission is to provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
• It is a protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.
Protein sequence databases
Databases for Gene Expression
A microRNA (miRNA) is a small non-coding RNA molecule (containing about 22 nucleotides) that functions in RNA silencing and post-transcriptional regulation of gene expression.
• A database of published miRNA sequences and annotation
• A tool that predicts biological targets of miRNAs
Protein-Protein Interaction Databases
Functional Annotations
Enrichment Analysis
Pathway Databases
Pathway Databases: BCR Signaling Pathway
Examples of data analysis tools
Analysis and Visualization Tools