WELCOME TO BCB 4003/CS 4803 BCB 503/CS 583
BIOLOGICAL AND BIOMEDICAL DATABASE MINING
Prof. Carolina Ruiz
Computer Science Department
Bioinformatics and Computational Biology Program
WPI
WHY THIS COURSE?
Central dogma: DNA → (transcription) → RNA → (translation) → Protein → Biological Function
• Genome (1980’s-1990’s): sequencing, sequence analysis, …
• Transcriptome (mid 1990’s-2000’s): gene expression, DNA/RNA microarrays
• Proteome (1990’s-2000’s): protein structure, protein-protein interactions, protein pathways
• Biological Function (2000’s)
• Biological and Biomedical Research Problems and Applications (2000’s)
  • Organism-organism interactions
  • Organism-environment interactions
  • Genome-wide association studies
  • Cancer therapies
  • Drug development
THIS ALL HAS GENERATED …
• Data
  • Massive datasets and databases of sequence, gene expression, protein, biological function, clinical information, …
• Text
  • Annotations in data sources, abstracts (e.g., Medline), research articles, medical literature (e.g., PubMed, NCBI Bookshelf, Google Scholar), patient records, …
• Ontologies
  • Descriptions of terms and their relationships (e.g., Gene Ontology)
CURRENT CHALLENGES
• To make sense of and put to use all this information.
• How? Computational tools and techniques are needed to help humans in integrating, summarizing, understanding, and taking advantage of accumulated information
  • Data mining
  • Text mining
  • Data and text mining together
WHAT IS DATA [TEXT] MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [text]” (Fayyad et al., 1996)
Raw Data [Text] → Mining → Patterns
• Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54, Fall 1996.
DATA MINING METHODS IN BIOINFORMATICS
• Clustering
• Sequence Mining
• Bayesian Methods
• Expectation Maximization (EM)
• Gibbs Sampling
• Hidden Markov Models
• Kernel methods
• Support Vector Machines
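As an illustration of the first method listed above, here is a minimal k-means clustering sketch in plain Python. It is not taken from the course materials: the function name `kmeans`, the toy two-condition "expression profiles", and the choice to seed the centroids with the first k points are all assumptions made for the example. Real analyses would normally rely on an established library and a better initialization.

```python
def kmeans(points, k, iterations=20):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption for reproducibility: the first k points seed the centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = []
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            nearest = dists.index(min(dists))
            labels.append(nearest)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression levels of five genes under two conditions.
profiles = [(0.1, 0.2), (5.0, 5.1), (0.0, 0.3), (5.2, 4.9), (0.2, 0.1)]
labels, centroids = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Genes with similar profiles end up sharing a label, which is the kind of grouping clustering is used for on gene expression data.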
TEXT MINING IN BIOINFORMATICS
• Document indexing
• Information retrieval
• Lexical analysis (sentence tokenization, word tokenization, stemming, stop word removal)
• Semantic analysis
• Query processing
• Text classification
• Text clustering
• Text summarization
• (Semi-)automatic curation of literature repositories
• Knowledge discovery from text, hypothesis generation
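The lexical analysis steps named above (sentence tokenization, word tokenization, stop word removal, stemming) can be sketched with the Python standard library alone. Everything here is illustrative: the function names, the tiny stop-word list, and the suffix-stripping rule are assumptions, and `crude_stem` is only a stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "by"}


def split_sentences(text):
    """Naive sentence tokenization: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Word tokenization: lowercase runs of letters and digits, hyphens allowed inside."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def crude_stem(word):
    """Strip one common English suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexical_analysis(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [w for w in split_words(sentence) if w not in STOP_WORDS]
        result.append([crude_stem(w) for w in tokens])
    return result


print(lexical_analysis("BRCA1 is expressed in tumors. The protein binds DNA."))
# [['brca1', 'express', 'tumor'], ['protein', 'bind', 'dna']]
```

Biomedical text breaks naive rules quickly (abbreviations such as "e.g." end with a period, and gene names mix letters and digits), which is one reason dedicated tokenizers are used in practice.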
DATA/TEXT MINING PROCESS (KDD)
[Process diagram; stages as recoverable from the slide]
• information sources → data (quantitative, qualitative)
• data management: databases, data warehouses
• “pre”processing: noisy/missing data, feature selection → cleaned data
• data analysis / data mining: analytical, statistical, visual → models
• model/pattern evaluation → “good” model
• model/patterns deployment: prediction, decision support, applied to new data
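The stages of the process above can be traced in a few lines of Python. This is a sketch under stated assumptions, not the course's method: mean imputation stands in for preprocessing, a nearest-centroid classifier stands in for the mining step, and all function names, measurements, and labels are hypothetical. Evaluation is done on the training rows only to keep the example short; a real evaluation would use held-out data.

```python
from statistics import mean


def impute_missing(rows):
    """Preprocessing: replace None with the mean of the observed values in that column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, col_means)] for row in rows]


def train_centroids(rows, labels):
    """Mining: build a nearest-centroid model (one mean vector per class)."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Deployment: assign a row to the class with the closest centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows, labels):
    """Evaluation: fraction of rows whose predicted label matches the given one."""
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(rows)


# Hypothetical noisy measurements with missing values (None).
raw = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 8.5]]
labels = ["healthy", "healthy", "tumor", "tumor"]

cleaned = impute_missing(raw)                # "pre"processing -> cleaned data
model = train_centroids(cleaned, labels)     # data mining -> model
print(accuracy(model, cleaned, labels))      # 1.0 (on the training rows)
print(predict(model, [7.5, 9.2]))            # tumor (prediction on new data)
```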
PUTTING IT ALL TOGETHER …
• Data / Text / Information Integration
• Mining over data and text combined
• Visualization
• Other real-world issues
• Developing tools and techniques that are efficient, scalable, and user friendly
INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS
• Biology and Biomedicine
  • Contributes domain knowledge
• Natural Language Processing (AI) / Computational Linguistics
  • Contributes text analysis techniques
• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High Performance Computing
  • Contributes techniques for efficiently handling complexity
• Signal Processing
• Image Processing
• …
QUESTIONS?
* Images in this presentation were downloaded from Google Images