CSE 300 Data Mining Cyberinfrastructures in Biomedical Informatics
Data Mining & Cyberinfrastructures in Biomedical Informatics
Ryan McGivern
CSE 5095
May 1, 2011
Main Concepts
- Data Mining
  - Knowledge Discovery
- Cyberinfrastructures
  - Collaborative Research
Nature of Biomedical Data
- Health care is more than numbers and readings
  - Data can’t replace the subjective sense of disease severity that a physician forms in moments
- Record data in a way that best captures the observation
  - Data representation
  - Precision
Review
- Medical datum
  - Any single observation of a patient
- Knowledge
  - Derived through formal/informal analysis of data
- Information
  - Combine knowledge with data for new information
    - Heuristics and research models
- BMI Data-Knowledge Spectrum
  - What information constitutes the substance of medicine
Nature of Biomedical Data
- Knowledge at one level of abstraction might be considered data at another
- A medical database is a collection of individual patient observations
  - An EHR is in some sense simply a database
- Historical patient data from the EHR system can facilitate the deduction of new knowledge related to health care strategies
Nature of Biomedical Data
- Humans can intuitively decompose information from a unitary view of data
  - But nothing is intuitive to computational systems
- Example
  - Clinical setting
    - BP of 120/80 may suffice to indicate a normal reading
  - Analytical setting
    - Systolic BP = 120 mm Hg
    - Diastolic BP = 80 mm Hg
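The decomposition in the example above can be sketched in a few lines. This is only an illustration: the `BloodPressure` type and `parse_bp` function are hypothetical names, and the input is assumed to be a well-formed "systolic/diastolic" string in mm Hg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BloodPressure:
    """Analytical view of a reading: two separately addressable values."""
    systolic_mmhg: int
    diastolic_mmhg: int


def parse_bp(reading: str) -> BloodPressure:
    """Split clinical shorthand such as '120/80' into its two components."""
    systolic_text, diastolic_text = reading.strip().split("/")
    return BloodPressure(int(systolic_text), int(diastolic_text))


bp = parse_bp("120/80")
print(bp.systolic_mmhg, bp.diastolic_mmhg)
```

Once the two values are stored separately, each can be queried, aggregated, or range-checked on its own, which the unitary string does not allow.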
Nature of Biomedical Data
- Data mining in health is mainly related to clinical research support
  - Clinical Data Repositories (CDRs)
- New knowledge is learned through aggregated information from a large number of patients
  - Can be facilitated by EHRs
- Unfortunately
  - CDRs are generally limited to administrative data sources
  - They rarely store patient charts
Nature of Biomedical Data
- CDRs support clinical research studies
- Retrospective studies
  - Investigate a hypothesis that was not a subject of the study at the time the data were collected
- Prospective studies
  - Clinical hypothesis known in advance
  - Research protocol designed to collect future data
Nature of Biomedical Data
- Knowledge base
  - Facts
  - Heuristics
  - Complex models
  - Semantic linking
    - Conduct case-based problem solving
- Medical data is intrinsically heterogeneous
  - Illusory to conceive of a ‘complete medical dataset’
  - Data is selective, based on treatment
Data Mining in BMI
- Data mining
  - Knowledge discovery technique
  - Sophisticated statistical methods
  - Identify trend patterns hidden in the sheer size of the dataset
- Data warehouse
  - Multiple heterogeneous data sources
    - Organized under a unified schema
  - Single site
  - Facilitate management and decision making
Data Mining in BMI
- A CDR is essentially a data warehouse
- Architecture consists of four tiers
  - External data sources
    - Operational databases, flat files, etc.
  - Data storage layer
    - Unified schema, metadata, data marts
  - OLAP layer
    - Data mining engine
  - Presentation layer
    - GUI
    - Usually web-based
Data Mining in BMI
Figure: Clinical Data Repository
Data Mining in BMI
- Data integration mechanism
  - Extraction
  - Transformation
  - Refresh
  - Scrubbing
- Data marts
  - Subsets of data tailored to a user group
  - Cache resultant datasets
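As a rough sketch of the extraction, transformation, and scrubbing steps listed above (refresh is omitted), the following uses only the standard library. The column names, the sample rows in `RAW`, and the plausibility limits are all invented for illustration and are not taken from any real repository.

```python
import csv
import io

# Hypothetical flat-file export from an operational source.
RAW = "patient,sbp,dbp\n p1 ,120,80\np2,,85\np3,300,90\n"


def extract(text):
    """Extraction: read the flat file into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(row):
    """Transformation: trim whitespace and convert readings to int (None if missing)."""
    def to_int(value):
        value = value.strip()
        return int(value) if value else None

    return {
        "patient": row["patient"].strip(),
        "sbp": to_int(row["sbp"]),
        "dbp": to_int(row["dbp"]),
    }


def scrub(rows, low=40, high=260):
    """Scrubbing: drop rows with missing or implausible values (limits are illustrative)."""
    return [
        r for r in rows
        if all(r[k] is not None and low <= r[k] <= high for k in ("sbp", "dbp"))
    ]


clean = scrub([transform(r) for r in extract(RAW)])
print(clean)
```

Only the first row survives here: the second has a missing systolic value and the third falls outside the assumed plausible range.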
Data Mining in BMI
- Data integration
  - Heterogeneous data under a unified schema
  - Ontologies
    - Link primary data expressions to structured vocabularies
    - Data now available to search and algorithmic processing at different levels of abstraction
  - Clinical domain
    - Notorious for overwhelming presence of natural language text
    - Natural language processing
Data Mining in BMI
- Data integration
  - Cancer Biomedical Informatics Grid (caBIG)
    - Seeks to integrate all cancer research data
    - Standardize the way by which data is acquired, formatted, processed, and stored
      - Whole data ‘life cycle’
  - Translational research
    - No common architecture among vocabularies
    - Therefore difficult to consolidate terms into a single system
Data Mining in BMI
- Communication
  - HL7
    - Communication standard for exchange of all information relevant to health care
    - Focuses on meta-level of data integration within a clinical setting
Data Mining in BMI
- Online Analytical Processing (OLAP) layer
  - Formats aggregated data in a multidimensional way
  - Evaluated and visualized at presentation layer
  - User specifies summary technique
- Data cube
  - Roll-up and drill-down operations
  - Control abstraction level for each data dimension
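Roll-up and drill-down can be illustrated on a tiny in-memory "cube". This is a minimal sketch, not how an OLAP engine is implemented: the fact rows, the dimension names, and the `aggregate` helper are all hypothetical, and the summary technique is fixed to a sum.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, ward, admissions).
DIMENSIONS = ("year", "month", "ward")
FACTS = [
    (2010, 1, "cardiology", 12),
    (2010, 1, "oncology", 7),
    (2010, 2, "cardiology", 9),
    (2011, 1, "oncology", 11),
]


def aggregate(facts, keep):
    """Sum the measure over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


# Roll-up: view admissions at the coarse "year" level.
by_year = aggregate(FACTS, ["year"])
# Drill-down: add the "month" dimension back for finer detail.
by_year_month = aggregate(FACTS, ["year", "month"])

print(by_year)
print(by_year_month)
```

Choosing which dimensions to keep is what controls the abstraction level; keeping fewer dimensions rolls up, keeping more drills down.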
Data Mining in BMI
- Data mining techniques
  - Descriptive methods
    - Mine for relationships among attribute types with as few variables as possible
  - Predictive methods
    - Iterate through attributes and classify data into predefined classes
    - Identify similar classes
  - Other related methods
    - Neural networks
    - Machine learning
- Each provides a way of recognizing data patterns
Data Mining in BMI
- UWV & VCU (2006)
  - Data mining research
  - 667,000 digital records
    - Out-patient & in-patient
    - De-identified
  - HealthMiner® (IBM)
    - CliniMiner®: association analysis
    - THOTH: predictive analysis
- Duke University (1997)
  - Perinatal outcomes
  - 45,922 patient records
    - 215,626 encounters
    - 3,898,887 lab results
    - 217,453 procedures
    - 3,016,313 physical findings
  - SQL queries
    - Average time 3 minutes
    - Longest time 12 minutes
      - 4 million records
Data Mining in BMI
- Challenges in mining biomedical data
  - Non-hypothesis driven approaches
    - Combinatorial explosion
    - Degree of non-reducibility
      - Minimize with sophisticated heuristics
  - High dimensionality
    - Sparse complex relationships
      - Spread thinly across many dimensions
  - Hypotheses
    - Limit inherent bias in traditional clinical data analysis
Data Mining in BMI
- Challenges in warehousing biomedical data
  - IT infrastructure for CDRs
    - Established for clinical trials but separated from EHR systems
  - Data integration
    - Map clinical terminologies to clinical research standards
  - Pseudonymization
    - De-identification is a ‘must’ when EHR data leaves the realm of primary health care
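One common way to pseudonymize is to replace the patient identifier with a keyed hash, so the same patient always maps to the same pseudonym but the mapping cannot be recomputed without the key. The sketch below assumes that approach; the field names, the `DIRECT_IDENTIFIERS` set, and the helper names are hypothetical, and dropping a few fields is not by itself sufficient de-identification for real clinical data.

```python
import hashlib
import hmac

# Hypothetical list of directly identifying fields to drop on export.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def export_record(record: dict, secret_key: bytes) -> dict:
    """Copy a record for research use with the identifier replaced by a pseudonym."""
    exported = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "patient_id"
    }
    exported["pseudonym"] = pseudonymize(record["patient_id"], secret_key)
    return exported


key = b"demo-key-kept-by-the-data-custodian"  # illustrative only
print(export_record({"patient_id": "MRN-001", "name": "Jane Doe", "sbp": 120}, key))
```

Because the pseudonym is stable, records for the same patient can still be linked across exports, which plain removal of the identifier would prevent.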
Data Mining in BMI
- General roadblocks
  - Data sharing
    - Researchers are protective of their data
  - Language/vocabulary changes
    - Due to required detail
    - Bedside vs. laboratory
  - Transdisciplinary research leads to competing standards
Data Mining in BMI
- Advantages of mining biomedical data
  - New health management strategies
    - Relationships among patient observations
    - Understanding of disease progression
  - Undetected drug events
    - Prevalence through larger sample populations
  - Clinical trial cohort selection
    - Identify patient types that will best prove a given hypothesis
Cyberinfrastructures in BMI
- Motivations
  - Computer systems are now more than essential to research
  - Development of complex modeling tools
    - But generally only available to a handful of clinical researchers
  - Integration of data from different disciplines
    - Can require specialized training in mathematics, statistics, and software
    - Ideally want to provide a layer of abstraction that can make this integration transparent to the researcher
Cyberinfrastructures in BMI
- Mission
  - Develop a geographically distributed virtual research community that facilitates
    - Data sharing
      - Data warehousing
    - Computational resource sharing
      - Distributed grid computing
    - Collaboration
      - Research management
      - Research protocol sharing
Cyberinfrastructures in BMI
- Components of a cyberinfrastructure
  - Data infrastructure
    - Series of interconnected repositories
  - Computational infrastructure
    - Registered resource sharing
  - Communication infrastructure
    - Communication amongst architectures
  - Human infrastructure
    - Facilitate communication and collaboration between registered researchers
Cyberinfrastructures in BMI
- Data infrastructure
  - Network of databases
  - Facilitates remote storage, integration, and retrieval of data
  - Databases browsed by web-based front-ends
  - Can be extended to cater to
    - Automatic acquisition
    - Direct submission
  - Allows for pulling of data into local repositories
    - For private or semi-private analyses
Cyberinfrastructures in BMI
- Computational infrastructure
  - Shared access to hardware and software
    - Intensive computation needed for sophisticated analyses
      - e.g., image analysis software
  - Essentially a computing grid
    - Systems separated geographically but clustered over the web
    - Provides a virtual consolidated supercomputing node
    - If a system is idle locally, it is offered as a resource to outsiders
Cyberinfrastructures in BMI
- Communication infrastructure
  - At the low level
    - Require connectivity and acceptable bandwidth between
      - Repositories
      - Computational resources
      - Researcher
  - At the high level
    - Responsible for maintaining syntactic and semantic harmony throughout data
Cyberinfrastructures in BMI
- Communication infrastructure continued
  - Syntax and semantics
    - Suppose analysis involves data from different repositories
    - Syntactic connectivity established through a common format for data organization
    - Semantic connectivity maintains data interoperability by ensuring concepts captured by the data share a common terminology
      - Usually implemented using an ontology
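The distinction above can be made concrete with two repositories that already share a syntax (flat key/value records) but use different local field names. In this sketch a plain dictionary stands in for the ontology; the site names, field names, and shared concept names are all hypothetical.

```python
# Hypothetical local-term -> shared-concept mappings for two repositories.
SITE_A_TERMS = {"sbp": "systolic_blood_pressure", "dbp": "diastolic_blood_pressure"}
SITE_B_TERMS = {"bp_sys": "systolic_blood_pressure", "bp_dia": "diastolic_blood_pressure"}


def harmonize(record, term_map):
    """Rename local field names to shared concept names; reject unmapped terms."""
    unknown = set(record) - set(term_map)
    if unknown:
        raise KeyError(f"no shared concept for: {sorted(unknown)}")
    return {term_map[field]: value for field, value in record.items()}


from_a = harmonize({"sbp": 120, "dbp": 80}, SITE_A_TERMS)
from_b = harmonize({"bp_sys": 118, "bp_dia": 76}, SITE_B_TERMS)

# After harmonization both records use the same terminology and can be pooled.
print(from_a)
print(from_b)
```

A real ontology also records relationships between concepts (for example, that both values are kinds of blood pressure), which a flat lookup table like this does not capture.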
Cyberinfrastructures in BMI
- Human infrastructure
  - Ultimately, must facilitate the sociology of science
  - Everyone curates communal data sets
  - Encourage the sharing of
    - Protocols
    - Analysis algorithms
    - Data sets
  - Similar to CICATS at UConn
    - Research toolkit
Cyberinfrastructures in BMI
- Human infrastructure continued
  - Ideally the researcher should be able to design an experiment at a high level
    - Describe datasets, relationships, etc.
    - Generally a high-level description language
      - Workflow language
  - Infrastructure then manages data retrieval, analysis, and transformation
  - Constructs an environment where researchers can get an in-depth result from a high-level description
Cyberinfrastructures in BMI
- There are many existing cyberinfrastructures
  - Don’t necessarily implement all components
- Most common form is an online database
  - GenBank
  - EMBL
    - European Molecular Biology Lab
  - UniProt
    - Protein database
  - PDB
    - Protein Data Bank
Cyberinfrastructures in BMI
- Online databases continued
  - But these lack the components to facilitate
    - Collaboration
    - Interdisciplinary research
  - Use centralized resources and are generally managed by the owning research group
  - Data centric
    - Most of the computational architecture is dedicated solely to data access
Cyberinfrastructures in BMI
- Community annotation hubs
  - Open up a centralized database to direct contribution from the research community
  - SDSU Gene Wiki
    - For the community annotation of gene function
  - BMI wikis have been recognized as some of the most sophisticated document repositories
    - Despite being a relatively recent umbrella discipline
  - Still not a complete research environment
    - Could be ‘plugged in’ to the human infrastructure of a complete cyberinfrastructure
Cyberinfrastructures in BMI
- Data sharing
  - Still difficult to share data on disparate information classes
    - Even if they are related through a subset of attribute types
  - Further difficulty of interconnecting similar repositories written by different research groups
    - Differing technologies
    - Differing data representations
  - Recurring difficulty in integrating data
    - Medical data is inherently heterogeneous
      - Massive number of data types involved
      - Data is captured differently because it’s used differently
Cyberinfrastructures in BMI
- Data sharing challenge
  - As Dr. Kevin Sullivan said in the discussion
    - It is difficult for an institution to share their data
    - It can be difficult to argue a business case to do so
      - Institutions may not want people evaluating their treatments or incorrect treatments
      - Research groups get a sense of proprietary ownership over their data
      - Some institutions feel it is not theirs to share
      - Others are skeptical as to how the community would react to their health care provider exposing information to outsiders
Cyberinfrastructures in BMI
- One interoperability solution is web services
  - Provide common technology for heterogeneous data and services to interoperate
  - Common implementations consist of
    - Web Service Description Language (WSDL)
      - Describes capabilities of services
    - Simple Object Access Protocol (SOAP)
  - Researchers never use services directly
    - But rely on the analysis and visualization engines that run on top of these
Cyberinfrastructures in BMI
- Globus
  - Open source libraries
  - Industry heavyweight in web services for many domains
  - Provides mechanisms for
    - Announcing the availability of a computer resource
    - Discovering the resource
    - Invoking the resource
  - Used by BIRN and caBIG
- BioMOBY
  - Similar to Globus but relatively lightweight
  - Used by PlaNet Consortium
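The announce, discover, and invoke steps named above follow a general registry pattern, which can be sketched with a toy in-memory class. This is not the Globus or BioMOBY API: `ResourceRegistry`, its method names, and the resource names are all hypothetical, and a real grid would perform each step over the network with authentication.

```python
from typing import Any, Callable, Dict, List, Tuple


class ResourceRegistry:
    """Toy in-memory registry illustrating announce / discover / invoke."""

    def __init__(self) -> None:
        # resource name -> (capability label, callable that does the work)
        self._resources: Dict[str, Tuple[str, Callable[..., Any]]] = {}

    def announce(self, name: str, capability: str, handler: Callable[..., Any]) -> None:
        """A site advertises that a resource with some capability is available."""
        self._resources[name] = (capability, handler)

    def discover(self, capability: str) -> List[str]:
        """A client looks up the names of resources offering a capability."""
        return sorted(
            name for name, (cap, _) in self._resources.items() if cap == capability
        )

    def invoke(self, name: str, *args: Any) -> Any:
        """A client runs a discovered resource on its own input."""
        _, handler = self._resources[name]
        return handler(*args)


registry = ResourceRegistry()
registry.announce("site-a/mean", "statistics", lambda values: sum(values) / len(values))
registry.announce("site-b/count", "counting", len)

for resource in registry.discover("statistics"):
    print(resource, registry.invoke(resource, [118, 120, 122]))
```

The client never needs to know where a resource lives, only which capability it wants, which is the property that lets geographically separate systems behave like one pool.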
Cyberinfrastructures in BMI
- Ontologies
  - Web services allow heterogeneous data and services to exchange
    - But this does not enforce data semantics
  - Ontologies are used to ensure an unambiguous standard for data
  - Most BMI cyberinfrastructures specify ontologies using OWL
Cyberinfrastructures in BMI
- Biomedical Informatics Research Network (BIRN)
  - Developed a robust software installation & deployment system to implement a BIRN endpoint
    - Host data and contribute computational resources
    - Access shared datasets through web portal
    - Analysis and visualization tools
    - Publish datasets through BIRN data repository
  - Roughly $20k for a BIRN rack
  - Technologies
    - Globus: grid management
    - BIRNLex: ontology
Cyberinfrastructures in BMI
- Cancer Biomedical Informatics Grid
  - Launched 2003
  - Mission
    - Provide a common information platform to support the diverse clinical and basic research of the US National Cancer Institute
      - 87 cancer institutes at the time
  - Highly heterogeneous datasets
Cyberinfrastructures in BMI
- Future of BMI cyberinfrastructures
  - Use of cyberinfrastructure is growing rapidly
  - Grid computing is becoming increasingly efficient
  - Current weaknesses relate to cross-discipline collaboration
    - Each discipline implements an internally consistent grid, but the grids are isolated from each other
    - We need integration and communication among disciplines to investigate further relationships
    - Data interoperability may be resolved by the semantic web
  - Current research in cyberinfrastructures centers on the semantic web concept
Cyberinfrastructures in BMI
- Semantic web in cyberinfrastructures
  - Web services make a strong distinction between data and data operations
    - User identifies service to invoke
    - Formats the input data
    - Invokes the service
    - Unpacks and interprets the results
  - Semantic web is a technology tolerant of diverse data models
    - No data transformation services
    - Just pieces of information and relationships between them
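"Pieces of information and relationships between them" is usually modeled as subject-predicate-object triples. The sketch below assumes that model and uses plain tuples in place of a real RDF store; the identifiers, predicate names, and helper functions are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical facts, with no fixed table schema behind them.
TRIPLES: Set[Triple] = {
    ("patient:1", "hasObservation", "obs:17"),
    ("obs:17", "hasType", "systolic_blood_pressure"),
    ("obs:17", "hasValue", "120"),
    ("patient:1", "hasObservation", "obs:18"),
    ("obs:18", "hasType", "diastolic_blood_pressure"),
    ("obs:18", "hasValue", "80"),
}


def match(
    triples: Set[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


def observations_of(triples: Set[Triple], patient: str) -> Dict[str, str]:
    """Follow relationships from a patient to each observation's type and value."""
    result = {}
    for _, _, obs in match(triples, subject=patient, predicate="hasObservation"):
        kind = match(triples, subject=obs, predicate="hasType")[0][2]
        value = match(triples, subject=obs, predicate="hasValue")[0][2]
        result[kind] = value
    return result


print(observations_of(TRIPLES, "patient:1"))
```

New kinds of facts can be added as extra triples without changing a schema or writing a transformation service, which is the tolerance of diverse data models that the slide refers to.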