Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture
Peter Brezany, University of Vienna
Collecting Data
• Sources: laboratories, satellites, business, experiments (high energy physics, ...), microscopes and MRI/CT scanners, computer simulations
• Collected data goes into data repositories for analysis
Motivation
• Computational Grid: a new-generation infrastructure
• Challenge: advanced analysis of data managed by the Grid
• Typical data in modern Grid applications:
  – files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
• Our research aims:
  – Phase 1: Knowledge discovery Grid system (GridMiner)
  – Phase 2: Intelligent Grid system (WisdomGrid)
Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions
Background and Related Work
• Basic Grid development (Globus 1): metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Services Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence
GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support
GridMiner (Layered) Abstract Architecture
• Layers, top to bottom: User Interface, Knowledge Grid, Information Grid, Computational & Data Grid
• Flows between layers: Data to Knowledge, Control
• Built on K. G. Jeffery's proposal
GridMiner Conceptual Architecture (figure; includes a Job Control component)
Service Architecture Based on OGSA-DAIS
Data Distribution Scenarios
1. Single data source
2. Federated data sources with different types of partitioning
Example: Vertical and horizontal distribution of the virtual data source
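The two partitioning types named on this slide can be illustrated with a minimal sketch. It is not the GridMiner mediation service: the site names, the table contents, and the helper functions are all hypothetical, and plain Python lists stand in for remote data sources. Horizontal fragments (same columns, different rows) are recombined by a union, and vertical fragments (same keys, different columns) by a join on the shared key.

```python
# Hypothetical fragments of one virtual PATIENTS table (all names are illustrative).
# Horizontal partitioning: each site holds a subset of rows with the same columns.
site_a_rows = [{"pat_id": 1, "age": 34}, {"pat_id": 2, "age": 71}]
site_b_rows = [{"pat_id": 3, "age": 52}]

# Vertical partitioning: another site holds extra columns for the same keys.
site_c_cols = [
    {"pat_id": 1, "blood_type": "A"},
    {"pat_id": 2, "blood_type": "0"},
    {"pat_id": 3, "blood_type": "B"},
]


def union_horizontal(*fragments):
    """Concatenate row fragments that share one schema."""
    return [dict(row) for fragment in fragments for row in fragment]


def join_vertical(left, right, key):
    """Inner-join two column fragments on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def build_virtual_table():
    """Rebuild the virtual table: union the row fragments, then join the columns."""
    rows = union_horizontal(site_a_rows, site_b_rows)
    return join_vertical(rows, site_c_cols, key="pat_id")


virtual_patients = build_virtual_table()
```

A client that sees only `virtual_patients` does not need to know how the data was split across sites, which is the point of presenting a virtual data source.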
Mapping Schema
Grid Data Mediation Services
Architecture of a Data Mining System
Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service
Centralized Data Mining
Parallel and Distributed Data Mining
GridMiner Orchestration Service
GridMiner Job Specification Language
Implementation Prototype
• Implementation of the Mediation Service for horizontal data partitioning
• Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
• We use
  – Weka, a freely available Java-based data mining system, for data preprocessing and data mining tasks (main-memory oriented)
  – a home-grown Java implementation of the SPRINT algorithm (disk-oriented)
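As background for the decision tree services, the following is a minimal sketch of the split evaluation commonly described for SPRINT: a single scan over a pre-sorted attribute list, scoring each candidate threshold with the weighted Gini index. It is written in Python for brevity and is not the authors' Java implementation; the function names and the toy "age" data are assumptions made for illustration.

```python
from collections import Counter


def gini(counts, total):
    """Gini index 1 - sum(p_j^2) of a class histogram."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(attribute_list):
    """Scan a sorted (value, class_label) list once and return
    (threshold, weighted_gini) for the best split 'value <= threshold',
    or None if all values are equal."""
    n = len(attribute_list)
    below = Counter()
    above = Counter(label for _, label in attribute_list)
    best = None
    for i in range(n - 1):
        value, label = attribute_list[i]
        # Move one record from the "above" histogram to the "below" histogram.
        below[label] += 1
        above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue  # cannot split between equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below / n) * gini(below, n_below) \
            + (n_above / n) * gini(above, n_above)
        threshold = (value + next_value) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Toy attribute list for a hypothetical "age" attribute, sorted by value.
ages = sorted([(23, "good"), (30, "good"), (41, "poor"), (55, "poor"), (68, "poor")])
split = best_numeric_split(ages)  # (35.5, 0.0): both sides are pure
```

Because the list is sorted once and the class histograms are updated incrementally, each attribute needs only one sequential pass. This is what makes the approach suitable for disk-resident data.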
Experimental Environment
• Test data suites
  – synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  – TBI (Traumatic Brain Injury) databases
• Grid testbed
  – Vienna
  – CERN
  – Dublin
  – Zagreb
  – Cracow
• Goals in the first phases
  – Verifying model accuracy
  – Measuring the overhead of the service layers
Extending the Functionality
OLAM
Example: Mining Patterns for Data Classification and Associations
Classification query:
  use database dat1, dat2
  mine classifications
  analyze patient_outcome
  using g_parsimony
  display as tree
Association query:
  use database DBs attributes
  mine associations
  using method_attributes
  display as rules
Workflow 1: Interactive Mode
Workflow 2: Batch Mode
Workflow 3: Hybrid Mode
Execution Model Based on Static Workflow
Execution Model Based on Dynamic Workflow
Towards the Wisdom Grid (WG)
WG Architecture
• Components shown in the figure: Domain Knowledge Agents, Knowledge Explorer Agent, End User (personal) Agent, Wisdom Grid Agent Platform, Agent Grid Service, Knowledge Base Service with Knowledge Base (KB), Knowledge Discovery Service, Grid, External Knowledge Base, External Services
Work-Flow
• Participants shown in the figure: External Agents, End User Agent, Knowledge Explorer Agent, Knowledge Agent Service, Knowledge Base Service with Knowledge Base, Knowledge Discovery Service, further services
Knowledge Discovery Service
• Client for other services
• Knowledge Discovery in Databases
  – GridMiner: data mining, on-line analytical processing (OLAP)
• Web Mining
  – semantic web
  – online libraries
  – Web/Grid Services
• Knowledge Explorer Agent
Knowledge Base Service / KB
• KBS: search, query, and expand the Knowledge Base
• KB: database that stores particular data about real objects and the relations between these objects and their properties
  – Consists of ontologies and instances
  – Information about resources on the Web (location, query language): web/grid services, agents, references to online databases
• Languages: XML/RDF/DAML-OIL/DAML-S/OWL
Ontology - example
DAML-OIL language: Patient is Human, Human has Age
  <daml:Class rdf:ID="Human">
    <rdfs:subClassOf>
      <daml:Restriction daml:cardinality="1">
        <daml:onProperty rdf:resource="#Age"/>
      </daml:Restriction>
    </rdfs:subClassOf>
  </daml:Class>
  <daml:DatatypeProperty rdf:ID="Age">
    <rdfs:domain rdf:resource="#Human"/>
  </daml:DatatypeProperty>
  <daml:Class rdf:ID="Patient">
    <rdfs:subClassOf rdf:resource="#Human"/>
  </daml:Class>
Knowledge Base - example
• Concepts shown in the figure: Human, Patient, Temperature, Value, linked by "is" and "has" relations
• Storage details attached to Patient: attribute PAT_ID, table PATIENTS, database jdbc://foo/hospital
Semantic mediator
• Distributed heterogeneous databases
  – Different database schemas
  – Different query languages
  – Different names of attributes/tables... but the same semantics!
• WG enables semantic mediation at a higher level
Semantic mediator (cont.)
• Ontology: Patient is Human; has Age, Blood Type
• Database in Hospital X: table PAT_TAB with columns ID, AGE, BT, ...
• Database in Hospital Z: table PATIENTS with columns PAT_ID, PAT_AGE, PAT_BLOOD_TYPE, ...
• samePropertyAs links: AGE and PAT_AGE; BT and PAT_BLOOD_TYPE
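The mediation idea on this slide can be sketched as a lookup from ontology-level property names to the local table and column names of each hospital. The table and column names below come from the slide. The dictionary layout, the source identifiers, and the `rewrite_query` helper are hypothetical and only illustrate how samePropertyAs links could drive query rewriting.

```python
# Ontology property -> local column, per database. Table and column names are
# taken from the slide; this encoding of the samePropertyAs links is an assumption.
MAPPINGS = {
    "hospital_x": {
        "table": "PAT_TAB",
        "columns": {"id": "ID", "age": "AGE", "blood_type": "BT"},
    },
    "hospital_z": {
        "table": "PATIENTS",
        "columns": {
            "id": "PAT_ID",
            "age": "PAT_AGE",
            "blood_type": "PAT_BLOOD_TYPE",
        },
    },
}


def rewrite_query(properties, source):
    """Translate ontology-level property names into a SELECT for one source."""
    mapping = MAPPINGS[source]
    columns = [f'{mapping["columns"][p]} AS {p}' for p in properties]
    return f'SELECT {", ".join(columns)} FROM {mapping["table"]}'


# One ontology-level request, rewritten for every registered source.
queries = {source: rewrite_query(["id", "age"], source) for source in MAPPINGS}
```

Aliasing every column back to its ontology name means the results from both hospitals share one schema and can be combined without further renaming.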
Distributed Knowledge Base
• Classes and properties identified by URI across knowledge bases: uri:fooY#Human, uri:fooZ#Temperature, uri:fooX#Patient, uri:fooX#Ill_Person
• Relations shown in the figure: is subclass, has property, is same class as
Agent Grid Service
• Gives the system the ability to communicate with the outside world in standard languages
  – FIPA standards
  – ACL: Agent Communication Language
  – KQML: Knowledge Query and Manipulation Language
• Agent platform (JADE, FIPA-OS)
• Agents
  – Domain Knowledge Agent
  – Knowledge Explorer Agent
  – End-user Agent (personal)
Querying
• End-user agent
  – with own ontology: subset of ontology, merging of ontologies
  – without own ontology: negotiating about domain of interest
• Queries created from ontology
• Templates
  <Patient rdf:ID="ID001">
    <Temperature/>
  </Patient>
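To make the template idea concrete, here is a minimal sketch that reads the slide's template and extracts the class, the instance identifier, and the properties being asked for. Reading an empty property element as "value requested" is an interpretation, not something the slides state. The default namespace URI is a placeholder added so that the fragment parses, and `parse_template` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's template, wrapped with namespace declarations so it parses.
# The default namespace URI is a placeholder, not taken from the slides.
TEMPLATE = f"""
<Patient xmlns="http://example.org/onto#" xmlns:rdf="{RDF_NS}" rdf:ID="ID001">
  <Temperature/>
</Patient>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_template(xml_text):
    """Return the class, instance ID, and requested properties of a template."""
    root = ET.fromstring(xml_text.strip())
    return {
        "class": local_name(root.tag),
        "instance": root.get(f"{{{RDF_NS}}}ID"),
        "requested": [local_name(child.tag) for child in root],
    }


query = parse_template(TEMPLATE)
```

The resulting dictionary is the kind of structure a knowledge base service could match against its ontology before looking up instance data.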
Answers
• Mined knowledge (GridMiner)
  – Decision trees/rules
    » (clinical pathways)
  – Association rules
• Instances of domain ontology
  – Particular data
  – References
  – Links to Web sites
  – Information about other knowledge providers
Case Study - Medical Application
• Q: Outcome? + data about patient's condition
• A: probability of survival + references to the diagnoses
• Components shown in the figure: End User (personal) Agent, Knowledge Agent, Knowledge Explorer Agent, Semantic Web/Grid, Knowledge Base, Knowledge Discovery Service, GridMiner resources (training set, test set), Hospital Databases
Conclusions and Future Work
• Application and extension of Grid technology to knowledge discovery, an important but nontraditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
  – Performance evaluation on large synthetic data volumes
  – Coupling of the Data Mining services architecture with the OLAP services architecture
  – Development of a knowledge discovery oriented Grid Workflow Language and the appropriate Workflow Engine
  – Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
  – Development of the Wisdom Grid