Semantic Data Integration: From Syntax and Structural Transformations to Semantics

- Slides: 37

Semantic Data Integration: From Syntax and Structural Transformations to Semantics
Bertram Ludäscher
LUDAESCH@SDSC.EDU
Data and Knowledge Systems, San Diego Supercomputer Center, U.C. San Diego

Outline
• Information Integration from a DB Perspective
• Part I: XML-Based Mediation
  – wrapper/mediator approach
  – based on querying semistructured data & XML
• Part II: Model-Based Mediation
  – basic ideas & architecture, lifting data to knowledge sources
  – “glue maps” (domain maps, process maps)
  – formal framework: Description Logic, Frame-Logic
  – ongoing/future research: mix of DB & KR techniques
• Summary

An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: information integration as “One-World” mediation; sites shown: addall.com, amazon.com, barnes&noble.com, half.com, A1books.com]

A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Figure: information integration as “Multiple-Worlds” mediation; sources shown: Realtor, Crime Stats, School Rankings, Demographics]

A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Figure: information integration as “Complex Multiple-Worlds” mediation; sources shown: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]

A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: information integration as “Complex Multiple-Worlds” mediation; sources shown: protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]

Information Integration from a DB Perspective
• Information Integration Challenge
  – Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
  – Find: the answers to Q_1, ..., Q_n
• The Database Perspective: source = “database”
  ⇒ S_i has a schema (relational, XML, OO, ...)
  ⇒ S_i can be queried
  ⇒ define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages
  ⇒ questions become queries Q_i against V(S_1, ..., S_k)
• Why a Database Perspective?
  – scalability, efficiency, reusability (declarative queries), ...
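As a minimal sketch of the database perspective above, the following Python snippet treats two book sources as tables, defines a virtual integrated view V over them, and poses the shopper's question as a query against V. The table names, columns, and prices are hypothetical, and SQLite only stands in for whatever query language a real mediator would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources S_1 and S_2 with different schemas.
    CREATE TABLE s1_offers (title TEXT, price REAL, shipping REAL, days INTEGER);
    CREATE TABLE s2_listing (book TEXT, total_cost REAL, delivery_days INTEGER);

    INSERT INTO s1_offers VALUES ('Tractatus Logico-Philosophicus', 12.0, 4.0, 5);
    INSERT INTO s1_offers VALUES ('Philosophical Investigations', 20.0, 4.0, 3);
    INSERT INTO s2_listing VALUES ('Tractatus Logico-Philosophicus', 14.5, 3);
    INSERT INTO s2_listing VALUES ('Tractatus Logico-Philosophicus', 11.0, 10);

    -- Virtual integrated view V(S_1, S_2): one common schema over both sources.
    CREATE VIEW v AS
        SELECT 's1' AS source, title, price + shipping AS total, days FROM s1_offers
        UNION ALL
        SELECT 's2' AS source, book, total_cost, delivery_days FROM s2_listing;
""")


def cheapest(title, max_days):
    """The user question as a query Q against V: cheapest offer within max_days."""
    return conn.execute(
        "SELECT source, total FROM v WHERE title = ? AND days <= ? "
        "ORDER BY total LIMIT 1",
        (title, max_days),
    ).fetchone()


best = cheapest("Tractatus Logico-Philosophicus", 7)
print(best)
```

The view is never materialized here: each call to `cheapest` is evaluated against the underlying source tables, which mirrors the virtual (on-demand) option named on the slide.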

Technical Issues and Challenges
• Integration Method and Architecture
  – federated DBs, wrapper-mediator approach, GAV/LAV, warehouse/on-demand, ...
• Suitable KRDB Formalisms and Frameworks
  – XML, DTDs/XML Schema, XPath, XQuery, ...
  – RDF(S), Ontologies, Description Logics, DAML+OIL, ...
  – querying, deduction, subsumption, classification, ...
• Algorithms and Implementation
  – query composition, rewriting, reasoning, source capabilities, ...
• Information Integration Scenario and Scope
  – simple/complex, single/multiple worlds, ...

Information Integration Landscape
[Figure: integration scenarios plotted by conceptual complexity/depth (low to high) against conceptual distance (one-world to multiple-worlds). Regions: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation. Labeled points: addall, book-buyer, home-buyer, 24x7 consumer, BLAST, Entrez, MIA, Tambis, RiboWeb, EcoCyc, GO, UMLS, Cyc, WordNet, Bioinformatics, Geoinformatics]

PART I: XML-Based Mediation

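Abstract (XML-Based) Mediator Architecture
[Figure, top to bottom:]
• USER/Client: query Q o V(S_1, ..., S_k) against the Integrated XML View V
• MEDIATOR: Integrated View Definition IVD(S_1, ..., S_k)
• XML Queries & Results exchanged between mediator and wrappers
• Wrappers: one per source, each exporting an XML View
• Sources: S_1, S_2, ..., S_k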

XMAS: XML Matching And Structuring language

  CONSTRUCT <books>
              <book>
                $a1
                $t
                <pubs> $p {$p} </pubs>
              </book> {$a1, $t}
            </books>
  WHERE <books.book> $a1 : <author/> $t : <title/> </> IN WRAP(“amazon.com”)
    AND <authors.author> $a2 : <author/> <pubs> $p : <pub/> </> IN WRAP(“www...DBLP…”)
    AND value($a1) = value($a2)

XMAS Integrated View Definition: “Find publications from amazon.com and DBLP, join on author, group by authors and title”
[Figure: XMAS Algebra]
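The join-and-group behavior of the view definition above can be approximated in a few lines of Python. This is only a sketch of the idea, not an XMAS implementation: the two XML strings are hypothetical wrapper outputs, the element name `name` in the DBLP-like document is an assumption, and the code assumes each (author, title) pair occurs once in the book source.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical wrapper outputs; real wrappers would fetch and translate source data.
BOOK_SOURCE = """<books>
  <book><author>Ada</author><title>Title One</title></book>
  <book><author>Bob</author><title>Title Two</title></book>
</books>"""
PUB_SOURCE = """<authors>
  <author><name>Ada</name><pubs><pub>Pub A1</pub><pub>Pub A2</pub></pubs></author>
  <author><name>Cy</name><pubs><pub>Pub C1</pub></pubs></author>
</authors>"""


def integrated_view(book_xml, pub_xml):
    books = ET.fromstring(book_xml)
    authors = ET.fromstring(pub_xml)

    # Index publications by author value (the join attribute).
    pubs_by_author = defaultdict(list)
    for entry in authors.findall("author"):
        name = entry.findtext("name")
        pubs_by_author[name].extend(p.text for p in entry.findall("pubs/pub"))

    # Join on author value; group the publications under each (author, title).
    result = ET.Element("books")
    for book in books.findall("book"):
        author, title = book.findtext("author"), book.findtext("title")
        if author not in pubs_by_author:
            continue  # no join partner in the publication source
        out = ET.SubElement(result, "book")
        ET.SubElement(out, "author").text = author
        ET.SubElement(out, "title").text = title
        pubs = ET.SubElement(out, "pubs")
        for p in pubs_by_author[author]:
            ET.SubElement(pubs, "pub").text = p
    return result


view = integrated_view(BOOK_SOURCE, PUB_SOURCE)
print(ET.tostring(view, encoding="unicode"))
```

Only "Ada" appears in both sources, so the integrated view contains a single `<book>` with her two publications nested under `<pubs>`.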

PART II: Model-Based Mediation

What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – DTDs talk about element nesting
  – XML Schema schemas give you data types
  – need anything else? => write comments!
• Domain Semantics is complex:
  – implicit assumptions, hidden semantics
  ⇒ sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  1. employ richer OO models
  2. make domain semantics and “glue knowledge” explicit
  3. use ontologies to fix terminology and conceptualization
  4. avoid ambiguities by using formal semantics

XML-Based vs. Model-Based Mediation
• XML-Based Mediation
  – Integrated-DTD := XML-QL(Src1-DTD, ...)
  – No Domain Constraints
  – Structural Constraints (DTDs): Parent, Child, Sibling, ...; e.g., A = (B*|C), D; B = ...
  – layers: Raw Data, XML Elements, XML Models
• Model-Based Mediation
  – CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}
  – CM-QL ~ {F-Logic, DAML+OIL, …}
  – Integrated-CM := CM-QL(Src1-CM, ...)
  – Glue Maps: DMs, PMs
  – Logical Domain Constraints (IF ... THEN ...): Classes, Relations, is-a, has-a, ...
  – layers: Raw Data, (XML) Objects, Conceptual Models

What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
  – (X, Y) := X.SSN = Y.SSN
  – (X, Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  – (X, Y, Score) := BLAST(X, Y, Score)
• Semantic/Rule-Based Joins
  – (X, Y, C) := X isa C, Y isa C, BLAST(X, Y, S), S > 0.8 (homology, lub)
  – (X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y. (rule-based)
    e.g., X = -secretase, B = beta amyloid, Y = Alzheimer’s disease
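The three kinds of links above can be contrasted in a small Python sketch. Everything here is illustrative: `difflib.SequenceMatcher` is a crude stand-in for BLAST, the sequences and the class name `family_1` are made up, and the semantic join accepts any common class rather than computing a least upper bound.

```python
from difflib import SequenceMatcher


def similarity(x, y):
    """Stand-in for BLAST: a string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()


def syntactic_join(xs, ys, key_x, key_y):
    """(X, Y) := X.key_x = Y.key_y"""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]


def similarity_join(xs, ys):
    """(X, Y, Score) := BLAST(X, Y, Score)"""
    return [(x, y, similarity(x, y)) for x in xs for y in ys]


def semantic_join(xs, ys, isa, threshold=0.8):
    """(X, Y, C) := X isa C, Y isa C, BLAST(X, Y, S), S > threshold"""
    return [
        (x, y, c)
        for x in xs
        for y in ys
        for c in sorted(isa.get(x, set()) & isa.get(y, set()))
        if similarity(x, y) > threshold
    ]


# Hypothetical data.
people = [{"SSN": "1", "name": "a"}]
records = [{"SSN": "1", "dept": "x"}, {"SSN": "2", "dept": "y"}]
seqs_x = ["ACGTACGTAC"]
seqs_y = ["ACGTACGTAA", "TTTTGGGGCC"]
isa = {s: {"family_1"} for s in seqs_x + seqs_y}

equal_pairs = syntactic_join(people, records, "SSN", "SSN")
scored_pairs = similarity_join(seqs_x, seqs_y)
homologs = semantic_join(seqs_x, seqs_y, isa)
print(equal_pairs, scored_pairs, homologs, sep="\n")
```

The syntactic join needs only the data, the similarity join needs a domain-specific operator, and the semantic join additionally needs class membership, which is the kind of knowledge a glue map would supply.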

Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (GMs):
    ⇒ domain maps DMs (ontology) = terminological knowledge: concepts + roles
    ⇒ process maps PMs = “procedural knowledge”: states + transitions

... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD ...
  – ... rewrites (Q o IVD), sends subqueries to sources ...
  – ... post-processes returned results (e.g., situate in context)
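The runtime steps above (compose, send subqueries, post-process) can be illustrated with a toy mediator in Python. This is a deliberately simplified sketch: the "view definition" is just an attribute renaming per source, the only supported query is an equality selection, and the wrapper factory, source data, and context table are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
# A wrapper answers a subquery (here: an equality selection) over one source.
Wrapper = Callable[[Dict[str, object]], Iterable[Row]]


def make_wrapper(rows: List[Row], rename: Dict[str, str]) -> Wrapper:
    """Wrap a source: translate view attributes to source attributes and back."""
    back = {src: view for view, src in rename.items()}

    def answer(selection):
        src_sel = {rename[k]: v for k, v in selection.items()}
        for row in rows:
            if all(row.get(k) == v for k, v in src_sel.items()):
                yield {back[k]: v for k, v in row.items() if k in back}

    return answer


class Mediator:
    def __init__(self, wrappers: Dict[str, Wrapper]):
        self.wrappers = wrappers

    def query(self, selection, context=None):
        results = []
        # Rewrite the user query into one subquery per source and send it.
        for name, wrapper in self.wrappers.items():
            for row in wrapper(selection):
                # Post-process: record provenance and situate the answer in context.
                row = dict(row, source=name)
                if context:
                    row["context"] = context.get(row.get("region"))
                results.append(row)
        return results


src_a = make_wrapper(
    [{"prot": "P1", "loc": "cerebellum"}, {"prot": "P2", "loc": "cortex"}],
    {"protein": "prot", "region": "loc"},
)
src_b = make_wrapper(
    [{"protein_name": "P1", "brain_region": "hippocampus", "lab": "x"}],
    {"protein": "protein_name", "region": "brain_region"},
)
mediator = Mediator({"A": src_a, "B": src_b})
context = {"cerebellum": "hindbrain", "hippocampus": "forebrain"}
answers = mediator.query({"protein": "P1"}, context)
print(answers)
```

In a real model-based mediator the renaming would be replaced by declarative IVD rules and the context lookup by navigation in a domain map; the control flow, however, follows the same three steps.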

Model-Based Mediator Architecture
[Figure, top to bottom:]
• USER/Client
• CM (Integrated View), defined by the Integrated View Definition IVD
• “Glue” Maps GMs: Domain Maps DMs and Process Maps PMs, providing the semantic context CON(S)
• Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine; GCM
• CM Queries & Results (exchanged in XML)
• CM-Wrappers (XML-Wrappers) exporting CM(S) = OM(S) + KB(S) + CON(S) as CM S1, CM S2, CM S3
• Sources: S1, S2, S3
First results & demos: KIND prototype, formal DM semantics, PMs [SSDBM 00] [VLDB 00] [ICDE 01] [NIH-HB 01]

Domain Maps (Ontologies) as Glue Knowledge Sources
• Domain Map = Ontology
  – representation of terminological knowledge
• Use in Model-Based Mediation
  – (derived) concepts as “drop points”, “anchor points”, “context” for source classes
  – compile-time use: view definition, subsumption, classification, ...
  – runtime use: querying/deduction, path queries, ...
• Formalisms:
  – Semantic nets, Thesauri, Frame-logic, Description logics, ...

Ontologies
• So what is an Ontology?
  – definition of things that are relevant to your application
  – representation of terminological knowledge (“TBox”)
  – explicit specification of a conceptualization
  – concept hierarchy (“is-a”)
  – further semantic relationships between concepts
  – abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
  – NCMIR ANATOM
  – GO (Gene Ontology)
  – UMLS (Unified Medical Language System)
  – CYC

Formalism for Ontologies: Description Logic
• DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)

Description Logics
• Terminological Knowledge (TBox)
  – Concept Definition (naming of concepts):
  – Axiom (constraining of concepts):
  => a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
  – the marked neuron in image 27
  => the concrete instances/individuals of the concepts/classes that your sources export

Description Logic Statements as F-logic Rules
• In F-logic:

    X : happyFather :- X : man, (X..child) : blue, (X..child) : green,
                       not ((X..child) : poorunhappyChild).
    C : poorunhappyChild :- not C : rich, not C : happy.

• Alternatively: DLs as fragments of First-Order Logic
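Read back into Description Logic notation, the F-logic rules above correspond to a concept definition along the following lines, together with its first-order reading. This is a reconstruction from the rules, not a quotation of the original example: the concept and role names simply mirror the F-logic identifiers, and the rules by themselves only state the sufficient ("if") direction.

```latex
\begin{align*}
\mathit{HappyFather} \;\doteq\;\; & \mathit{Man}
   \sqcap \exists\,\mathit{child}.\mathit{Blue}
   \sqcap \exists\,\mathit{child}.\mathit{Green}
   \sqcap \forall\,\mathit{child}.(\mathit{Rich} \sqcup \mathit{Happy}) \\[4pt]
\forall x\,\bigl(\mathit{HappyFather}(x) \leftrightarrow\;
   & \mathit{Man}(x)
   \land \exists y\,(\mathit{child}(x,y) \land \mathit{Blue}(y))
   \land \exists y\,(\mathit{child}(x,y) \land \mathit{Green}(y)) \\
   & \land \forall y\,(\mathit{child}(x,y) \rightarrow
        \mathit{Rich}(y) \lor \mathit{Happy}(y))\bigr)
\end{align*}
```

The universal restriction comes from the negated rule body: "no child is neither rich nor happy" is the same as "every child is rich or happy".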

Querying vs. Reasoning
• Querying:
  – given a DB instance I (= logic interpretation), evaluate a query expression $\varphi$ (e.g., SQL, FO formula, Prolog program, ...)
  – boolean query: check if $I \models \varphi$ (i.e., if I is a model of $\varphi$)
  – (ternary) query: $\{ (X, Y, Z) \mid I \models \varphi(X, Y, Z) \}$
  => check happyFathers in a given database
• Reasoning:
  – check if $I \models \varphi$ implies $I \models \psi$ for all databases I,
  – i.e., if $\varphi \Rightarrow \psi$
  – undecidable for FO, F-logic, etc.
  – Description Logics are decidable fragments
  ⇒ concept subsumption, concept hierarchy, classification
  ⇒ semantic tableaux, resolution, specialized algorithms
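To make the querying side concrete, the sketch below evaluates the happyFather rules over one small database instance in Python. The instance (individuals `al`, `bo`, `c1` to `c3`) is hypothetical, and negation is treated as failure: a child counts as not rich simply because the instance does not list it as rich. Reasoning, by contrast, would have to argue about all instances and is not attempted here.

```python
# Hypothetical database instance I: class memberships and the child role.
members = {
    "man": {"al", "bo"},
    "blue": {"c1", "c3"},
    "green": {"c2", "c3"},
    "rich": {"c1"},
    "happy": {"c2"},
}
child = {"al": {"c1", "c2"}, "bo": {"c3"}}


def is_a(x, cls):
    return x in members.get(cls, set())


def poor_unhappy_child(c):
    # Negation as failure: c is not known to be rich and not known to be happy.
    return not is_a(c, "rich") and not is_a(c, "happy")


def happy_father(x):
    kids = child.get(x, set())
    return (
        is_a(x, "man")
        and any(is_a(k, "blue") for k in kids)
        and any(is_a(k, "green") for k in kids)
        and not any(poor_unhappy_child(k) for k in kids)
    )


happy_fathers = sorted(x for x in members["man"] if happy_father(x))
print(happy_fathers)
```

Here `al` qualifies (a blue rich child and a green happy child), while `bo` does not, because his only child is neither rich nor happy.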

What’s in an Answer? (What’s in a Link? revisited)
• Semantic/Rule-Based Joins
  – (X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y. (rule-based)
    e.g., X = -secretase, B = beta amyloid, Y = Alzheimer’s disease
• What is the Erdős number of person P?
  – 3
• Really? Why?
  – authority based: <VIP> said so
  – faith based: don’t know but believe firmly
  – query statement Q = ... derived it from DB
  – query Q = ... derived it from DB and KB using derivation D
  ⇒ logic-based systems often “come with explanations”
  ⇒ “computations as proofs”
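The difference between a bare answer ("3") and an answer with its derivation can be shown with a short Python sketch. It computes an Erdős number by breadth-first search over a coauthor graph and returns the chain of coauthors as the explanation. The graph and the names `A`, `B`, `P` are hypothetical.

```python
from collections import deque


def erdos_number(coauthors, person, root="Erdős"):
    """Return (number, path from person back to root), or None if unconnected."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if current == person:
            # Rebuild the derivation by following parent links back to the root.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return len(path) - 1, path
        for nxt in sorted(coauthors.get(current, ())):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical coauthor graph (symmetric relation).
coauthors = {
    "Erdős": {"A"},
    "A": {"Erdős", "B"},
    "B": {"A", "P"},
    "P": {"B"},
}

answer = erdos_number(coauthors, "P")
print(answer)
```

The returned path plays the role of the derivation D on the slide: it justifies the number by listing the coauthorship facts that were used.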

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
• Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
  – Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.
  – Dendritic spines are ion (calcium) regulating components.
  – Spines have ion binding proteins.
  – Neurotransmission involves ionic activity (release).
  – Ion-binding proteins control ion activity (propagation) in a cell.
  – Ion-regulating components of cells affect ionic activity (release).
[Figure: Domain Expert Knowledge, Domain Map (DM), DM in Description Logic]
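A domain map of this kind can be prototyped as a list of labeled edges. The Python sketch below encodes the expert statements above as (concept, role, concept) triples and answers a simple path query by breadth-first search. The concept and role names are one possible reading of the sentences, chosen for illustration, and do not reproduce the actual SYNAPSE/NCMIR domain map.

```python
from collections import deque

# (concept, role, concept) triples: an assumed encoding of the expert statements.
EDGES = [
    ("Purkinje cell", "has", "Dendrite"),
    ("Pyramidal cell", "has", "Dendrite"),
    ("Dendrite", "has", "Branch"),
    ("Branch", "contains", "Spine"),
    ("Spine", "isa", "Ion-regulating component"),
    ("Spine", "has", "Ion-binding protein"),
    ("Ion-binding protein", "controls", "Ion activity"),
    ("Ion-regulating component", "affects", "Ion activity"),
    ("Neurotransmission", "involves", "Ion activity"),
]


def find_path(edges, start, goal):
    """Breadth-first search for a shortest labeled path from start to goal."""
    out = {}
    for src, role, dst in edges:
        out.setdefault(src, []).append((role, dst))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for role, dst in out.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, role, dst)]))
    return None


path = find_path(EDGES, "Purkinje cell", "Ion activity")
for src, role, dst in path:
    print(f"{src} --{role}--> {dst}")
```

The path found links a cell type exported by one source to ionic activity studied by another, which is the sense in which the domain map acts as glue between otherwise unrelated sources.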

Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
⇒ sources can register new concepts at the mediator ...

Example: ANATOM Domain Map

Browsing Registered Data with Domain Maps

Query Processing “Demo”
Integrated View Definition
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)

  DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
  IF I: protein_label_image[proteins ->> {Protein};
                            organism -> Organism;
                            anatomical_structures ->> {AS: anatomical_structure[name->Anatom]}],   % from PROLAB
     NAE: neuro_anatomic_entity[name->Anatom;                                                       % from ANATOM
                                located_in->>{Brain_region}],
     AS..segments..features[name->Feature_name; value->Value].

Contextualization CON(Result) wrt. ANATOM
Query results in context

Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  – FaCT description logic reasoner for DMs?
  – or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT 97, PODS 97, TCS 00]
• Modeling “Process Knowledge” => Process Maps
  – formal semantics? (dynamic/temporal/Kripke models?)
  – executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  – expressible in F-logic [Inf. Systems 98]
  – scalability? (UMLS Domain Map has millions of entries)
• ...

Process Maps with Abstractions and Elaborations: => From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges: how about these?

Summary: Mediation Scenarios & Techniques

  Federated Databases      | XML-Based Mediation      | Model-Based Mediation
  -------------------------+--------------------------+-------------------------------
  One-World                | One-/Multiple-Worlds     | Complex Multiple-Worlds
  Common Schema            | Mediated Schema          | Common Glue Maps
  SQL, rules               | XML query languages      | DOOD query languages
  Schema Transformations   | Syntax-Aware Mappings    | Semantics-Aware Mappings
  Syntactic Joins (both columns)                      | “Semantic” Joins via Glue Maps
  DB expert (both columns)                            | KRDB + domain expert

Models and Formal Approaches: Relating Theory to the World
[Figure © 2000 by John F. Sowa, http://www.jfsowa.com/krbook/, Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole, Pacific Grove, CA.]
All models are wrong, but some are useful!

Questions? Queries?

Some References
• XML-Based and Model-Based Mediation:
  – MBM: Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, 2001.
  – VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT), Konstanz, Germany, LNCS 1777, Springer, 2000.
  – DOOD: Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• STATELOG (Logic Programming with States)
  – On Active Deductive Databases: The Statelog Approach, G. Lausen, B. Ludäscher, W. May. In Transactions and Change in Logic Databases, H. Decker, B. Freitag, M. Kifer, A. Voronkov (eds.), LNCS 1472, Springer, 1998.
• Argumentation Frameworks as Games
  – Games and Total Datalog¬ Queries, J. Flum, M. Kubierschky, B. Ludäscher, Theoretical Computer Science, 239(2), pp. 257-276, Elsevier, 2000.
  – Referential Actions as Logical Rules, B. Ludäscher, W. May, G. Lausen, Proc. 16th ACM Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, ACM Press, 1997.