Clio Schema Mapping and Data Exchange University of

  • Slides: 26
Download presentation
Clio: Schema Mapping and Data Exchange University of Toronto • Renée J. Miller •

Clio: Schema Mapping and Data Exchange University of Toronto • Renée J. Miller • Periklis Andritsos, Ariel Fuxman • Tasos Kemesetsidis, Yannis Velegrakis IBM Almaden • • 1/22/2022 Laura Haas Ron Fagin, Mauricio Hernández Howard Ho, Phokion Kolaitis, Felix Naumann Lucian Popa, Ling-Ling Yan R. J. Miller 1

Schema Mapping & Data Exchange • Wants data from S • Understands T •

Schema Mapping & Data Exchange • Wants data from S • Understands T • May not understand S • XML Schema • DTD • Relational Source schema S “conforms to” High-level correspondences Mapping Creation Target schema T “conforms to” data Mapping (semantically correct) • User interactively chooses among (or corrects) mappings • Data may be exchanged (requires a translation query) 1/22/2022 R. J. Miller 2

Mapping is not Schema Integration Integrated schema Source schema S 1 “conforms to” Source

Mapping is not Schema Integration Integrated schema Source schema S 1 “conforms to” Source schema Sn “conforms to” data Design Problem: integrated schema designed to match sources 1/22/2022 R. J. Miller 3

Overview n Goal: interoperability between independent data sources – Schema Mapping – Data Exchange

Overview n Goal: interoperability between independent data sources – Schema Mapping – Data Exchange n Challenges – Schemas can be arbitrarily different – Still, data must not lose its meaning during translation • Maximum advantage of semantics embedded in schemas & data – Used in compilation • Facilitate user specification of any additional semantics – As by-product user learns if semantics incorrect/incomplete – Performed manually: complex user queries, programs, etc. n Output: correct data translation program 1/22/2022 R. J. Miller 4

Illustration: Clio Schema Mapping § § Support Nested Structures Element correspondences § § n

Illustration: Clio Schema Mapping § § Support Nested Structures Element correspondences § § n Human friendly Automatic discovery Preserve data meaning n Discover data associations n Use constraints & schema n Create New Target Values n Produce Correct Grouping n And … 1/22/2022 R. J. Miller 5

… Generate Queries (XQuery) <statistics. DB> { FOR $x 0 IN /expense. DB/grant, $x

… Generate Queries (XQuery) <statistics. DB> { FOR $x 0 IN /expense. DB/grant, $x 1 IN /expense. DB/project, $x 2 IN /expense. DB/company WHERE $x 2/cid/text() = $x 0/cid/text() $x 0/project/text() = $x 1/name/text() RETURN <city. Statistics> { FOR $x 0 L 1 IN /expense. DB/grant, $x 1 L 1 IN /expense. DB/project, $x 2 L 1 IN /expense. DB/company WHERE $x 2 L 1/cid/text() = $x 0 L 1/cid/text() $x 0 L 1/project/text() = $x 1 L 1/name/text() $x 2/city/text() = $x 2 L 1/city/text() RETURN <organization> <cid> { $x 0 L 1/cid/text() } </cid> <cname> { $x 2 L 1/name/text() } </cname> { FOR $x 0 L 2 IN /expense. DB/grant, $x 1 L 2 IN /expense. DB/project, $x 2 L 2 IN /expense. DB/company WHERE $x 2 L 2/cid/text() = $x 0 L 2/cid/text() $x 0 L 2/project/text() = $x 1 L 2/name/text() $x 2 L 1/name/text() = $x 2 L 2/name/text() $x 2 L 1/city/text() = $x 2 L 2/city/text() $x 0 L 1/cid/text() = $x 0 L 2/cid/text() RETURN <funding> ………………. . 1/22/2022 R. J. Miller 6

Outline n Schema Mapping (VLDB 00, VLDB 02) – Interpret syntactic correspondences • Using

Outline n Schema Mapping (VLDB 00, VLDB 02) – Interpret syntactic correspondences • Using semantics of schemas and data – Generate logical mapping n Data Translation (VLDB 02) – Query generation – Accommodate differences in data content • Skolemization: used to create missing data n Data Exchange (ICDT 03) – Formal foundation for translation semantics n Understand Mapping (SIGMOD 01) – Explain mappings, resolve ambiguities using data 1/22/2022 R. J. Miller 7

Creating Correspondence § Graphical User Interface § § Automatic Discovery § § Attribute (Element)

Creating Correspondence § Graphical User Interface § § Automatic Discovery § § Attribute (Element) Classifier Extensible to § § § DBA interactively specifies Other Schema Matchers VLDB J. 01 Survey Correspondence based on syntactic information § Within schema or data 1/22/2022 R. J. Miller 8

Interpreting Correspondences expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city

Interpreting Correspondences expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project What semantics do we associate to an arrow? Good enough for one arrow ! Still works for two arrows! How about now ? 1/22/2022 stat. DB: Set of Rcd city. Stat: Rcd orgs: Set of Rcd org: Rcd cid name fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city cid expense. DB. companies cid stat. DB. city. Stat. orgs cid, name expense. DB. companies cid, name stat. DB. city. Stat. orgs gid expense. DB. grants gid stat. DB. city. Stat. orgs. fundings R. J. Miller 9

Data Associations stat. DB: Set of Rcd city. Stat: Rcd orgs: Set of Rcd

Data Associations stat. DB: Set of Rcd city. Stat: Rcd orgs: Set of Rcd org: Rcd cid name fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project n We must recognize that grants are associated to companies n Association (in the source): grants ⋈ companies Association (in the target): stat. DB ⋈ orgs ⋈ fundings ⋈ financials n 1/22/2022 R. J. Miller 10

Logical Associations expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city

Logical Associations expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount sponsor project name gid amount 1/22/2022 M 1 expense. DB. companies expense. DB. grants ⋈ expense. DB. companies stat. DB: Set of Rcd city. Stat: Rcd city orgs: Set of Rcd org: Rcd cid name fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount M 2 name gid amount R. J. Miller stat. DB. city. Stat. orgs stat. DB [ city. Stat. orgs. org. fundings ⋈ city. Stat. financials ] 11

Multiple Logical Associations stat. DB: Set of Rcd city. Stat: Rcd orgs: Set of

Multiple Logical Associations stat. DB: Set of Rcd city. Stat: Rcd orgs: Set of Rcd org: Rcd cid name fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor n Grants may be associated with companies in multiple ways n Association 1: grants ⋈ companies join on cid = cid n Association 2: grants ⋈ companies join on sponsor = cid n We do not make the one flavor assumption (URA) 1/22/2022 R. J. Miller 12

Schema Mapping n Enumerate ALL logical associations consistent with schema semantics • Constraints •

Schema Mapping n Enumerate ALL logical associations consistent with schema semantics • Constraints • Nesting (schema structure) • Data (statistical models of data) n Interpret correspondences (arrows) over pair source & target association 1/22/2022 R. J. Miller 13

Features of Algorithm n Generate associations – Modified dependency chase algorithm – Nested schemas

Features of Algorithm n Generate associations – Modified dependency chase algorithm – Nested schemas – Broad class of (nested) referential constraints • Including (nested) foreign keys • Consider both source and target constraints – Algorithm is complete: generate all associations (in the UR sense) for acyclic dependencies n Logical mapping – Set of (alternative) inter-schema dependencies 1/22/2022 R. J. Miller 14

Mapping Algorithm Attribute-Attribute correspondences Target schema T Source schema S Query (source): - unnest

Mapping Algorithm Attribute-Attribute correspondences Target schema T Source schema S Query (source): - unnest - join Logical Mapping Query (target): - nest - split - create new values 1: Generate logical mappings Source Associations (logical views) 1/22/2022 Target Associations (logical views) 2: Select subset mappings R. J. Miller 3: Data translation 15

Outline n Schema Mapping – Interpret high-level mapping • Relies on semantics of schemas

Outline n Schema Mapping – Interpret high-level mapping • Relies on semantics of schemas – Generate logical (low-level) mapping n Data Translation – Query generation • Retrieve from source (join/unnest) • Insert into target (partition/nest) – Accommodate differences in data content • Skolemization: used to create missing data 1/22/2022 R. J. Miller 16

From Dependencies to Queries n Logical mappings = inter-schema dependencies n Convenient intermediate representation

From Dependencies to Queries n Logical mappings = inter-schema dependencies n Convenient intermediate representation (in Clio) – can be reasoned about (query optimization) – can even be shown (graphically) to a user – but … – not all target attributes determined source – flat representation of how schemas correspond – we need to materialize a nested target 1/22/2022 R. J. Miller 17

Query Generation Issues We translate logical mappings into queries n Two parts, conceptually: n

Query Generation Issues We translate logical mappings into queries n Two parts, conceptually: n – Unnest (join) the source association – Nest (partition) according to the shape of the target association n Skolemization algorithm: the heart of query generation – Achieves a good nesting (grouping) – Generates new values (ids) – Not to many ids, but not too few either ! 1/22/2022 R. J. Miller 18

Query Generation expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city

Query Generation expense. DB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount sponsor project M 2 stat. DB: Set of Rcd city. Stat: Rcd city orgs: Set of Rcd = Sk 2[] org: Rcd cid = Sk 1[name] name fundings: Set of Rcd = Sk 3[name] funding: Rcd gid proj aid = Sk 4[name, gid, amt] financials: Set of Rcd financial: Rcd = Sk 4[name, gid, amt] aid date amount Correspondences map only into some of the atomic attributes n We use Skolem functions to control the creation of the other elements n – sets (this controls how we group elements in the target) – atomic values (this enforces the integrity of the target) 1/22/2022 R. J. Miller 19

Mapping Algorithm QM( 1(Is)) = It Source schema S 1 Source Associations Find Query:

Mapping Algorithm QM( 1(Is)) = It Source schema S 1 Source Associations Find Query: 1/22/2022 Target schema T 2 Target Associations Logical Mapping 1(Is) 2(It ) BUT: we don’t have It we need to create it!! 2 (QM( 1(Is))) = 1(Is) R. J. Miller 20

Correctness Guarantees n Information preservation: – For every logical mapping M: 1 2 generate

Correctness Guarantees n Information preservation: – For every logical mapping M: 1 2 generate query QM : • For any source instance Is : 2 (QM( 1(Is))) = 1(Is) – Map no less and no more data than was intended to be mapped – In fact, 2 and QM are inverse to each other (view inversion) • Lossless data transmission n Target integrity constraints satisfied!! – Target nested referential constraints t – QM( 1(Is)) = It where It ╞ t 1/22/2022 R. J. Miller 21

When Can This Tool Help? n In general, whenever: – schemas are complex (nesting

When Can This Tool Help? n In general, whenever: – schemas are complex (nesting + constraints) or different – it matters whether we lose data or not – it is not obvious to a human user that there are many forms of data that can coexist within a schema n Specific applications: – data migration from one schema to another (schema redesign) – data transmission (between independently created schemas) – data integration and warehousing 1/22/2022 R. J. Miller 22

Project Status: So Far n Can map between any combination of XML schemas/DTDs/Relational schemas

Project Status: So Far n Can map between any combination of XML schemas/DTDs/Relational schemas n The generated queries are XQuery + Skolem fcts – (or XSLT or SQL) n Tested Clio with many complex real data: – Life. Science schemas (private and public Gene. X, Swiss. Prot) – Amalgam integration benchmark, DBLP, Mondial, several others, … n Demos SIGMOD 01 (Relational), ICDE 02 (XML) n Papers: VLDB 00, SIGMOD 01, VLDB 02, ICDT 03 No commercial tool correctly generates XML transformations that are lossless ! 1/22/2022 R. J. Miller 23

Conclusion n Schema mapping framework with solid semantic foundation – Covers relational and core

Conclusion n Schema mapping framework with solid semantic foundation – Covers relational and core of XML Schema – “Maps source associations to target associations” – We build data transformations (XQuery with Skolem functions) rather than just queries n Separated into 2 phases: – Logical mapping (arrows -> dependencies) – Data translation (dependencies -> queries) – It works (and we have users!) 1/22/2022 R. J. Miller 24

Answering Queries Using Views q Source schema S Target schema T st t virtual

Answering Queries Using Views q Source schema S Target schema T st t virtual ! I n J The target is not materialized; it is just a querying interface – Queries are posed on the target schema; data is in the source. – Problem: how to answer the query in the “best” possible way n Classical setting: st have a special form: • GAV: Qs(S) Ti where Ti is a relation in T, Qs is a query on S • LAV: Si Qt(T) where Si is a relation in S , Qt is a query on T • Schema Mapping: Qs(S) Qt(T) and constraints on S and T ( s , t ) n GAV/LAV (mostly) assumes conjunctive queries on relations – (mostly) assumes no target constraints – target is a view 1/22/2022 R. J. Miller 25

Semantics of Query Answering n Answering queries using views – Query is answered using

Semantics of Query Answering n Answering queries using views – Query is answered using source data – Answer is set of tuples in query result on ALL possible target instances: certain answers n Data Exchange – Query is answered using ONE materialized target – Can single target give same information as source(s)? n Is query result the same in both settings? 1/22/2022 R. J. Miller 26