Interactive Mapping Specification with Exemplar Tuples Angela Bonifati

Interactive Mapping Specification with Exemplar Tuples. Angela Bonifati, Ugo Comignani, Emmanuel Coquery, Romuald Thion. SIGMOD ’17, May 14–19, 2017, Chicago, IL, USA. 2019/04/23, maeda

Author: Angela Bonifati • 1-02: Ph.D. degree with highest honors at Politecnico di Milano (Italy). Ph.D. thesis: "Reactive Services for XML Repositories". Advisor: Prof. Stefano Ceri. • 7-97: M.S. degree with highest honors in Computer Science Engineering at Università della Calabria, Rende (Italy). M.S. thesis: "Knowledge Discovery on Database Schemas To Find Out Structural Properties". Advisor: Prof. Mimmo Saccà. • Publications: https://perso.liris.cnrs.fr/angela.bonifati/publications.shtml • Full Professor in Computer Science at Lyon 1 University (France), affiliated with the Liris research lab and member of the Database team (since 2015).

Introduction • Schema mappings are declarative specifications, typically in first-order logic, of the semantic relationship between elements of a source schema and a target schema. • Several paradigms have been proposed to help data architects specify engineered mappings.

Introduction • The first paradigm relies on visual specification of mappings using user-friendly graphical interfaces. • These help data architects design a mapping between schemas in a high-level notation. • However, the generation of mappings in a programming language or a query language from the graphical primitives depends on the specific tool.

Introduction • In a second paradigm, model management operators have been proposed to provide a general-purpose mapping designer that can be adapted to a wide variety of tools for data programmability. • The third paradigm is to generate the desired mappings from representative data examples.

Introduction • Notwithstanding the progress made in mapping specification thanks to these approaches, all of the above paradigms are intended for expert users. • The authors set forth a novel approach for Interactive Mapping Specification (IMS) that bootstraps with exemplar tuples, i.e., a limited number of tuples provided by non-expert users.


Introduction • They define a mapping specification process for non-expert users that bootstraps with exemplar tuples and works for general GLAV mappings. • They prove that the refinement process always produces a more general mapping than the canonical mapping. • They experimentally gauge the effectiveness of their approach by comparing the sizes of exemplar tuples with the sizes of universal solutions.

PRELIMINARIES

Exemplar tuples and mappings

Example 2

Refinement of mappings

Atom refinement • Normalization produces a split-reduced mapping from the canonical mapping, in which each tgd has a large left-hand side, say φ. • However, some atoms in φ may be irrelevant, preventing the tgd from being triggered and causing further ambiguities. • Algorithm 1 applies atom refinement to each normalized tgd to alleviate these ambiguities.

Semilattice for Atom Refinement

Example 3, 4 • Atom set: {T(idF₀, idF₁, idAg); TA(idAg, name, t); F(idF₀, t′, t, idAir); F(idF₁, t, t′, idAir); A(idAir, name′, t)} • Subsets: {F(idF₀, t′, t, idAir); A(idAir, name′, t)} and {T(idF₀, idF₁, idAg); A(idAir, name′, t)} • The set of universally quantified variables in (3) is {t, idF₀, name′}.

Exploring the semilattice • During the exploration of the space of possible candidates, the user is asked about one element of the semilattice at a time. • When the user validates a candidate, all supersets of that candidate can be excluded from further exploration, which effectively prunes the search space.
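The pruning idea can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the function name `explore_bottom_up`, the callback `is_sufficient` (standing in for the user's yes/no answer), and the plain subset enumeration are all assumptions made for illustration, and the sketch ignores any extra conditions the paper places on candidates.

```python
from itertools import combinations

def explore_bottom_up(atoms, is_sufficient):
    """Bottom-up, level-by-level walk over subsets of `atoms`.

    `is_sufficient` is a hypothetical oracle playing the role of the user.
    Once a candidate is validated, none of its supersets is asked about.
    """
    accepted = []
    for size in range(1, len(atoms) + 1):
        for candidate in combinations(atoms, size):
            cand = frozenset(candidate)
            # Prune: a superset of a validated candidate needs no question.
            if any(ok <= cand for ok in accepted):
                continue
            if is_sufficient(cand):
                accepted.append(cand)
    return accepted
```

With five atoms labelled T, TA, F0, F1, A and a simulated user who accepts any set containing both F0 and A, the walk returns the single candidate {F0, A} and skips every larger set that contains it.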

Example 5 • Question: “Are the tuples F(f₀, Miami, L.A., a₁) and A(a₁, AF, L.A.) enough to produce Arr(a₁, L.A., f₀, P₂) and Co(P₂, AF, L.A.)?” • Answer: Yes. • Resulting tgd: F(idF₀, t′, t, idAir) ∧ A(idAir, name′, t) → ∃idC₂, Arr(t, idF₀, idC₂) ∧ Co(idC₂, name′, t)

Example 5 • Question: “Are the tuples T(travel₀, f₁, a₁) and A(a₁, AF, L.A.) enough to produce Arr(A₁, L.A., f₀, P₂) and Co(P₂, AF, L.A.)?” • Answer: No. • This implies that the exploration must continue on the next level of the semilattice, namely on the sets {T; TA; A} and {T; F₂; A}.

Questioning about atoms set validity

Example 6 • e = {F(idF₀, t′, t, idAir); A(idAir, name′, t)} • These atoms are built from the set E′_S = {F(f₀, Miami, L.A., a₁); A(a₁, AF, L.A.)}, a subset of the instance E_S. • E′_T = {Arr(A₁, L.A., f₀, P₂); Co(P₂, AF, L.A.)}

Join refinement between variables of a tgd • In relational data, multiple occurrences of the same value do not necessarily imply a semantic relationship between the attributes containing that value. • The canonical mapping may therefore introduce irrelevant joins in the left-hand side of the tgds. • To produce the mapping the user has in mind, relevant joins must first be distinguished from irrelevant ones.

Join refinement between variables of a tgd • The set of all partitions of W forms a complete lattice under the partial order P₀ ≤ P₁ ⇔ ∀x, y ∈ W, x ≡_P₀ y ⇒ x ≡_P₁ y.
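The partial order on partitions can be checked directly from its definition. The following Python sketch is only an illustration; the helper names `same_block` and `refines`, and the representation of a partition as a list of sets, are assumptions rather than anything taken from the paper.

```python
def same_block(partition, x, y):
    """True if x and y are equivalent under `partition` (same block)."""
    return any(x in block and y in block for block in partition)

def refines(p0, p1):
    """P0 <= P1: every pair equivalent in P0 is also equivalent in P1."""
    elements = [x for block in p0 for x in block]
    return all(
        same_block(p1, x, y)
        for x in elements
        for y in elements
        if same_block(p0, x, y)
    )

# Three partitions of W = {t1, t2, t3, t4}, from finest to coarsest.
finest = [{"t1"}, {"t2"}, {"t3"}, {"t4"}]
middle = [{"t1", "t3"}, {"t2", "t4"}]
coarsest = [{"t1", "t2", "t3", "t4"}]
```

Here `refines(finest, middle)` and `refines(middle, coarsest)` both hold, while `refines(coarsest, middle)` does not, since t1 and t2 share a block only in the coarsest partition.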

Example 7 • Starting tgd: F(idF₀, t′, t, idAir) ∧ A(idAir, name′, t) → ∃idC₂, Arr(t, idF₀, idC₂) ∧ Co(idC₂, name′, t) • Each occurrence of t is replaced with a fresh variable (namely t₁, t₂, t₃ and t₄), yielding: F(idF₀, t′, t₁, idAir) ∧ A(idAir, name′, t₂) → ∃idC₂, Arr(t₃, idF₀, idC₂) ∧ Co(idC₂, name′, t₄)

Join refinement between variables of a tgd • Well-formed partitions are equipped with an upper semilattice structure: given two partitions P and P′, if P ≤ P′ and P is well-formed, then P′ is well-formed as well. • In particular, if P ≤ P′, then all unifications encoded by P are also encoded in P′. • These criteria are used to prune the search space during the exploration of the semilattice of occurrences of x, where x is a variable in a tgd σ = φ → ψ.

Example 8 • Following Example 7, t₃ and t₄ must each be in a block containing either t₁ or t₂. This means that partitions containing one of the blocks {t₃}, {t₄} or {t₃; t₄} are not well-formed and are excluded. • tgd: F(idF₀, t′, t₁, idAir) ∧ A(idAir, name′, t₂) → ∃idC₂, Arr(t₃, idF₀, idC₂) ∧ Co(idC₂, name′, t₄) • Well-formed partitions: {{t₁; t₃}; {t₂; t₄}}, {{t₁; t₄}; {t₂; t₃}}, {{t₁; t₃; t₄}; {t₂}}, {{t₁}; {t₂; t₃; t₄}} and {{t₁; t₂; t₃; t₄}}
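The list of well-formed partitions in Example 8 can be reproduced by brute force. The sketch below assumes, as the example suggests, that a partition is well-formed when every block contains at least one left-hand-side occurrence (t1 or t2). The function names are hypothetical and the exhaustive enumeration is for illustration only, not the exploration strategy of the paper.

```python
def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Either add `first` to one of the existing blocks ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or open a new block for it.
        yield [[first]] + partition

def is_well_formed(partition, lhs_occurrences):
    """Assumed criterion: each block holds a left-hand-side occurrence."""
    return all(any(v in lhs_occurrences for v in block) for block in partition)

def well_formed_partitions(lhs, rhs):
    """Filter all partitions of lhs + rhs occurrences by well-formedness."""
    lhs_set = set(lhs)
    return [p for p in set_partitions(lhs + rhs) if is_well_formed(p, lhs_set)]

partitions = well_formed_partitions(["t1", "t2"], ["t3", "t4"])
```

Of the 15 partitions of the four occurrences, this filter keeps the five listed in Example 8.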

Example 9 • Starting tgd: F(idF₀, t′, t, idAir) ∧ A(idAir, name′, t) → ∃idC₂, Arr(t, idF₀, idC₂) ∧ Co(idC₂, name′, t) • With the occurrences of idAir renamed: F(idF₀, t′, t, idAir₁) ∧ A(idAir₂, name′, t) → ∃idC₂, Arr(t, idF₀, idC₂) ∧ Co(idC₂, name′, t) • The semilattice contains two partitions, {{idAir₁}; {idAir₂}} and {{idAir₁; idAir₂}}.

Example 9 • tgd: F(idF₀, t′, t₁, idAir) ∧ A(idAir, name′, t₂) → ∃idC₂, Arr(t₃, idF₀, idC₂) ∧ Co(idC₂, name′, t₄) • Well-formed partitions: {{t₁; t₃}; {t₂; t₄}}, {{t₁; t₄}; {t₂; t₃}}, {{t₁; t₃; t₄}; {t₂}}, {{t₁}; {t₂; t₃; t₄}} and {{t₁; t₂; t₃; t₄}} • Question: “Are the tuples F(f₀, Miami, L.A.₁, a₁) and A(a₁, AF, L.A.₂) enough to produce Arr(A₁, L.A.₁, f₀, P₂) and Co(P₂, AF, L.A.₂)?” • Answer: Yes. • Resulting tgd: F(idF₀, t′, t₁, idAir) ∧ A(idAir, name′, t₂) → ∃idC₂, Arr(t₁, idF₀, idC₂) ∧ Co(idC₂, name′, t₂)

Example 10 • σ = F(idF₀, t′, t₁, idAir) ∧ A(idAir, name′, t₂) → ∃idC₂, Arr(t₁, idF₀, idC₂) ∧ Co(idC₂, name′, t₂) • Partitions: {{t₁; t₃}; {t₂; t₄}}, {{t₁; t₄}; {t₂; t₃}}, {{t₁; t₃; t₄}; {t₂}}, {{t₁}; {t₂; t₃; t₄}}

Final mapping

EXPERIMENTS • The authors' experimental study has three main objectives: (i) to study the effectiveness of interactivity under different exploration strategies of the search space, (ii) to evaluate the benefit of using exemplar tuples with respect to universal solutions for mapping refinement, and (iii) to provide a comparative analysis with [7].

Experimental settings • The authors implemented their framework using OCaml 4.03 on a 2.6 GHz 4-core, 16 GB laptop running Fedora 24. • They borrowed mappings from seven real integration scenarios of the iBench benchmark [9]. • V denotes the set of distinct variables and N_v the number of occurrences of each variable v within the tgds.

Methodology • In all experiments, the iBench mapping scenarios are taken as the ideal mappings that the user has in mind. • Starting from these mapping scenarios, exemplar tuples are constructed as follows. Each tgd σ ∈ Σ of the form φ → ψ is transformed into a pair of instances (E_S^σ, E_T^σ), where E_S^σ (resp. E_T^σ) is generated by replacing each atom in φ (resp. ψ) by its tuple counterpart, with a freshly picked constant for each variable in the tgd. Thus, for each scenario Σ = {σ₁, …, σₙ}, this yields a set of exemplar tuples E_Σ = {(E_S^σ₁, E_T^σ₁), …, (E_S^σₙ, E_T^σₙ)}.
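The construction of one exemplar pair from a tgd can be sketched as follows. This is a minimal Python illustration, not the authors' OCaml implementation: the function name `tgd_to_exemplar`, the atom encoding as (relation, variable list) pairs, and the constant names c0, c1, ... are all assumptions.

```python
from itertools import count

def tgd_to_exemplar(lhs, rhs):
    """Ground the atoms of a tgd lhs -> rhs into a pair of tuple lists.

    Each atom is a (relation, [variables]) pair. Every distinct variable
    of the tgd receives one fresh constant, shared between both sides.
    """
    fresh = count()
    constants = {}

    def ground(atoms):
        tuples = []
        for relation, variables in atoms:
            values = []
            for v in variables:
                if v not in constants:
                    constants[v] = f"c{next(fresh)}"  # hypothetical naming
                values.append(constants[v])
            tuples.append((relation, tuple(values)))
        return tuples

    return ground(lhs), ground(rhs)

# The tgd obtained in Example 5, used here as input.
lhs = [("F", ["idF0", "t'", "t", "idAir"]), ("A", ["idAir", "name'", "t"])]
rhs = [("Arr", ["t", "idF0", "idC2"]), ("Co", ["idC2", "name'", "t"])]
source_tuples, target_tuples = tgd_to_exemplar(lhs, rhs)
```

Because the variable-to-constant map is shared, joins in the tgd (such as idAir between F and A) reappear as repeated constants in the generated tuples.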

Example 11 • Applying the degradation procedure to the tgd σ may yield the following exemplar tuples.

Impact of Mapping Refinement • In the first experiment, the authors gauge the effectiveness of their interactive approach with four exploration strategies: • BUBF (Bottom-Up Breadth-First) • BUDF (Bottom-Up Depth-First) • TDBF (Top-Down Breadth-First) • TDDF (Top-Down Depth-First)


Impact of Mapping Refinement • They measured the running time of TDBF and BUBF as the sum of the time for lattice exploration between two questions and the time to generate a new question. • In all experiments, this runtime stays below 26 ms per question, with an average of 3.3 ms across all questions.

Benefit of (non-universal) exemplar tuples • The second experiment evaluates the benefit of using exemplar tuples, as opposed to the universal examples adopted in [7], for the mapping inference process.

Relative benefit of interactivity • This experiment quantifies the benefit of interactivity via a comparison with a baseline approach, i.e., one in which refinement steps are disabled. • As a baseline, the authors adopt the canonical GLAV generation performed in EIRENE. • They use the sum of the number of left-hand side atoms of the tgds as the comparison criterion.

RELATED WORK • A pioneering work on the usage of data examples in mapping understanding and refinement [31] relies on Clio’s [25] schema correspondences as specified in a graphical user interface. • In contrast, the authors' method requires only source and target exemplar tuples as inputs, with no prior mapping connecting them. • As in Yan et al. [31], Muse [5] leverages data examples to differentiate between alternative mapping specifications of the designer and drives the mapping design process based on the designer’s actions. • The authors' approach does not assume prior knowledge of the schema constraints.

RELATED WORK • The use of data examples as evaluation tools began in [4, 29], which investigated the possibility of uniquely characterizing a schema mapping by means of a set of data examples. • The only previous work targeting non-expert users is MWeaver [26], where the user is asked to toss tuples in the target instance by fetching constants within the available complete source instance. • Cate et al. [14] show that computational learning can be used to infer mappings from data examples.

RELATED WORK • Besides mapping specification and learning, researchers have investigated the problem of inferring relational queries [2, 1, 24, 13]. • Their goal is to disambiguate a natural language specification of the query, whereas the approach presented here uses raw tuples to guess the unknown mapping that the user has in mind. • [24] presents the exemplar query evaluation paradigm, which relies on exemplar queries to identify a user sample of the desired result of the query and on a similarity function to identify database structures that are similar to the user sample.

CONCLUSIONS • The authors have addressed the problem of interactive schema mapping inference starting from arbitrary sets of exemplar tuples, as provided by non-expert users. • They have shown that simplification of the mappings is possible by alternating normalization and refinement steps. • A direction for future work is to enhance the lattice exploration, for instance by leveraging machine learning methods.