  • Slides: 48
Information Integration
Mediators
Semistructured Data
Answering Queries Using Views

Importance of Information Integration
• Very many modern DB applications involve combining databases.
• Sometimes a “database” is not stored in a DBMS: it could be a spreadsheet, flat file, XML document, etc.

Example Applications
1. Enterprise information integration: making separate DBs, all owned by one company, work together.
2. Scientific DBs, e.g., genome DBs.
3. Catalog integration: combining product information from all your suppliers.
4. Etc.

Challenges
1. Legacy databases: DBs get used for many applications.
   • You can't change a DB's structure for the sake of one application, because that will cause others to break.
2. Incompatibilities: two supposedly similar databases will mismatch in many ways.

Examples: Incompatibilities
• Lexical: addr in one DB is address in another.
• Value mismatches: is a “red” car the same color in each DB? Is 20 degrees Fahrenheit or Centigrade?
• Semantic: are “employees” in each database the same? What about consultants? Retirees? Contractors?

What Do You Do About It?
• Grubby, handwritten translation at each interface.
   ◦ Some research on automatic inference of relationships.
• Wrapper (aka “adapter”) translates incoming queries and outgoing answers.

Integration Architectures
1. Federation: everybody talks directly to everyone else.
2. Warehouse: sources are translated from their local schema to a global schema and copied to a central DB.
3. Mediator: a virtual warehouse that turns a user query into a sequence of source queries.

Federations
[Diagram: federation architecture, with wrappers on the links between sources.]

Warehouse Diagram
[Diagram: Source 1 and Source 2 feed the warehouse through wrappers.]

A Mediator
[Diagram: a user query goes to the mediator, which sends queries through wrappers to Source 1 and Source 2. Results flow back along the same path to the user.]

Two Mediation Approaches
1. Query-centric: the mediator processes queries into steps executed at sources.
2. View-centric: sources are defined in terms of global relations; the mediator finds all ways to build the query from views.

Example
• Suppose Dell wants to buy a bus and a disk that share the same protocol.
• Global schema:
    Buses(manf, model, protocol)
    Disks(manf, model, protocol)
• Local schemas: each bus or disk manufacturer has a (model, protocol) relation; manf is implied.

Example: Query-Centric
• The mediator might start by querying each bus manufacturer for model-protocol pairs.
   ◦ The wrapper would turn them into triples by adding the manf component.
• Then, for each protocol returned, the mediator queries disk manufacturers for disks with that protocol.
   ◦ Again, the wrapper adds the manf component.
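
The two query-centric steps above can be sketched in Python. This is only an illustration under stated assumptions: the source names other than Quantum, the model names, the sample rows, and the helpers `wrap`, `query_source`, and `bus_disk_pairs` are all hypothetical, and real sources would be remote systems rather than in-memory lists.

```python
# Hypothetical local sources: each manufacturer stores only (model, protocol) pairs.
BUS_SOURCES = {
    "AcmeBus": [("b100", "scsi")],
    "ZedBus": [("z9", "ide")],
}
DISK_SOURCES = {
    "Quantum": [("q1", "scsi"), ("q2", "usb")],
    "DiskCo": [("d7", "ide")],
}

def wrap(manf, rows):
    """Wrapper: add the implied manf component to each local pair."""
    return [(manf, model, protocol) for model, protocol in rows]

def query_source(sources, protocol=None):
    """Ask every source for its pairs, optionally restricted to one protocol."""
    triples = []
    for manf, rows in sources.items():
        selected = [r for r in rows if protocol is None or r[1] == protocol]
        triples.extend(wrap(manf, selected))
    return triples

def bus_disk_pairs():
    """Query-centric plan: fetch buses first, then disks for each protocol seen."""
    buses = query_source(BUS_SOURCES)
    pairs = []
    for protocol in sorted({p for _, _, p in buses}):
        disks = query_source(DISK_SOURCES, protocol)
        pairs += [(b, d) for b in buses if b[2] == protocol for d in disks]
    return pairs

pairs = bus_disk_pairs()
for bus, disk in pairs:
    print(bus, disk)
```

The plan is hard-wired into `bus_disk_pairs`, which is what makes this approach easy to control but tied to one query shape.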

Example: View-Centric
• Sources' capabilities are defined in terms of the global predicates.
   ◦ E.g., Quantum's disk database could be defined by QuantumView(M, P) = Disks('Quantum', M, P).
• The mediator discovers all combinations of a bus “view” and a disk “view,” equijoined on the protocol components.

Comparison
• Query-centric is simpler to implement.
   ◦ Lets you have control of what the mediator does.
• View-centric is more extensible.
   ◦ The same query engine works for any number of sources.
   ◦ Add a new source simply by defining what it contributes as a view of the global schema.

Semistructured Data
• A data model that is suited for integrating (slightly) incompatible sources.
• Based on labeled graphs.
• Key attribute: flexibility. There is no schema; sources do not all need to have the same attributes.

Semistructured Data (2)
• Use semistructured data in place of the global schema.
   ◦ It is easier to translate sources with varying local schemas into one flexible schema.
• XML and its attendant standards (XSL, XQuery, etc.) are really an implementation of semistructured data.

Example: Semistructured Data
[Diagram: a labeled graph starting at a root node, with arcs labeled beer and bar. Other arc labels include name, manf, servedAt, addr, prize, year, and award; leaf values include Joe's, Maple St., Bud, A.B., M'lob, 1995, and Gold. Callouts mark “the bar object for Joe's Bar,” “the beer object for Bud,” and “notice unusual data.”]

XML and Semistructured Data
• XML (Extensible Markup Language) uses a semistructured data model to represent documents.
Example:
    <BARDOC>
      <BAR>
        <NAME>Joe's</NAME>
        <ADDR>Maple St.</ADDR>
      </BAR>
      <BAR> …
    </BARDOC>

Semistructured Data and Logic
• You can represent a semistructured data graph (or XML document) as relations or predicates:
   ◦ arcs(From, To, Label)
   ◦ data(Node, Value)
• But queries about paths in the graph become complex joins.
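
A small Python sketch may make the arcs/data encoding and the join cost of path queries concrete. The node identifiers (`bar1`, `n1`, `a1`) and the helpers `follow` and `path_values` are assumptions made for illustration; only the relation shapes arcs(From, To, Label) and data(Node, Value) come from the slide.

```python
# arcs(From, To, Label) and data(Node, Value) for a tiny bar document.
arcs = [
    ("root", "bar1", "bar"),
    ("bar1", "n1", "name"),
    ("bar1", "a1", "addr"),
]
data = {"n1": "Joe's", "a1": "Maple St."}

def follow(nodes, label):
    """One join step: nodes reachable from `nodes` over an arc with this label."""
    return {to for (frm, to, lab) in arcs if frm in nodes and lab == label}

def path_values(start, labels):
    """Evaluate a label path; each label costs one more join with arcs."""
    nodes = {start}
    for label in labels:
        nodes = follow(nodes, label)
    return sorted(data[n] for n in nodes if n in data)

print(path_values("root", ["bar", "addr"]))
```

A path with k labels needs k joins of `arcs` with itself plus a final join with `data`, which is why long path queries get expensive in this encoding.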

More Likely Alternative
• Store XML documents as strings, either independent or as components of tuples.
• But the problem of integrating them into a sensible whole remains.
• So does the problem of deciding the best way to answer a query.

View-Centric Mediation
• Key assumptions:
1. There is a set of global predicates that define the schema.
   • These do not exist as stored relations.
2. Each data source has its capabilities defined by views, which are (typically) conjunctive queries (CQs) whose subgoals involve the global predicates.

Assumptions (Continued)
3. A query is (typically) a CQ over the global predicates.
4. A solution is an expression (typically a union of CQs) involving the views.
   ◦ Ideally, the solution is equivalent to the query.
   ◦ In practice, we have to be happy with a solution maximally contained in the query.

Interpretation of Views
• A view describes (some of) the facts that are available at the source.
• A view does not define exactly what is at the source.
   ◦ Example: a view v(X) :- p(X, 10) says that the source has some p-facts with second component 10. The view v could even be empty although p(X, 10) is not.

Put Another Way …
• The :- separator between head and body of a view definition should not be interpreted as “if.”
• Rather, it is “only if.”
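
The “only if” reading can be illustrated with a short Python check for the view v(X) :- p(X, 10). The sample p-facts and the helper `is_sound` are assumptions made for this sketch; the global p relation is written out only so the check can run, since the mediator never stores it.

```python
# Hypothetical global facts for predicate p (not actually stored anywhere).
p_facts = {("a", 10), ("b", 10), ("c", 7)}

def is_sound(view_tuples):
    """True if every tuple (x,) in the view satisfies the body p(x, 10)."""
    return all((x, 10) in p_facts for (x,) in view_tuples)

print(is_sound({("a",)}))   # holds some of the matching facts
print(is_sound(set()))      # an empty view
print(is_sound({("c",)}))   # c has no p(c, 10) fact
```

The first two source contents are consistent with the view definition even though neither holds every p(X, 10) fact; the third is not, because it contains a tuple the body does not justify.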

Example
• Global predicates:
    emp(E) = “E is an employee.”
    phone(E, P) = “P is a phone of E.”
    office(E, O) = “O is an office of E.”
    mgr(E, M) = “M is E's manager.”
    dept(E, D) = “D is E's department.”

Example (Continued)
• Three sources each provide one view. At source S1, view v1(E, P, M) is defined by:
    v1(E, P, M) :- emp(E) & phone(E, P) & mgr(E, M)
   ◦ Interpretation: “every triple (e, p, m) at S1 is an employee, one of their phones, and their manager.”
   ◦ It does not say “S1 has all E-P-M facts.”

Example: Sources S2 and S3
• At S2:
    v2(E, O, D) :- emp(E) & office(E, O) & dept(E, D)
   ◦ S2 has (some of the) employee-office-department facts.
• At S3:
    v3(E, P) :- emp(E) & phone(E, P) & dept(E, 'toy')
   ◦ S3 has (some) toy-department phones.

Example: A Query
    q1(P, O) :- phone('sally', P) & office('sally', O)
   ◦ Find Sally's office and phone.
• There are two useful solutions:
    s1(P, O) :- v1('sally', P, M) & v2('sally', O, D)
    s2(P, O) :- v3('sally', P) & v2('sally', O, D)

What Makes a Solution S Useful?
1. There must be no other solution containing S.
2. S, when expanded from views into global predicates, is contained in the query.

Expanding Views
• Suppose we have a subgoal v(X, Y) in a solution, and v is defined by:
    v(A, B) :- p(A, X) & q(X, B)
1. Pick fresh, unique names for the local variables of the view (those that appear only in the body).
2. Substitute the variables of the subgoal for the variables of the head.
3. Replace the subgoal with the resulting body.

Example
    v(A, B) :- p(A, X) & q(X, B)
becomes:
    v(A, B) :- p(A, X1) & q(X1, B)
Then substituting A->X, B->Y yields the body:
    p(X, X1) & q(X1, Y)
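
The expansion steps can be sketched in Python as follows. The representation is an assumption made for illustration: an atom is a (predicate, arguments) pair, variables are strings that start with an uppercase letter, and fresh local variables get names such as `_X1` (the slide writes X1) so they cannot clash with variables already in the solution. The function `expand` is hypothetical, not part of any library.

```python
import itertools

def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def expand(solution_body, views):
    """Replace each view subgoal by the body of its definition."""
    fresh = itertools.count(1)
    expansion = []
    for name, args in solution_body:
        head_vars, body = views[name]
        # Step 2: head variable -> argument of the subgoal.
        subst = dict(zip(head_vars, args))
        for pred, pargs in body:
            new_args = []
            for a in pargs:
                if is_var(a) and a not in subst:
                    # Step 1: a fresh name for a local (body-only) variable.
                    subst[a] = f"_{a}{next(fresh)}"
                new_args.append(subst.get(a, a))
            # Step 3: the rewritten body atom joins the expansion.
            expansion.append((pred, tuple(new_args)))
    return expansion

views = {"v": (("A", "B"), [("p", ("A", "X")), ("q", ("X", "B"))])}
print(expand([("v", ("X", "Y"))], views))
```

The substitution is rebuilt for every subgoal, while the counter is shared, so two uses of the same view get different local variables.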

Important Points
• To test containment of a solution in a query, we expand the solution first, then test CQ containment of the expansion in the query.
• The view definition describes what any tuples of the view look like, so CQ containment implies that the solution will provide only true answers.

The Picture
    Query: q(X, Y) :- p(X, Z) & …
    Soln:  q(A, B) :- v(A, C, D) & w(B, E) & …
    Exp:   q(A, B) :- p(A, U) & … & r(B, V) & …
Is there a containment mapping from the query to the expansion?

Important Points (2)
• There is no guarantee a solution supplies any answers to the query.
• Comparing different solutions by testing if one solution is contained in another must be done at the level of the unexpanded views.

Example
• Two sources might have similar views, defined by:
    v1(X, Y) :- p(X, Y)
    v2(X, Y) :- p(X, Y)
• But the sources actually have different sets of p-facts.

Example (Continued)
• Then, the two solutions:
    s1(X, Y) :- v1(X, Y)
    s2(X, Y) :- v2(X, Y)
  have the same expansion, p(X, Y), but there is no reason to believe one solution is contained in the other.
   ◦ One view could provide lots of p-facts, the other few or none.

Important Points (3)
• On the other hand, when one solution, unexpanded, is contained in another, we can be sure the first provides no answers the second does not.

Example
• Here are two solutions:
    s1(X, Y) :- v1(X, Z) & v2(Z, Y)
    s2(X, Y) :- v1(X, Z) & v2(W, Y)
• There is a containment mapping s2 -> s1.
   ◦ Thus, s1 ⊆ s2 at the level of views.
• No matter what tuples v1 and v2 represent, s2 provides all answers s1 provides.
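
To see the containment on data, the following Python sketch evaluates both solutions over one made-up instance of v1 and v2. The sample tuples and the `evaluate` helper (a naive nested-loop join for conjunctive queries) are assumptions for illustration. A single instance only illustrates the claim; it is the containment mapping that establishes it for every instance.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def evaluate(head, body, db):
    """Naive nested-loop evaluation of a conjunctive query over db."""
    bindings = [{}]
    for pred, args in body:
        extended = []
        for b in bindings:
            for row in db.get(pred, ()):
                nb = dict(b)
                ok = True
                for a, val in zip(args, row):
                    if is_var(a):
                        # Bind the variable, or check an earlier binding.
                        if nb.setdefault(a, val) != val:
                            ok = False
                            break
                    elif a != val:
                        ok = False
                        break
                if ok:
                    extended.append(nb)
        bindings = extended
    return {tuple(b[v] for v in head) for b in bindings}

# Hypothetical contents of the two views.
db = {"v1": {("a", "b"), ("c", "d")}, "v2": {("b", "e"), ("x", "f")}}

s1 = evaluate(("X", "Y"), [("v1", ("X", "Z")), ("v2", ("Z", "Y"))], db)
s2 = evaluate(("X", "Y"), [("v1", ("X", "Z")), ("v2", ("W", "Y"))], db)
print(sorted(s1))
print(sorted(s2))
print(s1 <= s2)
```

Because s2 drops the join on Z, it pairs every first component of v1 with every second component of v2, so it can only return more tuples than s1.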

The Office Example
    q1(P, O) :- phone('sally', P) & office('sally', O)
    v1(E, P, M) :- emp(E) & phone(E, P) & mgr(E, M)
    v2(E, O, D) :- emp(E) & office(E, O) & dept(E, D)
    v3(E, P) :- emp(E) & phone(E, P) & dept(E, 'toy')

Office Example: Solutions
    s1(P, O) :- v1('sally', P, M) & v2('sally', O, D)
    s2(P, O) :- v3('sally', P) & v2('sally', O, D)

Expansion of s1
    e1(P, O) :- emp('sally') & phone('sally', P) & mgr('sally', M) &
                emp('sally') & office('sally', O) & dept('sally', D)
    q1(P, O) :- phone('sally', P) & office('sally', O)
Containment mapping q1 -> e1
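
A short Python sketch can check that the identity mapping on P and O is a containment mapping from q1 to e1. The atom representation (predicate plus argument tuple, variables uppercase, constants lowercase) and the helpers `apply_mapping` and `is_containment_mapping` are assumptions made for illustration. This only verifies a given mapping; it does not search for one.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def apply_mapping(mapping, atom):
    """Rewrite the variables of an atom; constants stay as they are."""
    pred, args = atom
    return (pred, tuple(mapping.get(a, a) if is_var(a) else a for a in args))

def is_containment_mapping(mapping, q_head, q_body, e_head, e_body):
    """Heads must match, and every query subgoal must land on an expansion subgoal."""
    head_ok = tuple(mapping.get(v, v) for v in q_head) == tuple(e_head)
    body_ok = all(apply_mapping(mapping, atom) in e_body for atom in q_body)
    return head_ok and body_ok

q1_head = ("P", "O")
q1_body = [("phone", ("sally", "P")), ("office", ("sally", "O"))]
e1_head = ("P", "O")
e1_body = [
    ("emp", ("sally",)), ("phone", ("sally", "P")), ("mgr", ("sally", "M")),
    ("emp", ("sally",)), ("office", ("sally", "O")), ("dept", ("sally", "D")),
]

identity = {"P": "P", "O": "O"}
swapped = {"P": "O", "O": "P"}
print(is_containment_mapping(identity, q1_head, q1_body, e1_head, e1_body))
print(is_containment_mapping(swapped, q1_head, q1_body, e1_head, e1_body))
```

The swapped mapping fails on the head and on both subgoals, which shows that the check is not vacuous.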

Office Example: Concluded
• The mapping from q1 to the expansion of s2 is similar.
• Notice we have used the head predicate to name the solution, expansion, etc.
   ◦ Technically, head predicates have to be the same, but that's not a problem here.
• The expansions are properly contained in the query, not equivalent to it.

Finding All Solutions to a Query
• Key idea: the LMSS (Levy-Mendelzon-Sagiv-Srivastava) test.
• If a query has n subgoals, then we only need to consider solutions with at most n subgoals.
   ◦ Any other solution must be contained in one with at most n subgoals.

Proof of LMSS Theorem
• Suppose the query has n subgoals, and a solution S has more than n subgoals.
• Look at the expansion diagram again: at least one subgoal (view) in the solution has an expansion to which no query subgoal maps.

Expansion Diagram
    Query: q(X, Y) :- p(X, Z) & …                 (n of these subgoals)
    Soln:  q(A, B) :- v(A, C, D) & w(B, E) & …    (more than n of these subgoals)
    Exp:   q(A, B) :- p(A, U) & … & r(B, V) & …

Proof (Continued)
• Consider the new solution S′, which removes from S every subgoal whose expansion is not a target of the containment mapping (CM) from the query.
• Clearly S ⊆ S′.
   ◦ In general, throwing away subgoals grows the result of the CQ.
• But S′ has at most n subgoals.

Example
• In our running “office” example, we can immediately conclude that the solution
    s3(P, O) :- v1('sally', P, M) & v2('sally', O, D) & v3(E, P)
  is not minimal.
   ◦ It has more subgoals than the query.
   ◦ In fact, it is contained in s1.
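
The pruning step from the LMSS proof can be sketched in Python for s3. Everything here is an assumption made for illustration: the expansion of s3 is written out by hand, each expansion atom is tagged with the index of the solution subgoal it came from, and `find_mapping` is a simple backtracking search rather than the authors' procedure. Variables of the expansion are treated as fixed targets, and the head variables P and O are pinned to themselves so the heads agree.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def find_mapping(query_body, tagged, mapping, used=()):
    """Search for a containment mapping; also return the subgoal indices it hits."""
    if not query_body:
        return mapping, set(used)
    (pred, args), rest = query_body[0], query_body[1:]
    for origin, (epred, eargs) in tagged:
        if epred != pred or len(eargs) != len(args):
            continue
        trial = dict(mapping)
        if all((trial.setdefault(a, e) == e) if is_var(a) else (a == e)
               for a, e in zip(args, eargs)):
            found = find_mapping(rest, tagged, trial, used + (origin,))
            if found is not None:
                return found
    return None

s3 = [("v1", ("sally", "P", "M")), ("v2", ("sally", "O", "D")), ("v3", ("E", "P"))]
# Hand-written expansion of s3, tagged with the index of the originating subgoal.
tagged_expansion = [
    (0, ("emp", ("sally",))), (0, ("phone", ("sally", "P"))), (0, ("mgr", ("sally", "M"))),
    (1, ("emp", ("sally",))), (1, ("office", ("sally", "O"))), (1, ("dept", ("sally", "D"))),
    (2, ("emp", ("E",))), (2, ("phone", ("E", "P"))), (2, ("dept", ("E", "toy"))),
]
q1_body = [("phone", ("sally", "P")), ("office", ("sally", "O"))]

mapping, used = find_mapping(q1_body, tagged_expansion, {"P": "P", "O": "O"})
pruned = [sub for i, sub in enumerate(s3) if i in used]
print(pruned)
```

The mapping touches only the expansions of the first two subgoals, so the v3 subgoal is dropped and the pruned solution has the same body as s1.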