CSE 636 Data Integration Approaches Virtual Integration Architecture

  • Slides: 28
Download presentation
CSE 636 Data Integration Approaches

CSE 636 Data Integration Approaches

Virtual Integration Architecture • Leave the data in the sources • When a query

Virtual Integration Architecture • Leave the data in the sources • When a query comes in: – Determine the relevant sources to the query – Break down the query into sub-queries for the sources – Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration 2

Virtual Integration Architecture Run-Time Design-Time Mapping Tool Mediation Language Mediator Web Services 1 Query

Virtual Integration Architecture Run-Time Design-Time Mapping Tool Mediation Language Mediator Web Services 1 Query Reformulation Query Result End User Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 3

Virtual Integration Architecture Run-Time Design-Time Mapping Tool Mediation Language Mediator 2 Web Services 1

Virtual Integration Architecture Run-Time Design-Time Mapping Tool Mediation Language Mediator 2 Web Services 1 Query Reformulation Query Result End User Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 4

Virtual Integration Architecture Run-Time Design-Time Mapping Tool 3 Mediation Language Mediator 2 Web Services

Virtual Integration Architecture Run-Time Design-Time Mapping Tool 3 Mediation Language Mediator 2 Web Services 1 Query Reformulation Query Result End User Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 5

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Mediation Language Mediator 2 Web

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Mediation Language Mediator 2 Web Services 1 Query Reformulation Query Result End User Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 6

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Mediation Language Mediator 2 Web

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Mediation Language Mediator 2 Web Services 1 5 Query Result End User Global Schema Query Reformulation Optimization & Execution XML Wrapper Local Data Schema Source 7

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Query Result End User Mediation

Virtual Integration Architecture Run-Time Design-Time 4 Mapping Tool 3 Query Result End User Mediation Language Mediator 2 Web Services 1 Global Schema 5 Query Reformulation 6 Optimization & Execution XML Wrapper Local Data Schema Source 8

Virtual Integration Approaches Dimensions to Consider: • How many sources are we accessing? •

Virtual Integration Approaches Dimensions to Consider: • How many sources are we accessing? • How autonomous are they? • Meta-data about sources? • Is the data structured? • Queries or also updates? • Requirements: accuracy, completeness, performance, handling inconsistencies. • Closed world assumption vs. open world? 9

Mediation Languages Global Schema CD ASIN Title Genre … Artist ASIN Name … Logic

Mediation Languages Global Schema CD ASIN Title Genre … Artist ASIN Name … Logic CDs Books CDCategories Book. Categories Album ASIN Price Discount. Price Studio ASIN Category Title ISBN Price Discount. Price Edition ISBN Category Authors ISBN First. Name Last. Name Artists ASIN Artist. Name Group. Name 10

Desiderata from Source Descriptions • Expressive power: distinguish between sources with closely related data.

Desiderata from Source Descriptions • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively. 11

Reformulation Problem Given: • A query Q posed over the global schema • Descriptions

Reformulation Problem Given: • A query Q posed over the global schema • Descriptions of the data sources Find: • A query Q’ over the data source relations, such that: – Q’ provides only correct answers to Q, and – Q’ provides all possible answers from to Q given the sources. 12

Languages for Schema Mapping Q Global Mediator. Schema Mediated GAV LAV Q’ Q’ GLAV

Languages for Schema Mapping Q Global Mediator. Schema Mediated GAV LAV Q’ Q’ GLAV Q’ Q’ Q’ Source Source Local Schema Local Schema 13

Global-as-View (GAV) Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View: Create

Global-as-View (GAV) Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View: Create View Movie AS SELECT * FROM S 1 [S 1(title, dir, year, genre)] union SELECT * FROM S 2 [S 2(title, dir, year, genre)] union SELECT S 3. title, S 3. dir, S 4. year, S 4. genre FROM S 3, S 4 [S 3(title, dir), WHERE S 3. title = S 4. title S 4(title, year, genre)] 14

Global-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View:

Global-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating View: Create View Movie AS SELECT title, dir, year, NULL FROM S 1 [S 1(title, dir, year)] union SELECT title, dir, NULL, genre FROM S 2 [S 2(title, dir, genre)] 15

Global-as-View: Example 3 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating Views:

Global-as-View: Example 3 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Integrating Views: Create View Movie AS SELECT NULL, genre FROM S 4 [S 4(cinema, genre)] Create View Schedule AS SELECT cinema, NULL FROM S 4 [S 4(cinema, genre)] But what if we want to find which cinemas are playing comedies? 16

Global-as-View Summary • Query reformulation boils down to view unfolding. • Very easy conceptually.

Global-as-View Summary • Query reformulation boils down to view unfolding. • Very easy conceptually. • Can build hierarchies of global schemas. • You sometimes loose information. Not always natural. • Adding sources is hard. Need to consider all other sources that are available. 17

Local-as-View (LAV) Create SELECT FROM WHERE AND View R 1 AS B. ISBN, B.

Local-as-View (LAV) Create SELECT FROM WHERE AND View R 1 AS B. ISBN, B. Title, A. Name Book B, Author A A. ISBN = B. ISBN B. Year < 1970 Mediator Create View R 5 AS SELECT B. ISBN, B. Title FROM Book B WHERE B. Genre = ‘Humor’ Global Schema Books before 1970 ISBN Title Genre Year Author Humor Books ISBN Name Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema R 1 ISBN Title Name R 5 ISBN Title 18

Query Reformulation Query: Find authors of humor books Plan: R 1 Join R 5

Query Reformulation Query: Find authors of humor books Plan: R 1 Join R 5 Mediator Global Schema Books before 1970 ISBN Title Genre Year Author Humor Books ISBN Name Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema R 1 ISBN Title Name R 5 ISBN Title 19

Query Reformulation Query: Find authors of humor books before 1960 Plan: Can’t do it!

Query Reformulation Query: Find authors of humor books before 1960 Plan: Can’t do it! Mediator Global Schema Books before 1970 ISBN Title Genre Year Author Humor Books ISBN Name Mediated Schema Source 1 Source 2 Source 3 Source 4 Source 5 Local Schema Local Schema R 1 ISBN Title Name R 5 ISBN Title 20

Local-as-View: Example 1 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views:

Local-as-View: Example 1 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views: Create Source S 1 AS [S 1(title, dir, year, genre)] SELECT * FROM Movie Create Source S 3 AS [S 3(title, dir)] SELECT title, dir FROM Movie Create Source S 5 AS [S 5(title, dir, year)] SELECT title, dir, year FROM Movie WHERE year > 1960 AND genre=‘Comedy’ 21

Local-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views:

Local-as-View: Example 2 Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views: Create Source S 4 [S 4(cinema, genre)] SELECT cinema, genre FROM Movie M, Schedule S WHERE M. title=S. title Now if we want to find which cinemas are playing comedies, there is hope! 22

Local-as-View Summary • Very flexible. You have the power of the entire query language

Local-as-View Summary • Very flexible. You have the power of the entire query language to define the contents of the source. • Hence, can easily distinguish between contents of closely related sources. • Adding sources is easy: they’re independent of each other. • Query reformulation: answering queries using views! 23

The General Problem • Given a set of views V 1, …, Vn, and

The General Problem • Given a set of views V 1, …, Vn, and a query Q, can we answer Q using only the answers to V 1, …, Vn? • Many, many papers on this problem • The best performing algorithm: The Mini. Con. Algorithm (Pottinger & Halevy, VLDB 2000) 24

Local Completeness Information • If sources are incomplete, we need to look at each

Local Completeness Information • If sources are incomplete, we need to look at each one of them. • Often, sources are locally complete. • Movie(title, director, year) complete for years after 1960, or for American directors. • Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? 25

Example • Movie(title, director, year) – complete after 1960 • Show(title, theater, city, hour)

Example • Movie(title, director, year) – complete after 1960 • Show(title, theater, city, hour) • Query: find movies (and directors) playing in Seattle: SELECT M. title, M. director FROM Movie M, Show S WHERE M. title=S. title AND city=‘Seattle’ • Complete or not? 26

Example #2 • Movie(title, director, year), Oscar(title, year) • Query: find directors whose movies

Example #2 • Movie(title, director, year), Oscar(title, year) • Query: find directors whose movies won Oscars after 1965: SELECT M. director FROM Movie M, Oscar O WHERE M. title=O. title AND M. year=O. year AND O. year > 1965 • Complete or not? 27

References • Information integration – Maurizio Lenzerini – Eighteenth International Joint Conference on Artificial

References • Information integration – Maurizio Lenzerini – Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 – Invited Tutorial • Data Integration: a Status Report – Alon Halevy – German Database Conference (BTW), 2003 – Invited Talk 28