CSE 636 Data Integration Overview Data Warehouse Architecture

  • Slides: 22
Download presentation
CSE 636 Data Integration Overview

CSE 636 Data Integration Overview

Data Warehouse Architecture Users Applications Relational Database (Warehouse) OLAP / Decision Support Data Cubes

Data Warehouse Architecture Users Applications Relational Database (Warehouse) OLAP / Decision Support Data Cubes / Data Mining ETL Tools (Extract-Transform-Load) Data Cleaning Data Source 2

Virtual Integration Architecture • Leave the data in the sources • When a query

Virtual Integration Architecture • Leave the data in the sources • When a query comes in: – Determine the relevant sources to the query – Break down the query into sub-queries for the sources – Get the answers from the sources, filter them if needed and combine them appropriately • Data is fresh • Otherwise known as On Demand Integration 3

Virtual Integration Architecture Design-Time End Users Applications Global Schema Mappings Local Data Schema Sources

Virtual Integration Architecture Design-Time End Users Applications Global Schema Mappings Local Data Schema Sources can be: • Relational DBs • Excel Files • Web Sites • Web Services Local Data Schema Source 4

Schema Mappings • Differences in: – Names in schema – Attribute grouping Inventory Database

Schema Mappings • Differences in: – Names in schema – Attribute grouping Inventory Database A Inventory Database B Books Title ISBN Price Discount. Price Edition ISBN First. Name Last. Name Book. Categories ISBN Category Books. And. Music Title Author Publisher Item. ID Item. Type Suggested. Price Categories Keywords Authors CDCategories ASIN Category CDs Album ASIN Price Discount. Price Studio – Coverage of databases – Granularity and format of attributes Artists ASIN Artist. Name Group. Name 5

Issues for Schema Mappings Design-Time End Users Applications Global Schema Mappings Local Data Schema

Issues for Schema Mappings Design-Time End Users Applications Global Schema Mappings Local Data Schema Source • What formalisms to express them? • How to create them? • Can we discover them somehow? • How do we use them? Local Data Schema Source 6

Virtual Integration Architecture Run-Time Query Result Mediator Reformulation Optimization Global Schema Execution Wrapper Local

Virtual Integration Architecture Run-Time Query Result Mediator Reformulation Optimization Global Schema Execution Wrapper Local Data Schema Source 7

Issues for Query Processing Reformulation Query Mediator Reformulation Global Schema Local Data Schema Source

Issues for Query Processing Reformulation Query Mediator Reformulation Global Schema Local Data Schema Source • User queries refer to the global schema • Data is stored in the sources in a local schema • Rewriting algorithms Local Data Schema Source 8

Issues for Query Processing Reformulation Global Schema Books Title ISBN Price Discount. Price Edition

Issues for Query Processing Reformulation Global Schema Books Title ISBN Price Discount. Price Edition SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Local Schema A Books. And. Music Title Author Publisher Item. ID Item. Type Suggested. Price Categories Keywords SELECT Item. ID, Suggested. Price FROM Books. And. Music WHERE Title = ‘on the road’ AND Item. Type = ‘Books’ 9

Issues for Query Processing Query Translation Query • Different query languages Mediator Reformulation Optimization

Issues for Query Processing Query Translation Query • Different query languages Mediator Reformulation Optimization Global Schema Execution Wrapper Local Data Schema Source 10

Issues for Query Processing Query Translation Global Schema Books Title ISBN Price Discount. Price

Issues for Query Processing Query Translation Global Schema Books Title ISBN Price Discount. Price Edition SELECT ISBN, Price FROM Books WHERE Title = ‘on the road’ Local Source A http: //www. amazon. com/homepage. html? Item. Type=Books&Title=on+the+road 11

Issues for Query Processing Data Translation Query • Different data models Mediator Reformulation Optimization

Issues for Query Processing Data Translation Query • Different data models Mediator Reformulation Optimization Global Schema Execution Wrapper Local Data Schema Source 12

Issues for Query Processing Data Translation Global Schema Books Title ISBN Price Discount. Price

Issues for Query Processing Data Translation Global Schema Books Title ISBN Price Discount. Price Edition Title ISBN Price On the Road 123 10. 86 Local Result A … … <table> <tr> <td> <a href=/details? isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <a href=/details? isbn=123> Buy new </a> : <b class=price>$10. 86</b> </td> </tr> </table> 13

Issues for Query Processing Query Execution Query Mediator Reformulation Optimization Global Schema Execution Wrapper

Issues for Query Processing Query Execution Query Mediator Reformulation Optimization Global Schema Execution Wrapper Local Data Schema Source • Access as many data sources as needed • Duplicate/redundant and irrelevant data • Limited query capabilities Wrapper Local Data Schema Source 14

Issues for Query Processing Limited Query Capabilities Global Schema Books Title E ISBN Price

Issues for Query Processing Limited Query Capabilities Global Schema Books Title E ISBN Price 123 10. 86 Discount. Price Edition SELECT ISBN, Price, Discount. Price FROM Books WHERE Title = ‘on the road’ Discount. Price 8. 86 Local Schema B Local Schema A Discount. Books. And. Music Title B Item. ID Author Suggested. Price Item. ID 123 Item. Type 10. 86 Suggested. Price A SELECT Item. ID, Suggested. Price FROM Books. And. Music WHERE Title = ‘on ? the road’ Title D Great. Price Edition ISBN 8. 86 Great. Price C SELECT Great. Price FROM Discount. Books WHERE ISBN = 123 ? 15

Issues for Query Processing Query Answering Query Result Mediator Reformulation Optimization Global Schema Execution

Issues for Query Processing Query Answering Query Result Mediator Reformulation Optimization Global Schema Execution Wrapper Local Data Schema Source • Combine the results and further process them if needed • Mainly union and merge • Inconsistencies Wrapper Local Data Schema Source 16

Issues for Query Processing Query Answering (Union) ISBN Price 123 10. 86 456 8.

Issues for Query Processing Query Answering (Union) ISBN Price 123 10. 86 456 8. 86 Item. ID Suggested. Price ISBN Great. Price 123 10. 86 456 8. 86 17

Issues for Query Processing Query Answering (Merge) Primary Key ISBN Title Edition Price 123

Issues for Query Processing Query Answering (Merge) Primary Key ISBN Title Edition Price 123 On the Road 2 nd 8. 86 Item. ID Title ISBN Edition Price 123 On the Road 123 2 nd 8. 86 Primary Key 18

Issues for Query Processing Query Answering (Inconsistencies) Primary Key ISBN Title Edition Price 123

Issues for Query Processing Query Answering (Inconsistencies) Primary Key ISBN Title Edition Price 123 On the Road ? ? ? 8. 86 Item. ID Title Edition ISBN Edition Price 123 On the Road 1 st 123 2 nd 8. 86 Primary Key 19

Peer-Based Integration Query Peer 4 Query Peer 5 Peer 2 Peer 1 Peer 3

Peer-Based Integration Query Peer 4 Query Peer 5 Peer 2 Peer 1 Peer 3 21

Peer-Based Integration • • No need for a central mediated schema Peers serve as

Peer-Based Integration • • No need for a central mediated schema Peers serve as mediators for other peers A peer can be both a server and a client Semantic relationships are specified locally (between small sets of peers) • Queries are posed using the peer’s schema • Answers come from anywhere in the system • This is not P 2 P file sharing. – Data has rich semantics 22

References • Information integration – Maurizio Lenzerini – Eighteenth International Joint Conference on Artificial

References • Information integration – Maurizio Lenzerini – Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003 – Invited Tutorial • Data Integration: a Status Report – Alon Halevy – German Database Conference (BTW), 2003 – Invited Talk 23