Data Integration: Achievements and Perspectives in the Last Ten Years
AiJing

Outline
• Motivation & Background
• Best Paper: Information Manifold
• Building on the Foundation
• Data Integration Industry
• Future Challenges
• Conclusion

Motivation & Background
• Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources.
• Data integration is crucial in large enterprises that own a multitude of data sources.
• It also supports better cooperation among agencies, each with their own data sources.

Data Integration
[Diagram: data integration spanning enterprise databases, legacy databases, and services and applications]

Ten-Year Best Paper
Querying Heterogeneous Information Sources Using Source Descriptions. VLDB 96
• Alon Halevy: a principal member of technical staff at AT&T Bell Laboratories, and then at AT&T Laboratories
• Main idea: the Information Manifold
• Led to tremendous progress on data integration and to quite a few commercial data integration products

The Information Manifold
• An implemented data integration system
• Goal: provide a uniform query interface to a heterogeneous collection of Web data sources
• Main contribution: the way it described the contents of the data sources it knew about
• IM contains declarative descriptions of the contents and capabilities of the information sources (source descriptions)

An example of a complex query
Find reviews of movies directed by Woody Allen that are playing in my area. Answering it requires joining three Web sites:
1. a movie site containing actor and director information (IMDB)
2. movie playing sources (e.g., 777film.com)
3. movie review sites (e.g., a newspaper)
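
The three-way join on this slide can be sketched with in-memory lists standing in for what wrappers over the three sites might return. All rows, titles, and the function name below are invented for illustration; nothing here reflects the actual sites or the Information Manifold's implementation.

```python
# Hypothetical rows, as wrappers over the three sites might return them.
movie_site = [  # (title, director)
    ("Film A", "Woody Allen"),
    ("Film B", "Woody Allen"),
    ("Film C", "Another Director"),
]
showtimes_site = [  # (title, theater) for the user's area
    ("Film A", "Theater 1"),
    ("Film C", "Theater 2"),
]
review_site = [  # (title, review text)
    ("Film A", "Review of Film A"),
    ("Film B", "Review of Film B"),
]


def reviews_of_playing_movies(director):
    """Join the three sources on movie title."""
    directed = {title for title, d in movie_site if d == director}
    playing = {title for title, _ in showtimes_site}
    wanted = directed & playing
    return [(title, text) for title, text in review_site if title in wanted]
```

Only "Film A" is both directed by the requested director and playing locally, so it is the only title whose review is returned.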

Design time and run time
[Diagram: at design time, a mediated schema and semantic mappings to the sources; at run time, query reformulation, then optimization & execution, over one wrapper per source]

Semantic Mappings
Mediated schema:
• CD: ASIN, Title, Genre, …
• Artist: ASIN, name, …
Mapping logic relates the mediated schema to the information sources:
• CDs: Album, ASIN, Price, DiscountPrice, Studio
• CDCategories: ASIN, Category
• Books: Title, ISBN, Price, DiscountPrice, Edition
• BookCategories: ISBN, Category
• Authors: ISBN, FirstName, LastName
• Artists: ASIN, ArtistName, GroupName

Global-as-View (GAV) (previous approaches)
• Mediated schema: CD: ASIN, Title, Genre, …; Artist: ASIN, name, …
• Mapping: each relation of the mediated schema is defined as a view over the sources R1, R2, R3, R4, R5
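
A minimal sketch of the GAV idea, assuming the mediated relation CD is defined as a join of the CDs and CDCategories sources named on the Semantic Mappings slide. The rows and function names are hypothetical; the point is only that a query over the mediated schema is answered by unfolding the view definition.

```python
# Hypothetical rows for two of the sources listed on the Semantic Mappings slide.
cds = [
    {"Album": "Album One", "ASIN": "A1", "Price": 12.0},
    {"Album": "Album Two", "ASIN": "A2", "Price": 9.0},
]
cd_categories = [
    {"ASIN": "A1", "Category": "Jazz"},
    {"ASIN": "A2", "Category": "Rock"},
]


def mediated_cd():
    """GAV definition: CD(ASIN, Title, Genre) as a join of CDs and CDCategories."""
    genre = {row["ASIN"]: row["Category"] for row in cd_categories}
    return [
        (row["ASIN"], row["Album"], genre[row["ASIN"]])
        for row in cds
        if row["ASIN"] in genre
    ]


def titles_by_genre(wanted):
    """A query over the mediated schema, answered by unfolding the view above."""
    return [title for _, title, g in mediated_cd() if g == wanted]
```

Adding a new source under GAV means editing `mediated_cd` itself, which is the maintenance cost that LAV was meant to reduce.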

Local-as-View (LAV)
• Mediated schema: CD: ASIN, Title, Genre, Year; Artist: ASIN, Name, …
• Mapping: each source R1, R2, R3, R4, R5 is described as a view over the mediated schema
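
The two mapping styles can be contrasted in one worked pair of formulas. This is an illustrative sketch: the GAV rule uses the CDs and CDCategories sources from the Semantic Mappings slide, while the contents of source R1 (jazz CDs from 1990 onward) are an assumption made up for the example.

```latex
\begin{align*}
\text{GAV:}\quad & \mathit{CD}(a, t, g) \leftarrow \mathit{CDs}(t, a, p, d, s),\ \mathit{CDCategories}(a, g) \\
\text{LAV:}\quad & R_1 \subseteq \{\, (a, t) \mid \exists g, y.\ \mathit{CD}(a, t, g, y) \wedge g = \text{Jazz} \wedge y \geq 1990 \,\}
\end{align*}
```

In the GAV rule the mediated relation is on the left and sources are on the right; in the LAV description the source is on the left and is characterized, with its constraints, in terms of the mediated schema.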

Benefits of LAV
• Describing information sources became easier
  - a data integration system could accommodate new sources easily
• The descriptions of the information sources could be more precise
  - describing precise constraints on the contents of the sources became easier

Query reformulation
• A query posed over the mediated schema, e.g. CD(A, T, G) over CD: ASIN, Title, Genre, …
• is reformulated into a set of queries on the data sources:
  - CDs: Album, ASIN, Price, DiscountPrice, Studio
  - CDCategories: ASIN, Category
  - Books: Title, ISBN, Price, DiscountPrice, Edition
  - BookCategories: ISBN, Category
  - Authors: ISBN, FirstName, LastName
  - Artists: ASIN, ArtistName, GroupName

Query Answering in LAV = Answering Queries Using Views (AQUV)
• A problem which was earlier considered in the context of query optimization
• Given a set of views V1, …, Vn and a query Q, can we answer Q using only the answers to V1, …, Vn?

AQUV
• Query optimization & supporting physical data independence
• AQUV for data integration:
  - not necessarily an equivalent rewriting
  - find a maximally contained rewriting
• Main AQUV algorithms:
  - Bucket
  - Inverse rules
  - MiniCon
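
A contained rewriting is one whose expansion is contained in the original query. For conjunctive queries without comparison predicates, containment can be tested by searching for a containment mapping. The sketch below is not one of the algorithms named on the slide; it only illustrates that containment test, under an assumed representation: a query is a pair (head terms, list of (predicate, args) atoms), and any string starting with an uppercase letter is a variable.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def extend(mapping, src, dst):
    """Extend mapping so each src term maps onto the dst term; None on conflict."""
    if len(src) != len(dst):
        return None
    new = dict(mapping)
    for s, d in zip(src, dst):
        if is_var(s):
            if new.setdefault(s, d) != d:
                return None
        elif s != d:  # constants must map to themselves
            return None
    return new


def contained_in(q1, q2):
    """True if q1 is contained in q2, i.e. a containment mapping from q2 to q1 exists."""
    head1, body1 = q1
    head2, body2 = q2
    start = extend({}, head2, head1)
    if start is None:
        return False

    def search(i, mapping):
        if i == len(body2):
            return True
        pred, args = body2[i]
        for p, a in body1:
            if p == pred:
                nxt = extend(mapping, args, a)
                if nxt is not None and search(i + 1, nxt):
                    return True
        return False

    return search(0, start)


# q(T) :- CD(A, T, G)         all CD titles
all_titles = (("T",), [("CD", ("A", "T", "G"))])
# q(T) :- CD(A, T, "jazz")    titles of jazz CDs only
jazz_titles = (("T",), [("CD", ("A", "T", "jazz"))])
```

Here `contained_in(jazz_titles, all_titles)` holds but the converse does not, so a source that only offers jazz CDs yields a contained, not an equivalent, rewriting of the query for all titles.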

Building on the Foundation
• Generating schema mappings
• Adaptive query processing
• XML
• Model management
• Peer-to-Peer Data Management
• The Role of Artificial Intelligence

Generating Schema Mappings
• Observation: who is going to write all these LAV/GAV formulas (the semantic mappings between the sources and the mediated schema)?
  1. creating the source descriptions
  2. writing the semantic mappings
• This was the main bottleneck.

Techniques for Schema Mapping
• Semi-automatically generating schema mappings
• Goal: create tools that speed up the creation of the mappings and reduce the amount of human effort involved
• Compare schema elements based on:
  - linguistic similarities
  - overlaps in data values or data types
• Schema mapping tasks are often repetitive.
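
The two comparison signals on this slide can be combined into a simple match suggester. This is a rough sketch, not a technique from the talk: name similarity uses Python's `difflib.SequenceMatcher`, value overlap uses Jaccard similarity, and the equal weighting and all column data are assumptions. Suggestions are meant for human review, in line with the semi-automatic goal.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Linguistic signal: character-level similarity of the two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Data signal: Jaccard overlap of the sample values of two columns."""
    sa, sb = set(values_a), set(values_b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_score(name_a, vals_a, name_b, vals_b, w_name=0.5):
    """Weighted mix of both signals; the 0.5 weight is an arbitrary assumption."""
    return (w_name * name_similarity(name_a, name_b)
            + (1 - w_name) * value_overlap(vals_a, vals_b))


def suggest_matches(source_cols, mediated_cols):
    """For each source column, propose the best-scoring mediated column."""
    return {
        s: max(mediated_cols,
               key=lambda m: match_score(s, source_cols[s], m, mediated_cols[m]))
        for s in source_cols
    }


# Hypothetical sample values for two source columns and two mediated attributes.
source = {"ArtistName": ["X", "Y"], "Category": ["Jazz", "Rock"]}
mediated = {"name": ["X", "Z"], "Genre": ["Jazz", "Pop"]}
```

With these samples, `Category` is matched to `Genre` mainly through shared values, which shows why name similarity alone is not enough.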

A Machine Learning Approach
[Diagram: given matches to the mediated schema, predict new ones]
• Map multiple schemas in the same domain to the same mediated schema.
• Learn from previous experience:
  - use the manually created schema mappings as training data
  - generalize from them to predict mappings between unseen schemas
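
The learn-from-previous-mappings idea can be illustrated with a deliberately tiny learner. This is a toy stand-in for the learners used in real systems: it counts which mediated attribute each name token was mapped to in the training pairs and lets tokens vote on unseen attribute names. The class name, training pairs, and tokenization rule are all assumptions.

```python
import re
from collections import Counter, defaultdict


def tokens(name):
    """Split a CamelCase or snake_case attribute name into lowercase tokens."""
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)]


class MappingLearner:
    """Toy learner: each name token votes for the mediated attributes it was seen with."""

    def __init__(self):
        self.votes = defaultdict(Counter)  # token -> Counter of mediated attributes

    def train(self, pairs):
        for source_attr, mediated_attr in pairs:
            for tok in tokens(source_attr):
                self.votes[tok][mediated_attr] += 1

    def predict(self, source_attr):
        tally = Counter()
        for tok in tokens(source_attr):
            tally.update(self.votes.get(tok, {}))
        return tally.most_common(1)[0][0] if tally else None


# Hypothetical manually created mappings used as training data.
learner = MappingLearner()
learner.train([
    ("ArtistName", "Artist.name"),
    ("GroupName", "Artist.name"),
    ("AlbumTitle", "CD.Title"),
    ("Category", "CD.Genre"),
])
```

An unseen attribute such as `artist_name` is then mapped to `Artist.name`, while an attribute with no known tokens yields `None` and is left for a human.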

Adaptive query processing
• Observation:
  - Once we have mappings, how can we execute queries?
  - Traditional plan-then-execute doesn't work.
• Root cause: the dynamic nature of data integration contexts

Adaptive query processing
• In a data integration system, the context is very dynamic and the optimizer has much less information than in the traditional setting.
• Two results:
  - the optimizer can't decide on a good plan
  - a plan may be arbitrarily bad
• Solution: dynamically adjust the query plan
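
The slide does not name a specific adaptive technique. As one illustration of adjusting a plan during execution, the sketch below reorders selection predicates per tuple according to the pass rates observed so far, a much-simplified form of tuple routing. The class name, the smoothing prior, and the sample rows are assumptions.

```python
class AdaptiveFilterPipeline:
    """Applies predicates in an order that is re-derived from observed pass rates."""

    def __init__(self, predicates):
        self.predicates = predicates  # name -> function(row) -> bool
        # name -> [passed, seen]; starts at 1/2 as a neutral prior (an assumption).
        self.stats = {name: [1, 2] for name in predicates}

    def order(self):
        """Most selective predicate (lowest observed pass rate) first."""
        return sorted(self.predicates,
                      key=lambda n: self.stats[n][0] / self.stats[n][1])

    def run(self, rows):
        for row in rows:
            for name in self.order():
                passed = self.predicates[name](row)
                self.stats[name][1] += 1
                self.stats[name][0] += int(passed)
                if not passed:
                    break  # drop the row; later predicates are skipped
            else:
                yield row


# Hypothetical stream: every row is cheap, but only the last one is jazz.
rows = [{"price": 10, "genre": "Rock"}] * 9 + [{"price": 10, "genre": "Jazz"}]
pipeline = AdaptiveFilterPipeline({
    "cheap": lambda r: r["price"] < 100,
    "jazz": lambda r: r["genre"] == "Jazz",
})
result = list(pipeline.run(rows))
```

After the first row the pipeline learns that the `jazz` predicate rejects more rows and evaluates it first, without any statistics being available up front.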

XML characteristics for data integration
• XML offered a common syntactic format for sharing data among data sources.
• Interest in integration grew, since it appeared as if data could actually be shared.
• Integration systems were built using XML as the underlying data model and XML query languages (XQuery).

Model Management
• Goal: provide an algebra for manipulating schemas and mappings
• With such an algebra:
  - complex operations on data sources become simple sequences of operators in the algebra
• Some of the operators in Model Management:
  - create & compose mappings, merge & diff models
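
To give a feel for what such operators do, here is a toy sketch in which a schema is just a set of element names and a mapping is a dict from elements of one schema to elements of another. Real model management operators work on far richer models and mappings; the representations, function signatures, and sample schemas here are assumptions.

```python
def compose(m_ab, m_bc):
    """Compose A->B and B->C mappings into A->C; elements without an image are dropped."""
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}


def diff(schema_a, mapping_ab):
    """Elements of schema A that the mapping does not cover."""
    return set(schema_a) - set(mapping_ab)


def merge(schema_a, schema_b, mapping_ab):
    """Union of both schemas, keeping the A name for elements that the mapping equates."""
    mapped_b = set(mapping_ab.values())
    return set(schema_a) | {e for e in schema_b if e not in mapped_b}


# Hypothetical schemas and mappings.
books = {"Title", "ISBN", "Price"}
mediated = {"title", "isbn", "genre"}
books_to_mediated = {"Title": "title", "ISBN": "isbn"}
mediated_to_catalog = {"title": "name", "isbn": "id"}
```

A task such as "map Books to the catalog schema and report what is left unmapped" then becomes a short sequence: `compose(books_to_mediated, mediated_to_catalog)` followed by `diff(books, books_to_mediated)`.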

Peer Data Management Systems
[Diagram: peers UW (Wisconsin), Stanford, Berkeley, UW (Washington), UW (Waterloo), DBLP, and CiteSeer, connected by LAV and GLAV mappings; a query Q posed at one peer is reformulated into queries Q1 to Q6 over the other peers]

Two Additional Benefits
• A P2P architecture offers a truly distributed mechanism for sharing data.
  - Every data source only provides semantic mappings to a set of neighbors.
  - Complex integrations emerge by following semantic paths.
• A P2P architecture is more appropriate than a single mediated schema in data sharing contexts.
  - There is never a single global mediated schema.
  - Data sharing occurs in local neighborhoods of the network.
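
Following semantic paths can be sketched as a breadth-first traversal that composes pairwise attribute mappings along the way. The sketch assumes mappings are simple attribute renamings between neighboring peers, which is far weaker than the LAV/GLAV mappings of a real PDMS; the attribute names and the function name are invented.

```python
from collections import deque


def translate_along_paths(mappings, start, attrs):
    """For each peer reachable from start, express start's attributes in that peer's terms.

    mappings: {(src_peer, dst_peer): {src_attr: dst_attr}}, directed and pairwise.
    Attributes with no image along the first path found are dropped.
    """
    neighbors = {}
    for (src, dst), m in mappings.items():
        neighbors.setdefault(src, []).append((dst, m))

    seen = {start: {a: a for a in attrs}}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for nxt, m in neighbors.get(peer, []):
            if nxt in seen:
                continue
            seen[nxt] = {a: m[b] for a, b in seen[peer].items() if b in m}
            queue.append(nxt)
    return seen


# Hypothetical mappings between three of the peers shown in the diagram.
peer_mappings = {
    ("UW", "Stanford"): {"title": "paper_title", "author": "writer"},
    ("Stanford", "DBLP"): {"paper_title": "name"},
}
translated = translate_along_paths(peer_mappings, "UW", ["title", "author"])
```

UW never maps to DBLP directly, yet a query on `title` can reach it through Stanford; `author` is lost on that path because Stanford's mapping to DBLP does not cover it.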

The Role of Artificial Intelligence
• Description Logics describe relationships between data sources
  - data sources need to be represented declaratively
  - the mediated schema of IM was based on the Classic Description Logic
• Description Logics offered more flexible mechanisms for representing a mediated schema
• Recent work: combine the expressive power of Description Logics with the ability to manage large amounts of data

The Data Integration Industry
• Late 90's: commercialization
• Enterprise Information Integration (EII): integration without having to first load all the data into a central warehouse
• Drivers of the development of the EII industry:
  - technologies from research labs had matured enough
  - the needs of data management
  - XML
• Inappropriate alternatives: data warehousing solutions, ad-hoc solutions

A data integration scenario
• Data sources: the sources that will participate in the application
• Mediated schema: build semantic mappings from the data sources to a virtual schema
• Query processing:
  - a query is posed over the virtual schema
  - query reformulation
  - execute with an engine that creates plans that span multiple data sources

Other EII Products
• XML data model and XQuery
  - Challenge: the research on integration for XML was only in its infancy
• Customer-relationship management
  - Challenge: how to provide the customer-facing worker a global view of a customer whose data is residing in multiple sources, and track information from multiple sources in real time

Future Challenges
• The factors behind data integration challenges:
  - Social: data integration is fundamentally about getting people to collaborate and share data.
  - Complexity of integration
• Data integration has been referred to as a problem as hard as AI, maybe even harder!
• Our goal: create tools that facilitate data integration in a variety of scenarios.

Several Specific Challenges
• Dataspaces: pay-as-you-go data management
• Uncertainty and lineage
• Reusing human attention

Dataspaces
• Database system: create the schema first!
• Data integration system: create the semantic mappings first!
• Fundamental shortcoming: long setup time!
• Dataspaces: the idea of pay-as-you-go data management

Pay-as-you-go
• Offer some services immediately without any setup time, and improve the services as more investment is made into creating semantic relationships.
• A dataspace should offer keyword search over any data in any source with no setup time.
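
A minimal sketch of the pay-as-you-go idea, assuming each source is just a list of records with arbitrary fields. Keyword search works with no setup at all; once someone invests in a mapping from local field names to shared attribute names, the same function can restrict the search to one attribute. The function name, the `mappings` parameter, and the sample records are hypothetical.

```python
def search(sources, keyword, attribute=None, mappings=None):
    """Keyword search over all sources; optionally restricted to a mapped attribute.

    sources:  {source_name: [record_dict, ...]}
    mappings: {source_name: {local_field: shared_attribute}}, added incrementally
    """
    needle = keyword.lower()
    mappings = mappings or {}
    hits = []
    for name, records in sources.items():
        local_to_shared = mappings.get(name, {})
        for record in records:
            for field, value in record.items():
                if attribute is not None and local_to_shared.get(field) != attribute:
                    continue  # field is not (yet) mapped to the requested attribute
                if needle in str(value).lower():
                    hits.append((name, record))
                    break
    return hits


# Hypothetical sources with unrelated field names.
sources = {
    "books": [{"Title": "Jazz History", "ISBN": "1"}],
    "cds": [{"Album": "Blue", "Category": "Jazz"}],
}
title_mappings = {"books": {"Title": "title"}, "cds": {"Album": "title"}}
```

With no mappings, `search(sources, "jazz")` returns hits from both sources; after the mappings are supplied, `search(sources, "jazz", attribute="title", mappings=title_mappings)` returns only the book whose title mentions jazz.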

Pay-as-you-go Data Management
Dataspaces: Franklin, Halevy, Maier [see PODS 2006]
[Graph: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions]

Uncertain data & data lineage
• A necessity in data integration systems
• The system should introspect about the certainty of the data
• When it cannot automatically determine the certainty, it should refer the user to the lineage of the data
• Example: Web search engines provide URLs along with their search results, so users can consider the URLs in the decision of which results to explore further
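
One way to picture this is an answer record that always carries its lineage and, when available, a confidence. The sketch below merges identical answers from several sources; the independence assumption behind the combined confidence, the class and function names, and the sample values are illustrative assumptions, not a method from the talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Answer:
    value: str
    lineage: Tuple[str, ...]            # sources the value was derived from
    confidence: Optional[float] = None  # None when certainty cannot be determined


def combine(answers):
    """Merge answers with the same value, concatenating their lineage.

    Confidence is combined as 1 - prod(1 - c), which assumes the sources are
    independent; if any contributing confidence is unknown, the result is None
    and the user has to judge from the lineage.
    """
    grouped = {}
    for a in answers:
        grouped.setdefault(a.value, []).append(a)

    merged = []
    for value, group in grouped.items():
        lineage = tuple(src for a in group for src in a.lineage)
        if all(a.confidence is not None for a in group):
            miss = 1.0
            for a in group:
                miss *= 1.0 - a.confidence
            confidence = 1.0 - miss
        else:
            confidence = None
        merged.append(Answer(value, lineage, confidence))
    return merged


merged = combine([
    Answer("x", ("source1",), 0.5),
    Answer("x", ("source2",), 0.5),
    Answer("y", ("source3",)),
])
```

The answer "y" has no computed confidence, so the only support the system can offer is its lineage, much like a search engine showing the URL next to a result.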

Reusing human attention
• Goal: achieving tighter semantic integration among data sources
• Any operation users perform on data sources gives a semantic clue about the data or about relationships between data sources
• Systems that leverage these semantic clues can obtain semantic integration much faster
• An area for additional research and development

Conclusion
• Not so long ago, data integration was a nice feature and an area for intellectual curiosity; today it is a necessity.
• Today's economy further emphasizes the need for data integration solutions.
• Thomas Friedman: The World Is Flat.

A Framework for Deep Web Integration
[Diagram legend: developed issue, developing issue, undeveloped issue, our focuses]

Q&A