iTrails: Pay-as-you-go Information Integration in Dataspaces
Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007, September 26, 2007

Outline
§ Motivation
§ iTrails
§ Experiments
§ Conclusions and Future Work

September 26, 2007 Marcos Vaz Salles / ETH Zurich / marcos.vazsalles@inf.ethz.ch

Problem: Querying Several Sources
§ Query: What is the impact of global warming in Zurich?
§ System: ?
§ Data Sources: Laptop, Email Server, Web Server, DB Server

Solution 1: Use a Search Engine
§ Query: global warming zurich
§ System: Graph / IR search engine: TopX [VLDB 05], FleXPath [SIGMOD 04], XSearch [VLDB 03], XRank [SIGMOD 03]
§ Data Sources (text, links): Laptop, Email Server, Web Server, DB Server
§ Drawback: Query semantics are not precise!

Solution 2: Use an Information Integration System
§ Query: //Temperatures/*[city = “zurich”]
§ System: Information integration system with schema mappings: GAV (e.g., [ICDE 95]), LAV (e.g., [VLDB 96]), GLAV [AAAI 99], P2P (e.g., [SIGMOD 04])
§ Mapped sources: Temps, Cities, CO2, Sunspots, ... (some with a missing schema mapping)
§ Data Sources: Laptop, Email Server, Web Server, DB Server
§ Drawback: Too much effort to provide schema mappings!

Research Challenge: Is There an Integration Solution in-between These Two Extremes?
§ Graph / IR search engine: query global warming zurich over text and links
§ Dataspace system (pay-as-you-go information integration): query global warming zurich over text and links, plus ?
§ Information integration system: query //Temperatures/*[city = “zurich”] over Temps, Cities, CO2, Sunspots, ... with full-blown schema mappings
§ Data Sources: Laptop, Email Server, Web Server, DB Server
§ Dataspace vision by Franklin, Halevy, and Maier [SIGMOD Record 05]


iTrails Core Idea: Add Integration Hints Incrementally
§ Step 1: Provide a search service over all the data
  § Use a general graph data model (see VLDB 2006)
  § Works for unstructured documents, XML, and relations
§ Step 2: Add integration semantics via hints (trails) on top of the graph
  § Works across data sources, not only between sources
§ Step 3: If more semantics needed, go back to step 2
§ Impact:
  § Smooth transition between search and data integration
  § Semantics added incrementally improve precision / recall

iTrails: Defining Trails
§ Basic Form of a Trail:
    Q_L[.C_L] → Q_R[.C_R]
  § Q_L, Q_R: queries (NEXI-like keyword and path expressions)
  § C_L, C_R: optional attribute projections
§ Intuition: When I query for Q_L[.C_L], you should also query for Q_R[.C_R]
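The basic trail form can be written down as a small data structure. The following is only a sketch: the class names `TrailSide` and `Trail` and their fields are hypothetical, and queries and projections are kept as plain strings rather than parsed NEXI-like expressions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrailSide:
    """One side of a trail: a query Q plus an optional attribute projection C."""
    query: str                        # keyword or path expression, kept as text
    projection: Optional[str] = None  # attribute projection, if any

    def __str__(self) -> str:
        if self.projection is None:
            return self.query
        return f"{self.query}.{self.projection}"


@dataclass(frozen=True)
class Trail:
    """Q_L[.C_L] -> Q_R[.C_R]: querying the left side should also query the right side."""
    left: TrailSide
    right: TrailSide

    def __str__(self) -> str:
        return f"{self.left} → {self.right}"


# A keyword trail without projections.
thesaurus = Trail(TrailSide("car"), TrailSide("auto"))

# A trail with attribute projections on both sides.
schema_match = Trail(
    TrailSide("//Employee//*", "tuple.empName"),
    TrailSide("//Person//*", "tuple.name"),
)

print(thesaurus)     # car → auto
print(schema_match)  # //Employee//*.tuple.empName → //Person//*.tuple.name
```

Keeping both sides as immutable values makes trails easy to store in sets or use as dictionary keys when many of them are collected.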

Trail Examples: Global Warming Zurich
§ Query: global warming zurich
§ Table Temperatures (DB Server):
    date     city     region   celsius
    24-Sep   Bern     BE       20
    24-Sep   Uster    ZH       15
    25-Sep   Zurich   ZH       14
    26-Sep   Zurich   ZH       9
§ Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
    global warming → //Temperatures/*[celsius > 10]
§ Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
    zurich → //*[region = “ZH”]

Trail Example: Deep Web Bookmarks
§ Query: train home
§ Data source: Web Server
§ Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”
    train home → //trainCompany.com//*[origin = “ETH Uni” and dest = “Seilbahn Rigiblick”]

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
§ Data sources: Laptop, Email Server
§ Trail for Thesauri: “When I query for car, you should also query for auto”
    car → auto
§ Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
    car → carro
    carro → car

Trail Examples: Schema Equivalences
§ Tables (DB Server):
  § Employee (empId, empName, salary)
  § Person (SSN, name, age, income)
§ Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
    //Employee//*.tuple.empName → //Person//*.tuple.name
§ Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
    //Employee//*.tuple.salary → //Person//*.tuple.income

Outline
§ Motivation
§ iTrails
  § Core Idea
  § Trail Examples
  § How are Trails Created?
  § Uncertainty and Trails
  § Rewriting Queries with Trails
  § Recursive Matches
§ Experiments
§ Conclusion and Future Work

How are Trails Created?
§ Given by the user
  § Explicitly
  § Via Relevance Feedback
§ (Semi-)Automatically
  § Information extraction techniques
  § Automatic schema matching
  § Ontologies and thesauri (e.g., WordNet)
  § User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails
§ Probabilistic Trails:
  § model uncertain trails
  § probabilities used to rank trails
    Q_L[.C_L] →_p Q_R[.C_R], 0 ≤ p ≤ 1
§ Example: car → auto, p = 0.8

Certainty and Trails
§ Scored Trails:
  § give higher value to certain trails
  § scoring factors used to boost scores of query results obtained by the trail
    Q_L[.C_L] →_sf Q_R[.C_R], sf > 1
§ Examples:
  § T1: weather → //Temperatures/*, p = 0.9, sf = 2
  § T2: yesterday → //*[date = today() – 1], p = 1, sf = 3
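A minimal sketch of how the two annotations could be used together, with T1 and T2 as data. The slides only say that probabilities rank trails and that scoring factors boost result scores; the multiplication in `boosted_score` is an assumption made for illustration, and the names `ScoredTrail`, `rank_trails`, and `boosted_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTrail:
    name: str
    left: str
    right: str
    p: float   # probability that the trail is correct, 0 <= p <= 1
    sf: float  # scoring factor, sf > 1

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        if self.sf <= 1.0:
            raise ValueError("sf must be greater than 1")


def rank_trails(trails):
    """Order trails by decreasing probability."""
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boosted_score(base_score, trail):
    """Boost the score of a result obtained via `trail`.

    Assumption: the boost is a plain multiplication by sf.
    """
    return base_score * trail.sf


t1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2.0)
t2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3.0)

print([t.name for t in rank_trails([t1, t2])])  # ['T2', 'T1']
print(boosted_score(0.5, t1))                   # 1.0
```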

Rewriting Queries with Trails
§ Query: weather yesterday
§ Trail T2: yesterday → //*[date = today() – 1]
§ (1) Matching: T2 matches the query node yesterday
§ (2) Transformation: yesterday U //*[date = today() – 1]
§ (3) Merging: the transformed subtree takes the place of yesterday in the query tree, next to weather
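The three steps can be sketched on a tiny query tree. This is an illustration under stated assumptions, not the iTrails implementation: the node types `Leaf` and `Op` and the function `apply_trail` are hypothetical, keywords are assumed to be combined with an "and" operator, and matching is reduced to exact equality on leaf expressions.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    expr: str  # a keyword or a path expression, kept as text


@dataclass(frozen=True)
class Op:
    kind: str  # "and" or "union"
    children: Tuple["Node", ...]


Node = Union[Leaf, Op]


def apply_trail(node: Node, left: str, right: str) -> Node:
    """Apply one trail (left → right) once, with union semantics."""
    if isinstance(node, Leaf):
        if node.expr == left:                        # (1) matching
            return Op("union", (node, Leaf(right)))  # (2) transformation
        return node
    # (3) merging: rebuild the operator around the rewritten children
    return Op(node.kind, tuple(apply_trail(c, left, right) for c in node.children))


query = Op("and", (Leaf("weather"), Leaf("yesterday")))
rewritten = apply_trail(query, "yesterday", "//*[date = today() - 1]")
print(rewritten)
```

Because the function makes a single pass and never revisits the leaves it creates, it sidesteps the question of repeated matches entirely.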

Replacing Trails
§ Trails that use replace instead of union semantics
§ Query: weather yesterday
§ Trail T2: yesterday → //*[date = today() – 1]
§ (1) Matching: T2 matches the query node yesterday
§ (2) Transformation: //*[date = today() – 1]
§ (3) Merging: //*[date = today() – 1] takes the place of yesterday in the query tree, next to weather
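The difference between union and replace semantics fits in a few lines. In this sketch the query is simplified to a flat list of terms, and the function name `apply_trail_flat` and the `("union", ...)` tuple encoding are hypothetical.

```python
def apply_trail_flat(terms, left, right, replace=False):
    """Rewrite a flat list of query terms with one trail (left → right).

    replace=False: keep the matched term and add the right side (union semantics).
    replace=True:  substitute the matched term by the right side (replace semantics).
    """
    rewritten = []
    for term in terms:
        if term == left:
            rewritten.append(right if replace else ("union", term, right))
        else:
            rewritten.append(term)
    return rewritten


query = ["weather", "yesterday"]
date_expr = "//*[date = today() - 1]"

print(apply_trail_flat(query, "yesterday", date_expr))
# ['weather', ('union', 'yesterday', '//*[date = today() - 1]')]
print(apply_trail_flat(query, "yesterday", date_expr, replace=True))
# ['weather', '//*[date = today() - 1]']
```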

Problem: Recursive Matches (1/2)
§ Query: weather yesterday
§ Trail T2: yesterday → //*[date = today() – 1]
§ T2 matches yesterday; the rewritten query contains yesterday U //*[date = today() – 1]
§ New query still matches T2, so T2 could be applied again
§ Result: infinite recursion

Problem: Recursive Matches (2/2)
§ Trails may be mutually recursive
  § T3: //*.tuple.date → //*.tuple.modified
  § T10: //*.tuple.modified → //*.tuple.date
§ T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1]
§ T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1]
§ We again match T3 and enter an infinite loop

Solution: Multiple Match Coloring Algorithm
§ Trails:
  § T1: weather → //Temperatures/*
  § T2: yesterday → //*[date = today() – 1]
  § T3: //*.tuple.date → //*.tuple.modified
  § T4: //*.tuple.date → //*.tuple.received
§ Query: weather yesterday
§ First Level: T1 matches weather and adds //Temperatures/*; T2 matches yesterday and adds //*[date = today() – 1]
§ Second Level: T3 and T4 match //*[date = today() – 1] and add //*[modified = today() – 1] and //*[received = today() – 1]
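The coloring idea can be illustrated with a deliberately simplified sketch; it is not the algorithm from the paper. Assumptions: each query term carries a set of "colors" (ids of the trails already applied on its derivation path), a trail is never applied to a term that already carries its color, new terms inherit the colors of the term they were derived from, and matching is reduced to exact string equality on pre-instantiated expressions. The names `Term` and `rewrite_with_coloring` are hypothetical, and the reverse trail T10 is added to show that mutual recursion terminates.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str
    colors: set = field(default_factory=set)  # ids of trails already applied


def rewrite_with_coloring(keywords, trails):
    """Return, per keyword, the distinct expressions reachable via trails.

    `trails` maps a trail id to a (left, right) pair of expressions.
    """
    groups = [[Term(k)] for k in keywords]  # one union group per keyword
    changed = True
    while changed:  # one iteration corresponds to one level
        changed = False
        for group in groups:
            added = []
            for term in group:
                for tid, (left, right) in trails.items():
                    if term.text == left and tid not in term.colors:
                        term.colors.add(tid)  # color the matched term
                        added.append(Term(right, set(term.colors)))  # inherit colors
            if added:
                group.extend(added)
                changed = True
    # Drop repeated expressions while keeping first-seen order.
    return [list(dict.fromkeys(t.text for t in g)) for g in groups]


DATE = "//*[date = today() - 1]"
MODIFIED = "//*[modified = today() - 1]"
RECEIVED = "//*[received = today() - 1]"

trails = {
    "T1": ("weather", "//Temperatures/*"),
    "T2": ("yesterday", DATE),
    "T3": (DATE, MODIFIED),
    "T4": (DATE, RECEIVED),
    "T10": (MODIFIED, DATE),  # mutually recursive with T3
}

result = rewrite_with_coloring(["weather", "yesterday"], trails)
print(result)
```

Termination follows because every derived term carries strictly more colors than the term it came from, and the number of colors is bounded by the number of trails.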

Multiple Match Coloring Algorithm Analysis
§ Problem: MMCA is exponential in number of levels
§ Solution: Trail Pruning
  § Prune by number of levels
  § Prune by top-K trails matched in each level
  § Prune by both top-K trails and number of levels
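Both pruning strategies can be combined in one bounded expansion loop. The sketch below is an assumption-laden illustration: trails are exact-match string pairs, "top-K" is taken to mean the K matched trails with the highest probability in each level, and all trail probabilities except car → auto (p = 0.8) are made up. The names `Trail` and `expand` are hypothetical.

```python
from typing import NamedTuple


class Trail(NamedTuple):
    left: str
    right: str
    p: float  # trail probability, used for ranking


def expand(keyword, trails, max_levels, top_k):
    """Expand a keyword via trails, pruning by level count and top-K per level."""
    seen = [keyword]
    frontier = [keyword]
    for _ in range(max_levels):                      # prune by number of levels
        matched = [t for t in trails if t.left in frontier]
        matched.sort(key=lambda t: t.p, reverse=True)
        frontier = []
        for trail in matched[:top_k]:                # prune by top-K matched trails
            if trail.right not in seen:
                seen.append(trail.right)
                frontier.append(trail.right)
        if not frontier:
            break
    return seen


trails = [
    Trail("car", "auto", 0.8),
    Trail("car", "vehicle", 0.5),     # hypothetical probability
    Trail("car", "carro", 0.9),       # hypothetical probability
    Trail("auto", "automobile", 0.7), # hypothetical trail
]

print(expand("car", trails, max_levels=1, top_k=2))  # ['car', 'carro', 'auto']
print(expand("car", trails, max_levels=2, top_k=2))  # ['car', 'carro', 'auto', 'automobile']
```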


iTrails Evaluation in iMeMex
§ iMeMex Dataspace System: Open-source prototype available at http://www.imemex.org
§ Main Questions in Evaluation
  § Quality: Top-K Precision and Recall
  § Performance: Use of Materialization
  § Scalability: Query-rewrite Time vs. Number of Trails

iTrails Evaluation in iMeMex
§ Scenario 1: Few High-quality Trails
  § Closer to information integration use cases
  § Obtained real datasets and indexed them
  § 18 hand-crafted trails
  § 14 hand-crafted queries
§ Scenario 2: Many Low-quality Trails
  § Closer to search use cases
  § Generated up to 10,000 trails

iTrails Evaluation in iMeMex: Scenario 1
§ Configured iMeMex to act in three modes
  § Baseline: Graph / IR search engine
  § iTrails: Rewrite search queries with trails
  § Perfect Query: Semantics-aware query
§ Data: shipped to central index
[Table: data sizes in MB for Laptop, Web Server, Email Server, DB Server]

Quality: Top-K Precision and Recall
§ Scenario 1: few high-quality trails (18 trails), K = 20
§ Search Engine misses relevant results
§ Search Query is partially semantics-aware
§ Example queries: Q3: pdf yesterday; Q13: to = raimund.grube@enron.com
§ Perfect Query always has precision and recall equal to 1
[Figure: top-K precision and recall per query]
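For reference, top-K precision and recall can be computed as follows. The slides do not spell out the exact definitions used in the evaluation, so this sketch assumes the common ones: precision is the fraction of the returned top-K results that are relevant, and recall is the fraction of all relevant results that appear in the top K. The function name and sample data are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Return (precision, recall) over the first k entries of `ranked`.

    ranked:   result identifiers in rank order
    relevant: set of identifiers judged relevant for the query
    """
    top = ranked[:k]
    hits = sum(1 for r in top if r in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"a", "c", "e"}

# A ranking that misses relevant results and returns irrelevant ones.
print(precision_recall_at_k(["a", "b", "c", "d"], relevant, k=2))

# A ranking that returns exactly the relevant results.
print(precision_recall_at_k(["a", "c", "e"], relevant))  # (1.0, 1.0)
```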

Performance: Use of Materialization
§ Scenario 1: few high-quality trails (18 trails)
§ Trail merging adds overhead to query execution
§ Trail Materialization provides interactive times for all queries
[Figure: response times in sec.]

Scalability: Query-rewrite Time vs. Number of Trails
§ Scenario 2: many low-quality trails
§ Query-rewrite time can be controlled with pruning

Conclusion: Pay-as-you-go Information Integration
§ Dataspace System: query global warming zurich over data sources (text, links)
  § Step 1: Provide a search service over all the data
  § Step 2: Add integration semantics via trails
  § Step 3: If more semantics needed, go back to step 2
§ Our Contributions
  § iTrails: generic method to model semantic relationships (e.g., implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  § We propose a framework and algorithms for Pay-as-you-go Information Integration
  § Smooth transition between search and data integration

Future Work
§ Trail Creation
  § Use collections (ontologies, thesauri, Wikipedia)
  § Work on automatic mining of trails from the dataspace
§ Other types of trails
  § Associations
  § Lineage

Questions?
Thanks in advance for your feedback!
marcos.vazsalles@inf.ethz.ch
http://www.imemex.org

Backup Slides

Problem: Global Warming in Zurich
§ Query: “What is the impact of global warming in Zurich?”
§ Search for: global warming zurich
§ Meaning of keyword query
  § global warming should lead to a query on Temperatures
  § zurich should lead to a query for a city

Problem: PDF Yesterday
§ Query: “Retrieve all PDF documents added/modified yesterday”
§ Search for: pdf yesterday
§ Meaning of keywords pdf and yesterday
§ Different sources, different schemas:
  § Laptop: modified
  § Email: received
  § DBMS: changed

Related Work: Search vs. Data Integration vs. Dataspaces

Features             Search               Dataspaces       Data Integration
Integration Effort   Low                  Pay-as-you-go    High
Query Semantics      Precision / Recall                    Precise
Need for Schema      Schema-never         Schema-later     Schema-first

Personal Dataspaces Literature
§ Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
§ Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
§ Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
§ Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
§ Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007.
§ Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.

iDM: iMeMex Data Model
§ Our approach: get the data model closer to personal information, not the other way around
§ Supports:
  § Unstructured, semi-structured, and structured data, e.g., files & folders, XML, relations
  § Clear separation of logical and physical representation of data
  § Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  § Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  § Infinite data, e.g., media and data streams
§ See VLDB 2006

Data Model Options
§ Data models compared: Bag of Words, Relational, XML, iDM
§ Support for Personal Data: non-schematic data; serialization independent
§ Support for Graph Data: specific schema (Relational); extension: XLink / XPointer (XML)
§ Support for Lazy Computation: view mechanism (Relational); extension: ActiveXML (XML)
§ Support for Infinite Data: extension: document streams (Bag of Words); extension: relational streams (Relational); extension: XML streams (XML)

Data Models for Personal Information
[Figure: data models placed by abstraction level (lower to higher) between the physical level and personal information: Relational, XML, Document / Bag of Words, iDM]

Architectural Perspective of iMeMex
§ Complex operators (query algebra)
§ Indexes & Replicas access (warehousing)
§ Data source access (mediation)