iTrails: Pay-as-you-go Information Integration in Dataspaces
Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007, September 26, 2007

Outline
§ Motivation
§ iTrails
§ Experiments
§ Conclusions and Future Work
September 26, 2007 / Marcos Vaz Salles / ETH Zurich / marcos.vazsalles@inf.ethz.ch

Problem: Querying Several Sources
§ Query: What is the impact of global warming in Zurich?
§ Systems: ?
§ Data Sources: Laptop, Email Server, Web Server, DB Server

Solution 1: Use a Search Engine
§ Query: global warming zurich
§ System: Graph / IR Search Engine: TopX [VLDB 05], FleXPath [SIGMOD 04], XSearch [VLDB 03], XRank [SIGMOD 03]
§ Data Sources (accessed as text, links): Laptop, Email Server, Web Server, DB Server
§ Drawback: Query semantics are not precise!

Solution 2: Use an Information Integration System
§ Query: //Temperatures/*[city = “zurich”]
§ System: Information Integration System with schema mappings: GAV (e.g. [ICDE 95]), LAV (e.g. [VLDB 96]), GLAV [AAAI 99], P2P (e.g. [SIGMOD 04])
§ Integrated schema elements: Temps, Cities, CO2, Sunspots, ... (the figure marks a missing schema mapping)
§ Data Sources: Laptop, Email Server, Web Server, DB Server
§ Drawback: Too much effort to provide schema mappings!

Research Challenge: Is There an Integration Solution in-between These Two Extremes?
§ Graph / IR Search Engine: query global warming zurich; data sources accessed as text, links
§ Pay-as-you-go Information Integration (Dataspace System): query global warming zurich; data sources accessed as text, links; system: ?
§ Information Integration System: query //Temperatures/*[city = “zurich”]; full-blown schema mappings (Temps, Cities, CO2, Sunspots, ...)
§ Data Sources: Laptop, Email Server, Web Server, DB Server
§ Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]

Outline
§ Motivation
§ iTrails
§ Experiments
§ Conclusions and Future Work

iTrails Core Idea: Add Integration Hints Incrementally
§ Step 1: Provide a search service over all the data
  § Use a general graph data model (see VLDB 2006)
  § Works for unstructured documents, XML, and relations
§ Step 2: Add integration semantics via hints (trails) on top of the graph
  § Works across data sources, not only between sources
§ Step 3: If more semantics needed, go back to step 2
§ Impact:
  § Smooth transition between search and data integration
  § Semantics added incrementally improve precision / recall

iTrails: Defining Trails
§ Basic Form of a Trail: Q_L[.C_L] → Q_R[.C_R]
  § Q_L, Q_R: Queries (NEXI-like keyword and path expressions)
  § C_L, C_R: Attribute projections
§ Intuition: When I query for Q_L[.C_L], you should also query for Q_R[.C_R]

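The basic trail form above can be written down as a small data structure. The following is only an illustrative sketch: the class name `Trail` and its field names are assumptions made for this example and are not taken from the iTrails implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """Hypothetical container for a trail Q_L[.C_L] -> Q_R[.C_R]."""

    left_query: str                          # Q_L: keyword or path expression
    right_query: str                         # Q_R: keyword or path expression
    left_projection: Optional[str] = None    # C_L: optional attribute projection
    right_projection: Optional[str] = None   # C_R: optional attribute projection

    def __str__(self) -> str:
        # Append ".C" only when a projection is present, mirroring the [.C] notation.
        left = self.left_query
        if self.left_projection:
            left += "." + self.left_projection
        right = self.right_query
        if self.right_projection:
            right += "." + self.right_projection
        return f"{left} -> {right}"


# A keyword-to-keyword trail and a trail with attribute projections.
thesaurus = Trail("car", "auto")
schema_match = Trail("//Employee//*", "//Person//*", "tuple.empName", "tuple.name")
print(thesaurus)
print(schema_match)
```

Keeping the projections optional reflects the square brackets in the notation: a trail may relate whole queries or only specific attributes.
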
Trail Examples: Global Warming Zurich
§ Query: global warming zurich
§ Temperatures relation (DB Server):
  date    city    region  celsius
  24-Sep  Bern    BE      20
  24-Sep  Uster   ZH      15
  25-Sep  Zurich  ZH      14
  26-Sep  Zurich  ZH      9
§ Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
  global warming → //Temperatures/*[celsius > 10]
§ Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
  zurich → //*[region = “ZH”]

Trail Example: Deep Web Bookmarks
§ Query: train home (Web Server)
§ Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”
  train home → //trainCompany.com//*[origin = “ETH Uni” and dest = “Seilbahn Rigiblick”]

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
§ Figure: the keywords car, auto, and carro in data on Laptop and Email Server
§ Trail for Thesauri: “When I query for car, you should also query for auto”
  car → auto
§ Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
  car → carro
  carro → car

Trail Examples: Schema Equivalences
§ Relations (DB Server): Employee(empId, empName, salary); Person(SSN, name, age, income)
§ Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
  //Employee//*.tuple.empName → //Person//*.tuple.name
§ Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
  //Employee//*.tuple.salary → //Person//*.tuple.income

Outline
§ Motivation
§ iTrails
  § Core Idea
  § Trail Examples
  § How are Trails Created?
  § Uncertainty and Trails
  § Rewriting Queries with Trails
  § Recursive Matches
§ Experiments
§ Conclusion and Future Work

How are Trails Created?
§ Given by the user
  § Explicitly
  § Via Relevance Feedback
§ (Semi-)Automatically
  § Information extraction techniques
  § Automatic schema matching
  § Ontologies and thesauri (e.g., WordNet)
  § User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails
§ Probabilistic Trails:
  § model uncertain trails
  § probabilities used to rank trails
  § Form: Q_L[.C_L] → Q_R[.C_R] with probability p, 0 ≤ p ≤ 1
§ Example: car → auto, p = 0.8

Certainty and Trails
§ Scored Trails:
  § give higher value to certain trails
  § scoring factors used to boost scores of query results obtained by the trail
  § Form: Q_L[.C_L] → Q_R[.C_R] with scoring factor sf, sf > 1
§ Examples:
  § T1: weather → //Temperatures/*, p = 0.9, sf = 2
  § T2: yesterday → //*[date = today() – 1], p = 1, sf = 3

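How probabilities and scoring factors might be used together can be sketched as follows. The slides only state that probabilities rank trails and that scoring factors boost result scores; the multiplicative boost, the default sf = 1 for unscored trails, and all names here (`ScoredTrail`, `rank_trails`, `boost`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredTrail:
    """Hypothetical trail annotated with probability p and scoring factor sf."""

    name: str
    left: str
    right: str
    p: float = 1.0    # probability, 0 <= p <= 1
    sf: float = 1.0   # scoring factor; 1.0 is assumed to mean "no boost"

    def __post_init__(self) -> None:
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        if self.sf < 1.0:
            raise ValueError("sf must be at least 1")


def rank_trails(trails: List[ScoredTrail]) -> List[ScoredTrail]:
    """Order trails from most to least probable."""
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boost(base_score: float, trail: ScoredTrail) -> float:
    """Assumed boosting rule: multiply a result's score by the trail's sf."""
    return base_score * trail.sf


t1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2.0)
t2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3.0)
print([t.name for t in rank_trails([t1, t2])])
print(boost(0.5, t1))
```
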
Rewriting Queries with Trails
§ Query: weather yesterday
§ Trail T2: yesterday → //*[date = today() – 1]
§ (1) Matching: T2 matches the query node yesterday
§ (2) Transformation: the matched node is combined by union with the right side of T2, giving the union of yesterday and //*[date = today() – 1]
§ (3) Merging: the union takes the place of yesterday in the query tree, next to weather

Replacing Trails
§ Trails that use replace instead of union semantics
§ Query: weather yesterday
§ Trail T2: yesterday → //*[date = today() – 1] (replace semantics)
§ (1) Matching: T2 matches the query node yesterday
§ (2) Transformation: yesterday is replaced by //*[date = today() – 1]
§ (3) Merging: the rewritten query consists of weather and //*[date = today() – 1]

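The three rewriting steps (matching, transformation, merging), with either union or replace semantics, can be sketched on a tiny query tree. This is a simplified illustration, not the iTrails rewriting code: the `Op` type, the choice of "and" as the top-level operator for keyword queries, and matching by exact string equality on leaves are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Op:
    """Hypothetical inner node of a query tree."""

    kind: str                        # "and" or "union" (assumed operator names)
    children: Tuple["Node", ...]


Node = Union[str, Op]                # leaves are plain keyword/path strings


def rewrite(query: Node, left: str, right: str, replace: bool = False) -> Node:
    """Apply the trail left -> right once to every matching leaf of the query."""
    if isinstance(query, Op):
        # Merging: rebuild the parent around the (possibly transformed) children.
        return Op(query.kind, tuple(rewrite(c, left, right, replace) for c in query.children))
    if query == left:                # Matching: the leaf equals the trail's left side.
        if replace:
            return right             # Transformation with replace semantics.
        return Op("union", (query, right))  # Transformation with union semantics.
    return query


query = Op("and", ("weather", "yesterday"))
date_pred = "//*[date = today() - 1]"
print(rewrite(query, "yesterday", date_pred))
print(rewrite(query, "yesterday", date_pred, replace=True))
```

Because the function builds a new tree from the old one and never revisits the nodes it just created, a single call applies the trail only once; repeated application is where the recursion problem discussed next arises.
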
Problem: Recursive Matches (1/2)
§ T2: yesterday → //*[date = today() – 1]
§ T2 matches yesterday in the query weather yesterday and adds //*[date = today() – 1] by union
§ New query still matches T2, so T2 could be applied again, adding //*[date = today() – 1] once more, and so on
§ Infinite recursion

Problem: Recursive Matches (2/2)
§ Trails may be mutually recursive
  § T3: //*.tuple.date → //*.tuple.modified
  § T10: //*.tuple.modified → //*.tuple.date
§ T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1]
§ T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1]
§ We again match T3 and enter an infinite loop

Solution: Multiple Match Coloring Algorithm
§ Trails:
  § T1: weather → //Temperatures/*
  § T2: yesterday → //*[date = today() – 1]
  § T3: //*.tuple.date → //*.tuple.modified
  § T4: //*.tuple.date → //*.tuple.received
§ First Level: T1 matches weather and T2 matches yesterday, adding //Temperatures/* and //*[date = today() – 1]
§ Second Level: T3 and T4 match //*[date = today() – 1], adding //*[modified = today() – 1] and //*[received = today() – 1]

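The slides do not spell out the algorithm itself, so the following is only a simplified sketch of how a coloring scheme can stop recursive matches. It assumes that every query node remembers (as "colors") the trails already applied on its derivation path and that a trail is never applied to a node carrying its own color. The names `SimpleTrail`, `QueryNode`, and `expand` are hypothetical, the query is flattened to a list of terms, and matching is reduced to exact string equality instead of real path and attribute matching.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class SimpleTrail:
    name: str
    left: str
    right: str


@dataclass
class QueryNode:
    term: str
    colors: Set[str] = field(default_factory=set)  # trails already used for this node


def expand(terms: List[str], trails: List[SimpleTrail], max_levels: int = 10) -> List[str]:
    """Expand query terms level by level, skipping trails whose color a node already has."""
    nodes = [QueryNode(t) for t in terms]
    for _ in range(max_levels):
        added: List[QueryNode] = []
        for node in nodes:
            for trail in trails:
                if trail.left == node.term and trail.name not in node.colors:
                    node.colors.add(trail.name)  # color the matched node
                    # The new node inherits all colors of the node it came from.
                    added.append(QueryNode(trail.right, set(node.colors)))
        if not added:                            # no new matches: rewriting has converged
            break
        nodes.extend(added)
    # Report each distinct term once, in order of first appearance.
    return list(dict.fromkeys(n.term for n in nodes))


DATE = "//*[date = today() - 1]"
MODIFIED = "//*[modified = today() - 1]"
trails = [
    SimpleTrail("T1", "weather", "//Temperatures/*"),
    SimpleTrail("T2", "yesterday", DATE),
    SimpleTrail("T3", DATE, MODIFIED),
    SimpleTrail("T10", MODIFIED, DATE),   # mutually recursive with T3
]
print(expand(["weather", "yesterday"], trails))
```

In this sketch the mutually recursive pair T3 and T10 terminates: the date node derived through T10 already carries the color T3, so T3 is not applied to it again.
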
Multiple Match Coloring Algorithm Analysis
§ Problem: MMCA is exponential in number of levels
§ Solution: Trail Pruning
  § Prune by number of levels
  § Prune by top-K trails matched in each level
  § Prune by both top-K trails and number of levels

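Pruning by top-K trails per level can be illustrated with a short sketch. It assumes that the trails matched in a level are available as (name, probability) pairs and that the trail probability is the ranking criterion; the function name `prune_top_k` is hypothetical.

```python
import heapq
from typing import List, Tuple


def prune_top_k(matched: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Keep only the k most probable trails among those matched in one level."""
    return heapq.nlargest(k, matched, key=lambda match: match[1])


# Trails matched in one level, as (trail name, probability) pairs.
matched_in_level = [("T1", 0.9), ("T2", 1.0), ("T3", 0.4)]
print(prune_top_k(matched_in_level, 2))
```

Combined with a cap on the number of levels, this bounds the work in such a simplified model: with at most K trails applied per level and at most L levels, no more than K · L trail applications take place.
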
Outline
§ Motivation
§ iTrails
§ Experiments
§ Conclusion and Future Work

iTrails Evaluation in iMeMex
§ iMeMex Dataspace System: Open-source prototype available at http://www.imemex.org
§ Main Questions in Evaluation
  § Quality: Top-K Precision and Recall
  § Performance: Use of Materialization
  § Scalability: Query-rewrite Time vs. Number of Trails

iTrails Evaluation in iMeMex
§ Scenario 1: Few High-quality Trails
  § Closer to information integration use cases
  § Obtained real datasets and indexed them
  § 18 hand-crafted trails
  § 14 hand-crafted queries
§ Scenario 2: Many Low-quality Trails
  § Closer to search use cases
  § Generated up to 10,000 trails

iTrails Evaluation in iMeMex: Scenario 1
§ Configured iMeMex to act in three modes
  § Baseline: Graph / IR search engine
  § iTrails: Rewrite search queries with trails
  § Perfect Query: Semantics-aware query
§ Data: shipped to central index
§ [Figure: data sizes in MB for Laptop, Web Server, Email Server, DB Server]

Quality: Top-K Precision and Recall
§ Scenario 1: few high-quality trails (18 trails), K = 20
§ [Figure: top-K precision and recall per query]
§ Search Engine misses relevant results
§ Search Query is partially semantics-aware
§ Queries highlighted in the figure:
  § Q3: pdf yesterday
  § Q13: to = raimund.grube@enron.com
§ Perfect Query always has precision and recall equal to 1

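For reference, top-K precision and recall can be computed as in the sketch below. It uses the common textbook definitions (hits among the first K results, divided by the number of results inspected or by the number of relevant items); the exact definitions used in the evaluation may differ, and the function name and sample data are made up for illustration.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Return (precision, recall) over the first k entries of a ranked result list."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked results and relevance judgments.
ranked_results = ["a", "b", "c", "d"]
relevant_items = {"a", "c", "e"}
print(precision_recall_at_k(ranked_results, relevant_items, 2))
```

Under these definitions, a query whose first K results are exactly the relevant items has precision and recall equal to 1, which matches the behavior stated for the perfect query.
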
Performance: Use of Materialization
§ Scenario 1: few high-quality trails (18 trails)
§ [Figure: response times in sec.]
§ Trail merging adds overhead to query execution
§ Trail Materialization provides interactive times for all queries

Scalability: Query-rewrite Time vs. Number of Trails
§ Scenario 2: many low-quality trails
§ [Figure: query-rewrite time vs. number of trails]
§ Query-rewrite time can be controlled with pruning

Conclusion: Pay-as-you-go Information Integration
§ [Figure: the query global warming zurich posed to a Dataspace System over Data Sources (text, links)]
§ Step 1: Provide a search service over all the data
§ Step 2: Add integration semantics via trails
§ Step 3: If more semantics needed, go back to step 2
§ Our Contributions
  § iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  § We propose a framework and algorithms for Pay-as-you-go Information Integration
  § Smooth transition between search and data integration

Future Work
§ Trail Creation
  § Use collections (ontologies, thesauri, Wikipedia)
  § Work on automatic mining of trails from the dataspace
§ Other types of trails
  § Associations
  § Lineage

Questions?
Thanks in advance for your feedback!
marcos.vazsalles@inf.ethz.ch
http://www.imemex.org

Backup Slides

Problem: Global Warming in Zurich
§ Query: “What is the impact of global warming in Zurich?”
§ Search for: global warming zurich
§ Meaning of keyword query
  § global warming should lead to query on Temperatures
  § zurich should lead to a query for a city

Problem: PDF Yesterday
§ Query: “Retrieve all PDF documents added/modified yesterday”
§ Search for: pdf yesterday
§ Meaning of keywords pdf and yesterday
§ Different sources, different schemas:
  § Laptop: modified
  § Email: received
  § DBMS: changed

Related Work: Search vs. Data Integration vs. Dataspaces
Features            | Search             | Dataspaces    | Data Integration
Integration Effort  | Low                | Pay-as-you-go | High
Query Semantics     | Precision / Recall |               | Precise
Need for Schema     | Schema-never       | Schema-later  | Schema-first

Personal Dataspaces Literature
§ Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
§ Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
§ Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
§ Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
§ Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007.
§ Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.

iDM: iMeMex Data Model
§ Our approach: get the data model closer to personal information, not the other way around
§ Supports:
  § Unstructured, semi-structured and structured data, e.g., files & folders, XML, relations
  § Clear separation of logical and physical representation of data
  § Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  § Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  § Infinite data, e.g., media and data streams
§ See VLDB 2006

Data Model Options
§ Data Models compared: Bag of Words, Relational, XML, iDM
§ Features listed: Non-schematic data, Serialization independent, Support for Personal Data, Support for Graph data, Support for Lazy Computation, Support for Infinite data
§ Support for Graph data: Relational via specific schema; XML via extension (XLink / XPointer)
§ Support for Lazy Computation: Relational via view mechanism; XML via extension (ActiveXML)
§ Support for Infinite data: Bag of Words via extension (Document streams); Relational via extension (Relational streams); XML via extension (XML streams)

Data Models for Personal Information
§ [Figure: Abstraction Level axis from lower (Physical Level) to higher (Personal Information), positioning Relational, XML, Document / Bag of Words, and iDM]

Architectural Perspective of iMeMex
§ Complex operators (query algebra)
§ Indexes & Replicas access (warehousing)
§ Data source access (mediation)