Indexing Dataspaces
Xin (Luna) Dong, University of Washington
Alon Halevy, Google Inc.
SIGMOD 2007
Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: data sources D1, D2, D3, D4, D5]
Traditional Data Integration Systems
Mediated schema: Publication(title, year, author)
Source schema: Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
Semantic mapping: SELECT P.title AS title, P.year AS year, A.name AS author FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid
[Figure: the mediated schema sits over data sources D1 to D5]
Querying on Traditional Data Integration Systems
[Figure: a query Q over the mediated schema, with queries Q1 to Q5 sent to data sources D1 to D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 with the mappings marked "?"]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space Intranet Internet
Querying Dataspaces
- Dataspaces
  - Collections of heterogeneous data sources
  - Don’t necessarily include semantic mappings
  - Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
- Pay-as-you-go data management
  - Provide some services from the outset
  - Improve the mappings on an as-needed basis
- How to effectively query and search a dataspace?
Example Dataspace
Source 1 (XML):
<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
Source 2 (XML):
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>
Source 3 (relational table):
stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …
Searching and Querying a Dataspace
- Structured query?
  - Requires detailed knowledge of schemas
  - Requires precise attribute values
- Keyword search?
  - Forgiving, but…
  - Does not allow specifications on structure
- We consider queries that are
  - keyword-based
  - structure-aware
I. Predicate Query (over the example dataspace)
- Conjunction of predicates
- Predicate: (v, {K1, …, Kn})
  - v: an attribute or association label
  - {K1, …, Kn}: a keyword set
- Example I: (title, ‘Semex’) (author, ‘Luna Dong’)
- Example II: (name, ‘Dong’)
II. Neighborhood Keyword Query (over the example dataspace)
- Form: {K1, …, Kn}
- Example: ‘Semex’
  - Relevant items
  - Associated items
Indexing of the Heterogeneous Data
- Challenges
  - Index data from heterogeneous data sources
  - Capture both text values and structural information
- Traditional Indexes
  - Build a separate index for each attribute to support structured queries
  - Build an inverted list to support keyword search
  - XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
- Our approach: Extend inverted lists to capture both text values and structure of the data
Contributions
- Design an index that
  - indexes data from heterogeneous data sources
  - captures both structure and text of the data
  - incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations
Outline
- Motivation (done)
- Overview of our approach (current)
- Our algorithm
  - Indexing structure
  - Indexing hierarchies
- Experimental Results
- Conclusions
View Data Sources as Triple Base
Source fragment: <publication> <title>Semex: Toward …</title> <authors> <author><name>Xin Dong</name></author> <author><name>Alon Halevy</name></author> </authors> … </publication>
[Figure: triple base with the instances "Semex: …", "Luna Dong", and "Alon Halevy"; the publication is linked to each person by an author association. Legend: attribute, object, association.]
Departmental Database:
StuID    lastName  firstName  …
1000001  Xin       Dong       …
…        …         …
Goal: Index triples to efficiently answer queries that combine text and structure
Indexing a Triple Base Using an Inverted List
[Figure: the triple base and departmental database from the previous slide, next to an inverted list with one row per keyword and one column per instance; a cell holding 1 marks an instance that contains the keyword]
Inverted list rows: Alon, Dong, Halevy, Luna, Semex, Xin
Query: Dong
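A plain inverted list over a triple base can be sketched in a few lines of Python. This is only an illustration: the instance ids (paper1, person1, and so on), the INSTANCES dictionary, and the function name are hypothetical, and the real index is built on Lucene rather than on Python dictionaries.

```python
from collections import defaultdict

# Hypothetical toy triple base: instance id -> list of (attribute, text value).
INSTANCES = {
    "paper1": [("title", "Semex")],
    "person1": [("name", "Luna Dong")],
    "person2": [("name", "Alon Halevy")],
    "student1": [("lastName", "Xin"), ("firstName", "Dong")],
}

def build_inverted_list(instances):
    """Map each lowercased keyword to {instance id: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for inst, attrs in instances.items():
        for _attr, value in attrs:
            for word in value.lower().split():
                index[word][inst] += 1
    return index

plain_index = build_inverted_list(INSTANCES)
# Keyword query "Dong": every instance containing the keyword, regardless of attribute.
print(sorted(plain_index["dong"]))  # -> ['person1', 'student1']
```

The lookup cannot tell a first name from an author name, which is the limitation the following slides address.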
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (current)
  - Indexing structure (current)
  - Indexing hierarchies
- Experimental Results
- Conclusions
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
Index lookup: “Dong/firstName/”
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
Incorporate Association Labels in the Inverted List
Query: author “Dong”
Index lookup: “Dong/author/”
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/author/, Luna/name/, Semex/title/, Xin/name/lastName/
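The two labeling steps can be sketched as follows, assuming a toy triple base with hypothetical instance ids and a hypothetical predicate helper. A keyword found in an attribute value is indexed as keyword/attribute/ on the owning instance; a keyword found in an associated instance is indexed as keyword/association/ on the instance that holds the association. Only single-keyword predicates are handled here.

```python
from collections import defaultdict

# Hypothetical toy data: attribute triples and association triples.
ATTRIBUTES = {
    "paper1": [("title", "Semex")],
    "person1": [("name", "Luna Dong")],
    "person2": [("name", "Alon Halevy")],
    "student1": [("lastName", "Xin"), ("firstName", "Dong")],
}
ASSOCIATIONS = [("paper1", "author", "person1"), ("paper1", "author", "person2")]

def build_labeled_index(attributes, associations):
    """Index keys have the form 'keyword/label/'; values are {instance: count}."""
    index = defaultdict(lambda: defaultdict(int))
    for inst, attrs in attributes.items():
        for attr, value in attrs:
            for word in value.lower().split():
                index[f"{word}/{attr}/"][inst] += 1
    for src, label, dst in associations:
        # Keywords of the associated instance are indexed on the source instance.
        for _attr, value in attributes.get(dst, []):
            for word in value.lower().split():
                index[f"{word}/{label}/"][src] += 1
    return index

def predicate(index, label, keyword):
    """Answer a single-keyword predicate (label, keyword) with one row lookup."""
    return set(index.get(f"{keyword.lower()}/{label}/", {}))

labeled_index = build_labeled_index(ATTRIBUTES, ASSOCIATIONS)
print(predicate(labeled_index, "firstName", "Dong"))  # -> {'student1'}
print(predicate(labeled_index, "author", "Dong"))     # -> {'paper1'}
```

Because the label is part of the indexed key, a predicate still costs a single lookup, at the price of more rows in the list.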
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (current)
  - Indexing structure (done)
  - Indexing hierarchies (current)
- Experimental Results
- Conclusions
Hierarchies of Attributes and Associations
- Example II: (name, ‘Dong’) over the example dataspace
- Attribute hierarchy: name has the sub-attributes firstName and lastName
Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Attribute hierarchy: name has the sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
Naïve Approach: Expand Queries with Sub-Attributes
Query: name “Dong”
Index lookup: “Dong/name/ OR Dong/firstName/ OR …”
Attribute hierarchy: name has the sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
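A minimal sketch of the naive expansion, assuming a hypothetical SUB_ATTRIBUTES map for the hierarchy and a small hand-written index (a dictionary from key to instance ids). One lookup is issued per attribute in the subtree, so the cost grows with the size of the hierarchy.

```python
# Hypothetical attribute hierarchy: parent -> direct sub-attributes.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName"]}

# Hypothetical labeled inverted list: 'keyword/attribute/' -> instance ids.
NAIVE_INDEX = {
    "alon/name/": {"person2"},
    "dong/firstName/": {"student1"},
    "dong/name/": {"person1"},
    "halevy/name/": {"person2"},
    "luna/name/": {"person1"},
    "semex/title/": {"paper1"},
    "xin/lastName/": {"student1"},
}

def subtree(attr, sub_attributes):
    """Return attr followed by all of its descendants in the hierarchy."""
    result = [attr]
    for child in sub_attributes.get(attr, []):
        result.extend(subtree(child, sub_attributes))
    return result

def expanded_keys(attr, keyword, sub_attributes=SUB_ATTRIBUTES):
    """Expand (attr, keyword) into one index key per attribute in the subtree."""
    return [f"{keyword.lower()}/{a}/" for a in subtree(attr, sub_attributes)]

def naive_lookup(index, attr, keyword):
    """OR together the rows of every expanded key."""
    hits = set()
    for key in expanded_keys(attr, keyword):
        hits |= index.get(key, set())
    return hits

print(expanded_keys("name", "Dong"))
print(naive_lookup(NAIVE_INDEX, "name", "Dong"))
```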
Approach I: Duplicate Entries for Parent Attributes
Query: name “Dong”
Index lookup: “Dong/name/”
Attribute hierarchy: name has the sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/, Xin/name/
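Approach I can be sketched by indexing every keyword once under its own attribute and once under each ancestor attribute. The PARENT map and the instance ids below are hypothetical. A query on a parent attribute becomes a single lookup, but the list grows with the depth of the hierarchy.

```python
# Hypothetical attribute hierarchy: child -> parent.
PARENT = {"firstName": "name", "lastName": "name"}

DUP_ATTRIBUTES = {
    "paper1": [("title", "Semex")],
    "person1": [("name", "Luna Dong")],
    "person2": [("name", "Alon Halevy")],
    "student1": [("lastName", "Xin"), ("firstName", "Dong")],
}

def self_and_ancestors(attr, parent=PARENT):
    """Return attr followed by its ancestors, nearest first."""
    chain = [attr]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def build_index_with_duplicates(attributes):
    """Duplicate each entry under every ancestor of its attribute."""
    index = {}
    for inst, attrs in attributes.items():
        for attr, value in attrs:
            for word in value.lower().split():
                for label in self_and_ancestors(attr):
                    index.setdefault(f"{word}/{label}/", set()).add(inst)
    return index

dup_index = build_index_with_duplicates(DUP_ATTRIBUTES)
print(sorted(dup_index["dong/name/"]))       # -> ['person1', 'student1']
print(sorted(dup_index["dong/firstName/"]))  # -> ['student1']
```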
Approach II. Concatenate a Keyword with a Hierarchy Path
Query: name “Dong”
Index lookup (prefix search): “Dong/name/*”
Attribute hierarchy: name has the sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
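Approach II can be sketched with a sorted key list and a prefix scan. The deck states that Lucene stores an inverted list as a sorted array; here Python's bisect module stands in for that array, and PATH_PARENT, the instance ids, and the function names are hypothetical. Each entry is stored once, and a query on a parent attribute reads every row whose key starts with keyword/parent/.

```python
import bisect

# Hypothetical attribute hierarchy: child -> parent.
PATH_PARENT = {"firstName": "name", "lastName": "name"}

PATH_ATTRIBUTES = {
    "paper1": [("title", "Semex")],
    "person1": [("name", "Luna Dong")],
    "person2": [("name", "Alon Halevy")],
    "student1": [("lastName", "Xin"), ("firstName", "Dong")],
}

def hierarchy_path(attr, parent=PATH_PARENT):
    """Return the root-to-attribute path, e.g. 'name/firstName/'."""
    parts = [attr]
    while parts[0] in parent:
        parts.insert(0, parent[parts[0]])
    return "/".join(parts) + "/"

def build_path_index(attributes):
    """Index each keyword once, under the full hierarchy path of its attribute."""
    index = {}
    for inst, attrs in attributes.items():
        for attr, value in attrs:
            for word in value.lower().split():
                index.setdefault(f"{word}/{hierarchy_path(attr)}", set()).add(inst)
    return sorted(index), index

def prefix_search(keys, index, prefix):
    """Scan the sorted keys from the first key >= prefix; return (hits, rows read)."""
    i = bisect.bisect_left(keys, prefix)
    hits, rows_read = set(), 0
    while i < len(keys) and keys[i].startswith(prefix):
        hits |= index[keys[i]]
        rows_read += 1
        i += 1
    return hits, rows_read

path_keys, path_index = build_path_index(PATH_ATTRIBUTES)
print(prefix_search(path_keys, path_index, "dong/name/"))
```

In this toy data the query name “Dong” reads two rows (dong/name/ and dong/name/firstName/); the number of rows read per query is what Approach III bounds.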
Approach III. Hierarchy Path + Summary Rows
Query: name “Dong”
Index lookup (prefix search): “Dong/name/*”
Attribute hierarchy: name has the sub-attributes firstName and lastName
Inverted list rows: Alon/name/, Dong/name// (summary row), Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Summary Rows
- Goal: Given a threshold t, answer any prefix search by reading no more than t rows.
- Definition:
  - The indexed keyword of a summary row has the form p//, e.g. “Dong/name//”
  - Rows starting with p/ are shadowed by the summary row p//, e.g. “Dong/name/lastName/” is shadowed by “Dong/name//”
Answering Prefix Search with Summary Rows
- Once a summary row has been read, ignore the rows it shadows
- Examples (t=1)
  - Query: name “Dong”, index lookup “Dong/name/*”
  - Query: name “Xin”, index lookup “Xin/name/*”
Inverted list rows: Alon/name/, Dong/name// (summary row), Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Adding Summary Rows
- Step 1. Create a summary row for a prefix p if
  - Searching prefix p needs to read more than t rows
  - There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows
- Step 2. Remove row p if summary row p/ exists
- Example (t=1)
  - Before: Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
  - Searching the prefix Dong/name/ reads two rows, which exceeds t, so a summary row is created and the row Dong/name/ is removed
  - After: Alon/name/, Dong/name// (summary row), Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
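The two steps and the shadow-skipping lookup can be sketched as below. This is one possible reading of the slides, not the authors' implementation: it assumes candidate prefixes are processed from longest to shortest, that a summary row stores the union of the rows it covers, and that '0' can serve as the upper bound of a shadowed range because it is the ASCII character right after '/'. All names and the toy index are hypothetical.

```python
import bisect

def rows_for_prefix(keys, prefix):
    """Keys read by a prefix search; rows shadowed by a summary row are skipped."""
    read = []
    i = bisect.bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        key = keys[i]
        read.append(key)
        if key.endswith("//"):
            # Summary row for p = key[:-2]: jump past every row starting with p + "/".
            i = bisect.bisect_left(keys, key[:-2] + "0", i + 1)
        else:
            i += 1
    return read

def add_summary_rows(index, t):
    """Return a copy of the index with summary rows added for threshold t."""
    index = {k: set(v) for k, v in index.items()}
    candidates = set()
    for key in index:
        parts = key.rstrip("/").split("/")
        for n in range(1, len(parts) + 1):
            candidates.add("/".join(parts[:n]) + "/")
    # Assumption: handle the longest prefixes first, so only the deepest get a summary.
    for prefix in sorted(candidates, key=lambda p: (-p.count("/"), p)):
        read = rows_for_prefix(sorted(index), prefix)
        if len(read) > t:                                 # Step 1
            summary = set().union(*(index[k] for k in read))
            index.pop(prefix, None)                       # Step 2
            index[prefix + "/"] = summary
    return index

def answer(index, prefix):
    """Return (matching instances, number of rows read)."""
    read = rows_for_prefix(sorted(index), prefix)
    return set().union(*(index[k] for k in read)), len(read)

# Hypothetical hierarchy-path index before summary rows are added.
BASE_INDEX = {
    "alon/name/": {"person2"},
    "dong/name/": {"person1"},
    "dong/name/firstName/": {"student1"},
    "halevy/name/": {"person2"},
    "luna/name/": {"person1"},
    "semex/title/": {"paper1"},
    "xin/name/lastName/": {"student1"},
}

summarized = add_summary_rows(BASE_INDEX, t=1)
print(sorted(summarized))
print(answer(summarized, "dong/name/"))
print(answer(summarized, "xin/name/"))
```

With t=1 the sketch produces the same row set as the slide example: dong/name/ is replaced by the summary row dong/name//, and both example queries are answered by reading one row. The shadowed row dong/name/firstName/ stays in the list so that the more specific query firstName “Dong” still works.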
Answering Neighborhood Keyword Queries
Query: Semex
Index lookup (prefix search): “Semex/*”
[Figure: the triple base with author associations and their reverse direction, labeled ~author]
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/~author/, Semex/title/, Xin/name/lastName/
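A neighborhood keyword query can be sketched as a prefix search on the keyword alone, assuming the index also holds a row for the reverse direction of each association (written ~author in the slide). The toy data, instance ids, and function names are hypothetical, and attribute hierarchies are left out to keep the sketch short.

```python
import bisect

NB_ATTRIBUTES = {
    "paper1": [("title", "Semex")],
    "person1": [("name", "Luna Dong")],
    "person2": [("name", "Alon Halevy")],
}
NB_ASSOCIATIONS = [("paper1", "author", "person1"), ("paper1", "author", "person2")]

def words(attributes, inst):
    """Lowercased keywords appearing in the attribute values of an instance."""
    return [w for _attr, value in attributes.get(inst, []) for w in value.lower().split()]

def build_neighborhood_index(attributes, associations):
    index = {}
    def add(key, inst):
        index.setdefault(key, set()).add(inst)
    for inst, attrs in attributes.items():
        for attr, value in attrs:
            for word in value.lower().split():
                add(f"{word}/{attr}/", inst)
    for src, label, dst in associations:
        for word in words(attributes, dst):
            add(f"{word}/{label}/", src)    # forward direction, e.g. dong/author/
        for word in words(attributes, src):
            add(f"{word}/~{label}/", dst)   # reverse direction, e.g. semex/~author/
    return index

def neighborhood(index, keyword):
    """Prefix search on 'keyword/': relevant items plus associated items."""
    prefix = keyword.lower() + "/"
    keys = sorted(index)
    i = bisect.bisect_left(keys, prefix)
    hits = set()
    while i < len(keys) and keys[i].startswith(prefix):
        hits |= index[keys[i]]
        i += 1
    return hits

nb_index = build_neighborhood_index(NB_ATTRIBUTES, NB_ASSOCIATIONS)
print(sorted(neighborhood(nb_index, "Semex")))  # -> ['paper1', 'person1', 'person2']
```

The row semex/title/ contributes the relevant item (the paper), and semex/~author/ contributes the associated items (its authors).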
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (done)
  - Indexing structure (done)
  - Indexing hierarchies (done)
- Experimental Results (current)
- Conclusions
Implementation Details
- Our index extends the Lucene indexing tool
  - Lucene stores an inverted list as a sorted array
- Implemented in Java
- Run on a machine with four 3.2 GHz CPUs with 1024 KB cache, and 1 GB memory
Experimental Setting
- Data sets
  - A 50 MB personal data set
  - Two 10 GB XML data sets: Wikipedia, XMark Benchmark
- Queries: with one predicate or keyword
  - Predicate Query with leaf attributes
  - Predicate Query with branch attributes
  - Predicate Query with associations
  - Neighborhood Keyword Query
- Measure: in milliseconds
  - Index-lookup time
  - Query-answering time
Our Indexing Method Significantly Improves Query Answering
Times in ms. Plain Inverted List: 10.6 MB. Extended Inverted List: 15.2 MB.

Query Type                          Plain: Index Lookup  Plain: Query Answer  Extended: Index Lookup  Extended: Query Answer
Pred Query with leaf attributes     2                    22                   4                       6
Pred Query with branch attributes   3                    43                   4                       6
Pred Query with associations        3                    88                   6                       17
Neighborhood Keyword Query          18                   4174                 48                      97
XML Index [Kaushik et al., SIGMOD ’05]
- Three indexes
  - Inverted list: index each attribute value on its text
  - Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  - Relationship index: index each instance on its associated instances
Our Indexing Method Performs Better Than XML Indexes
Times in ms. XML Index: 28.1 MB. Extended Inverted List: 15.2 MB.

Query Type                          XML: Index Lookup  XML: Query Answer  Extended: Index Lookup  Extended: Query Answer
Pred Query with leaf attributes     7                  9                  4                       6
Pred Query with branch attributes   7                  11                 4                       6
Pred Query with associations        301                415                6                       17
Neighborhood Keyword Query          365                488                48                      97
Our Indexing Method Scales Well

                                    Wikipedia          XMark w/o asso     XMark with asso
Index                               4.15 hr (1.13 GB)  6.64 hr (3.04 GB)  12.72 hr (4.08 GB)
Pred Query with leaf attributes     156                94                 116
Pred Query with branch attributes   -                  67                 93
Pred Query with associations        -                  -                  217
Neighborhood Keyword Query          1646               1838               13468
Conclusions
- Contributions: An index for heterogeneous data
  - Index heterogeneous data from multiple sources through a (virtual) central triple base
  - Extend inverted lists to capture both text and structure of data
- Future Work
  - Support value heterogeneity
  - Incorporate approximate matching of schema terms and object instances