Indexing Dataspaces
Xin (Luna) Dong, University of Washington
Alon Halevy, Google Inc.
SIGMOD 2007
Presented by Aditya Sakhuja

Hi, I am Aditya Sakhuja, from India
- 1st semester, MS-CS, CoC
- Ongoing research: automation of schema matching
- Interested in online-enabled apps (should be path-breaking), IR, databases, security, the latest web technologies, skydiving, and scuba diving

Outline
[>] Motivation
[ ] Overview of our approach
[ ] Our algorithm
    [ ] Indexing structure
    [ ] Indexing hierarchies
[ ] Experimental Results
[ ] Conclusions

Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: data sources D1, D2, D3, D4, D5]

Traditional Data Integration Systems
Mediated schema: Publication(title, year, author)
Source relations: Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
Mapping query:
    SELECT P.title AS title, P.year AS year, A.name AS author
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid
[Figure: data sources D1 to D5 around the mediated schema]

Querying on Traditional Data Integration Systems
[Figure: query Q posed on the mediated schema; queries Q1 to Q5 posed on data sources D1 to D5]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 around a question mark]

Scenario 1. Different Websites About Movies

Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]

Querying Dataspaces
- Dataspaces
  - Collections of heterogeneous data sources
  - Don’t necessarily include semantic mappings
  - Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
How to effectively query and search a dataspace?

Example Dataspace

<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>

<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …

Searching and Querying a Dataspace
- Structured query?
  - Requires detailed knowledge of schemas
  - Requires precise attribute values
- Keyword search?
  - Does not allow specifications on structure
- We consider queries that are
  - keyword-based
  - structure-aware

I. Predicate Query
[Example: the publication, thesis-proposal, and student table from the “Example Dataspace” slide]
- Conjunction of predicates
- Predicate: (v, {K1, …, Kn})
  - v: an attribute or association label
  - {K1, …, Kn}: a keyword set
- Example I: (title, ‘Semex’) (author, ‘Luna Dong’)
- Example II: (name, ‘Dong’)

II. Neighborhood Keyword Query
[Example: the publication, thesis-proposal, and student table from the “Example Dataspace” slide]
- Form: {K1, …, Kn}
- Example: ‘Semex’
  - Relevant items
  - Associated items

Indexing of the Heterogeneous Data
- Challenges
  - Index data from heterogeneous data sources
  - Capture both text values and structural information
- Traditional indexes
  - Build a separate index for each attribute to support structured queries
  - Build an inverted list to support keyword search
  - XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
- Our approach: Extend inverted lists to capture both text values and structure of the data

Contributions
- Design an index that
  - indexes data from heterogeneous data sources
  - captures both structure and text of the data
  - incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations

Outline
[x] Motivation
[>] Overview of our approach
[ ] Our algorithm
    [ ] Indexing structure
    [ ] Indexing hierarchies
[ ] Experimental Results
[ ] Conclusions

View Data Sources as Triple Base

<publication>
  <title>Semex: Toward …</title>
  <authors>
    <author><name>Xin Dong</name></author>
    <author><name>Alon Halevy</name></author>
  </authors>
  …
</publication>

[Figure: triple-base graph with the objects “Semex: …”, “Alon Halevy”, and “Luna Dong” linked by “author” associations; legend: Attribute, Object, Association]

View Data Sources as Triple Base
[Figure: the triple-base graph (“Semex: …”, “Alon Halevy”, “Luna Dong”, “author” associations) next to the Departmental Database table]

Departmental Database
StuID    lastName  firstName  …
1000001  Xin       Dong       …
…        …         …

Goal: Index triples to efficiently answer queries that combine text and structure

Indexing a Triple Base Using an Inverted List
Query: Dong
[Figure: triple base and Departmental Database table]
[Table: inverted list with one row per keyword (Alon, Dong, Halevy, Luna, Semex, Xin) and one column per instance; a cell holds 1 where the keyword occurs in that instance]
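
The plain inverted list on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Lucene-based implementation: the instance ids (`paper1`, `person1`, ...), the lowercase tokenizer, and the helper names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical toy instances: instance id -> text values attached to it.
INSTANCES = {
    "paper1": ["Semex: Personal information management and integration"],
    "person1": ["Luna Dong"],
    "person2": ["Alon Halevy"],
    "student1": ["Xin", "Dong"],
}


def tokenize(text):
    """Split text into lowercase alphanumeric keywords (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_list(instances):
    """Map each keyword to {instance id: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for instance_id, values in instances.items():
        for value in values:
            for keyword in tokenize(value):
                index[keyword][instance_id] += 1
    return index


def lookup(index, keyword):
    """Return the instances that contain the keyword, with occurrence counts."""
    return dict(index.get(keyword.lower(), {}))


index = build_inverted_list(INSTANCES)
print(lookup(index, "Dong"))
```

The lookup finds every instance that mentions "Dong" but cannot tell a first name from an author, which is the gap the following slides address.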

Outline
[x] Motivation
[x] Overview of our approach
[>] Our algorithm
    [>] Indexing structure
    [ ] Indexing hierarchies
[ ] Experimental Results
[ ] Conclusions

Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
Lookup key: “Dong/firstName/”
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Incorporate Association Labels in the Inverted List
Query: author “Dong”
Lookup key: “Dong/author/”
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/author/, Luna/name/, Semex/title/, Xin/name/lastName/
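
A rough sketch of how keywords might be concatenated with attribute and association labels, so that a single-predicate query becomes one key lookup. The triple layout, the instance ids, and the choice to lowercase keywords are assumptions for illustration; the slides only show the resulting keys such as "Dong/firstName/" and "Dong/author/".

```python
import re
from collections import defaultdict

# Hypothetical attribute triples: (instance, attribute label, text value).
ATTRIBUTES = [
    ("paper1", "title", "Semex: Personal information management"),
    ("person1", "name", "Luna Dong"),
    ("person2", "name", "Alon Halevy"),
    ("student1", "firstName", "Dong"),
    ("student1", "lastName", "Xin"),
]
# Hypothetical association triples: (source instance, label, target instance).
ASSOCIATIONS = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
]


def tokens(text):
    """Lowercase alphanumeric keywords (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(attributes, associations):
    """Index 'keyword/label/' -> {instance: count} for attributes and associations."""
    index = defaultdict(lambda: defaultdict(int))
    values = defaultdict(list)  # instance -> its attribute values
    for instance, label, text in attributes:
        values[instance].append(text)
        for keyword in tokens(text):
            index[f"{keyword}/{label}/"][instance] += 1
    # A source instance is indexed under the keywords of its associated target.
    for source, label, target in associations:
        for text in values[target]:
            for keyword in tokens(text):
                index[f"{keyword}/{label}/"][source] += 1
    return index


def predicate(index, label, keyword):
    """Answer a single predicate (label, keyword) with one key lookup."""
    return dict(index.get(f"{keyword.lower()}/{label}/", {}))


index = build_index(ATTRIBUTES, ASSOCIATIONS)
print(predicate(index, "firstName", "Dong"))
print(predicate(index, "author", "Dong"))
```

With these keys, `(author, 'Dong')` returns the paper whose author is named Dong, while `(firstName, 'Dong')` returns only the student record.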

Outline
[x] Motivation
[x] Overview of our approach
[>] Our algorithm
    [x] Indexing structure
    [>] Indexing hierarchies
[ ] Experimental Results
[ ] Conclusions

Hierarchies of Attributes and Associations

<publication>
  <title>Semex: Toward on-the-fly personal information integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>

<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

stuID    lastName  firstName  entryYear
5001438  Xin       Dong       2001
…        …

- Example II: (name, ‘Dong’)

Attribute hierarchy:
name
  firstName
  lastName

Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Attribute hierarchy: name, with sub-attributes firstName and lastName
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Naïve Approach: Expand Queries with Sub-Attributes
Query: name “Dong”
Lookup key: “Dong/name/ OR Dong/firstName/ OR …”
Attribute hierarchy: name, with sub-attributes firstName and lastName
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/

Approach I: Duplicate Entries for Parent Attributes
Query: name “Dong”
Lookup key: “Dong/name/”
Attribute hierarchy: name, with sub-attributes firstName and lastName
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/, Xin/name/

Approach II. Concatenate a Keyword with a Hierarchy Path
Query: name “Dong”
Lookup key: “Dong/name/*”
Attribute hierarchy: name, with sub-attributes firstName and lastName
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
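
To make Approach II concrete, the sketch below builds keys of the form keyword/hierarchy-path/ and answers "Dong/name/*" as a prefix search over a sorted key array, located with binary search. The `PARENT` map, the helper names, and the extra ("Dong", "name") entry are assumptions for the example; the slides do not prescribe how the hierarchy is stored.

```python
from bisect import bisect_left

# Hypothetical attribute hierarchy: child label -> parent label.
PARENT = {"firstName": "name", "lastName": "name"}


def hierarchy_path(label):
    """Return the root-to-label path, e.g. 'name/firstName/'."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return "/".join(reversed(path)) + "/"


def make_key(keyword, label):
    """Concatenate a lowercased keyword with the label's hierarchy path."""
    return f"{keyword.lower()}/{hierarchy_path(label)}"


def prefix_search(sorted_keys, prefix):
    """Return every key that starts with prefix; matches are contiguous when sorted."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


KEYS = sorted(
    make_key(keyword, label)
    for keyword, label in [
        ("Alon", "name"), ("Dong", "name"), ("Dong", "firstName"),
        ("Halevy", "name"), ("Luna", "name"), ("Semex", "title"),
        ("Xin", "lastName"),
    ]
)

# Query (name, 'Dong') becomes the prefix search "dong/name/*".
print(prefix_search(KEYS, make_key("Dong", "name")))
```

One prefix scan covers the attribute and all of its sub-attributes, so neither query expansion (the naïve approach) nor duplicated rows (Approach I) are needed. The cost is that a prefix may match many rows, which motivates summary rows.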

Approach III. Hierarchy Path + Summary Rows
Query: name “Dong”
Lookup key: “Dong/name/*”
Attribute hierarchy: name, with sub-attributes firstName and lastName
[Figure: triple base and Departmental Database table]
Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/

Summary Rows
- Goal: Given a threshold t, answer any prefix search by reading no more than t rows.
- Definition:
  - The indexed keyword: p//
    E.g. “Dong/name//”
  - Rows starting with p/ are shadowed by the summary row p//
    E.g. “Dong/name/lastName/” is shadowed by “Dong/name//”

Answering Prefix Search with Summary Rows
- Once a summary row has been read, ignore the rows shadowed by it
- Example (t=1)
  - Query: name “Dong”, lookup key “Dong/name/*”
  - Query: name “Xin”, lookup key “Xin/name/*”
Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
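
The lookup rule on this slide can be sketched as a scan over the sorted keys that jumps past shadowed rows as soon as a summary row is read. The function name and the lowercased keys are assumptions. The jump relies on two assumptions about the key format: "/" is the separator, and labels start with a letter or digit, so that "p//" sorts before every other row beginning with "p/".

```python
from bisect import bisect_left


def rows_to_read(sorted_keys, prefix):
    """Return the rows a prefix search reads, skipping rows shadowed by a summary row."""
    i = bisect_left(sorted_keys, prefix)
    read = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        key = sorted_keys[i]
        read.append(key)
        if key.endswith("//"):
            # Summary row "p//" shadows every row starting with "p/".
            # "0" is the character right after "/", so "p0" is the first
            # possible key beyond that shadowed range.
            i = bisect_left(sorted_keys, key[:-2] + "0", i)
        else:
            i += 1
    return read


# Rows from the slide, lowercased (an assumed normalization).
KEYS = sorted([
    "alon/name/", "dong/name//", "dong/name/firstName/", "halevy/name/",
    "luna/name/", "semex/title/", "xin/name/lastName/",
])

print(rows_to_read(KEYS, "dong/name/"))  # reads only the summary row
print(rows_to_read(KEYS, "xin/name/"))   # no summary row, reads the single match
```

A more specific search such as "dong/name/firstName/" never encounters the summary row, so the shadowed row is still read directly.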

Adding Summary Rows
- Step 1. Create a summary row for a prefix p if
  - Searching prefix p needs to read more than t rows
  - There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows
- Step 2. Remove row p if summary row p/ exists
- Example (t=1), inverted list rows at each step:
  - Initially: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
  - Then: Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
  - Finally: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
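
One possible reading of the two steps above is a bottom-up pass: visit prefixes from deepest to shallowest, and whenever a prefix search would still read more than t rows, add a summary row holding the union of those rows and drop the plain row for that prefix. This is an interpretation for illustration, not necessarily the authors' exact procedure; the row contents (sets of hypothetical instance ids) and the function names are assumptions.

```python
from bisect import bisect_left


def rows_to_read(sorted_keys, prefix):
    """Rows read by a prefix search, skipping rows shadowed by a summary row."""
    i = bisect_left(sorted_keys, prefix)
    read = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        key = sorted_keys[i]
        read.append(key)
        if key.endswith("//"):
            i = bisect_left(sorted_keys, key[:-2] + "0", i)  # jump past "p/..."
        else:
            i += 1
    return read


def add_summary_rows(rows, t):
    """Return a copy of rows (key -> set of instances) extended with summary rows."""
    rows = {key: set(instances) for key, instances in rows.items()}
    prefixes = set()
    for key in rows:
        parts = key.split("/")[:-1]  # drop the empty piece after the last "/"
        for depth in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:depth]) + "/")
    # Deepest prefixes first, so shallower ones see the summaries below them.
    for prefix in sorted(prefixes, key=lambda p: p.count("/"), reverse=True):
        read = rows_to_read(sorted(rows), prefix)
        if len(read) > t:
            summary = set().union(*(rows[key] for key in read))
            rows.pop(prefix, None)        # Step 2: the summary replaces row p
            rows[prefix + "/"] = summary  # Step 1: summary row, e.g. "dong/name//"
    return rows


# Hypothetical rows, lowercased; values are the instances containing the keyword.
ROWS = {
    "alon/name/": {"person2"},
    "dong/name/": {"person1"},
    "dong/name/firstName/": {"student1"},
    "halevy/name/": {"person2"},
    "luna/name/": {"person1"},
    "semex/title/": {"paper1"},
    "xin/name/lastName/": {"student1"},
}

result = add_summary_rows(ROWS, t=1)
print(sorted(result))
```

With t=1, only the prefix "dong/name/" needs more than one row, so the pass produces the single summary row "dong/name//", mirroring the final state of the slide's example.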

Answering Neighborhood Keyword Queries
Query: Semex
Lookup key: “Semex/*”
[Figure: triple base with “author” and inverse “~author” associations, and the Departmental Database table]
Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/~author/, Semex/title/, Xin/name/lastName/
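
A small sketch of how a neighborhood keyword query might be answered once inverse associations (written "~author" on the slide) are indexed too: a prefix search on "semex/" then returns both the relevant items and the items associated with them. The triples, instance ids, and function names are assumptions, and only a single-keyword query is shown.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Hypothetical triples, as in the running example.
ATTRIBUTES = [
    ("paper1", "title", "Semex: Personal information management"),
    ("person1", "name", "Luna Dong"),
    ("person2", "name", "Alon Halevy"),
]
ASSOCIATIONS = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
]


def tokens(text):
    """Lowercase alphanumeric keywords (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(attributes, associations):
    """Index 'keyword/label/' -> set of instances, including inverse associations."""
    index = defaultdict(set)
    values = defaultdict(list)
    for instance, label, text in attributes:
        values[instance].append(text)
        for keyword in tokens(text):
            index[f"{keyword}/{label}/"].add(instance)
    for source, label, target in associations:
        for text in values[target]:  # forward: paper indexed by author keywords
            for keyword in tokens(text):
                index[f"{keyword}/{label}/"].add(source)
        for text in values[source]:  # inverse: author indexed by paper keywords
            for keyword in tokens(text):
                index[f"{keyword}/~{label}/"].add(target)
    return index


def neighborhood(index, keyword):
    """Union the rows of every key that starts with 'keyword/'."""
    prefix = keyword.lower() + "/"
    keys = sorted(index)
    i = bisect_left(keys, prefix)
    found = set()
    while i < len(keys) and keys[i].startswith(prefix):
        found |= index[keys[i]]
        i += 1
    return found


index = build_index(ATTRIBUTES, ASSOCIATIONS)
print(sorted(neighborhood(index, "Semex")))
```

Here "semex/title/" contributes the relevant paper and "semex/~author/" contributes its two authors as associated items.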

Outline
[x] Motivation
[x] Overview of our approach
[x] Our algorithm
    [x] Indexing structure
    [x] Indexing hierarchies
[>] Experimental Results
[ ] Conclusions

Implementation Details
- Our index extends the Lucene indexing tool
  - Lucene stores an inverted list as a sorted array
- Implemented in Java
- Run on a machine with four 3.2 GHz CPUs with 1024 KB cache, and 1 GB of memory

Experimental Setting
- Data sets
  - A 50 MB personal data set
  - Two 10 GB XML data sets: Wikipedia, XMark Benchmark
- Queries: with one predicate or keyword
  - Predicate Query with leaf attributes
  - Predicate Query with branch attributes
  - Predicate Query with associations
  - Neighborhood Keyword Query
- Measures (in milliseconds)
  - Index-lookup time
  - Query-answering time

Our Indexing Method Significantly Improves Query Answering

                                    Plain Inverted List (10.6 MB)   Extended Inverted List (15.2 MB)
Query Type                          Index Lookup   Query Answer     Index Lookup   Query Answer
Pred Query with leaf attributes     2              22               4              6
Pred Query with branch attributes   3              43               4              6
Pred Query with associations        3              88               6              17
Neighborhood Keyword Query          18             4174             48             97

All times in ms.

XML Index [Kaushik et al., SIGMOD ’05]
- Three indexes
  - Inverted list: index each attribute value on its text
  - Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  - Relationship index: index each instance on its associated instances

Our Indexing Method Performs Better Than XML Indexes

                                    XML Index (28.1 MB)             Extended Inverted List (15.2 MB)
Query Type                          Index Lookup   Query Answer     Index Lookup   Query Answer
Pred Query with leaf attributes     7              9                4              6
Pred Query with branch attributes   7              11               4              6
Pred Query with associations        301            415              6              17
Neighborhood Keyword Query          365            488              48             97

All times in ms.

Our Indexing Method Scales Well

                                    Wikipedia           XMark w/o asso      XMark with asso
Index                               4.15 hr (1.13 GB)   6.64 hr (3.04 GB)   12.72 hr (4.08 GB)
Pred Query with leaf attributes     156                 94                  116
Pred Query with branch attributes   -                   67                  93
Pred Query with associations        -                   -                   217
Neighborhood Keyword Query          1646                1838                13468

Conclusions
- Contributions: An index for heterogeneous data
  - Index heterogeneous data from multiple sources through a (virtual) central triple base
  - Extend inverted lists to capture both texts and structure of data
- Future Work
  - Support value heterogeneity
  - Incorporate approximate matching of schema terms and object instances