Indexing Dataspaces Xin Luna Dong University of Washington
- Slides: 54
Indexing Dataspaces. Xin (Luna) Dong (University of Washington) and Alon Halevy (Google Inc.), SIGMOD 2007. Presented by Aditya Sakhuja.
Hi, I am Aditya Sakhuja, from India
- 1st-semester MS CS, CoC
- Ongoing research: automation of schema matching
- Interested in online-enabled apps (should be path-breaking), IR, databases, security, latest web technologies, skydiving, and scuba diving
Outline
- Motivation (current)
- Overview of our approach
- Our algorithm
  - Indexing structure
  - Indexing hierarchies
- Experimental results
- Conclusions
Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: data sources D1 to D5]
Traditional Data Integration Systems
- Mediated schema: Publication(title, year, author)
- Source relations: Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
- Mapping query:
    SELECT P.title AS title, P.year AS year, A.name AS author
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid
[Figure: data sources D1 to D5 below the mediated schema]
Querying on Traditional Data Integration Systems
[Figure: a query Q posed on the mediated schema, with queries Q1 to Q5 sent to data sources D1 to D5]
In Many Applications It Is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 with a question mark in place of the mediated schema]
Scenario 1. Different Websites About Movies
Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]
Querying Dataspaces
- Dataspaces
  - Collections of heterogeneous data sources
  - Don’t necessarily include semantic mappings
  - Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
- How to effectively query and search a dataspace?
Example Dataspace
<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>
stuID   | lastName | firstName | entryYear
5001438 | Xin      | Dong      | 2001
…       | …        |           |
Searching and Querying a Dataspace
- Structured query?
  - Requires detailed knowledge of the schemas
  - Requires precise attribute values
- Keyword search?
  - Does not allow specifications on structure
- We consider queries that are
  - keyword-based
  - structure-aware
I. Predicate Query
- A query is a conjunction of predicates
- Predicate: (v, {K1, …, Kn})
  - v: an attribute or association label
  - {K1, …, Kn}: a keyword set
[Figure: the publication, thesis-proposal, and student-table data from the "Example Dataspace" slide]
I. Predicate Query
- Example I: (title, ‘Semex’) (author, ‘Luna Dong’)
[Figure: the publication, thesis-proposal, and student-table data from the "Example Dataspace" slide]
I. Predicate Query
- Example II: (name, ‘Dong’)
[Figure: the publication, thesis-proposal, and student-table data from the "Example Dataspace" slide]
II. Neighborhood Keyword Query
- Form: {K1, …, Kn}
- Example: ‘Semex’
  - Relevant items
  - Associated items
[Figure: the publication, thesis-proposal, and student-table data from the "Example Dataspace" slide]
Indexing of the Heterogeneous Data
- Challenges
  - Index data from heterogeneous data sources
  - Capture both text values and structural information
- Traditional indexes
  - Build a separate index for each attribute to support structured queries
  - Build an inverted list to support keyword search
  - XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
- Our approach: extend inverted lists to capture both text values and structure of the data
Contributions
- Design an index that
  - indexes data from heterogeneous data sources
  - captures both structure and text of the data
  - incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations
Outline
- Motivation (done)
- Overview of our approach (current)
- Our algorithm
  - Indexing structure
  - Indexing hierarchies
- Experimental results
- Conclusions
View Data Sources as Triple Base
<publication>
  <title>Semex: Toward …</title>
  <authors>
    <author><name>Xin Dong</name></author>
    <author><name>Alon Halevy</name></author>
  </authors>
  …
</publication>
[Figure: triple-base graph in which "Semex: …" is linked by author associations to "Alon Halevy" and "Luna Dong"; legend: Attribute, Object, Association]
View Data Sources as Triple Base
[Figure: triple-base graph in which "Semex: …" is linked by author associations to "Alon Halevy" and "Luna Dong"]
View Data Sources as Triple Base
[Figure: triple-base graph in which "Semex: …" is linked by author associations to "Alon Halevy" and "Luna Dong"]
Departmental Database:
StuID   | lastName | firstName | …
1000001 | Xin      | Dong      | …
…       | …        | …         | …
- Goal: index triples to efficiently answer queries that combine text and structure
Indexing a Triple Base Using an Inverted List
- Inverted list rows: Alon, Dong, Halevy, Luna, Semex, Xin
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Indexing a Triple Base Using an Inverted List
- Query: Dong
- Inverted list rows: Alon, Dong, Halevy, Luna, Semex, Xin (cells hold a 1 for the instances that contain the keyword)
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
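The plain inverted list on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the instance ids (`a1`, `a2`, `p1`, `s1`), the `TRIPLES` list, and the function name are hypothetical, and keywords are lowercased as a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical toy triple base: (instance_id, attribute_label, text_value).
TRIPLES = [
    ("a1", "name", "Alon Halevy"),
    ("a2", "name", "Luna Dong"),
    ("p1", "title", "Semex"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]


def build_inverted_list(triples):
    """Map each lowercased keyword to the set of instances that contain it."""
    index = defaultdict(set)
    for instance_id, _label, value in triples:
        for keyword in value.lower().split():
            index[keyword].add(instance_id)
    return index


PLAIN_INDEX = build_inverted_list(TRIPLES)
# The keyword query "Dong" finds both the person and the student record,
# but the index cannot tell which attribute the keyword came from.
print(sorted(PLAIN_INDEX["dong"]))
```

Because the attribute label is discarded, a structure-aware query such as (firstName, ‘Dong’) cannot be answered from this index alone, which is the gap the following slides address.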
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (current)
  - Indexing structure (current)
  - Indexing hierarchies
- Experimental results
- Conclusions
Incorporate Attribute Labels in the Inverted List
- Query: firstName “Dong”
- Inverted list rows: Alon, Dong, Halevy, Luna, Semex, Xin
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Incorporate Attribute Labels in the Inverted List
- Query: firstName “Dong”, looked up as “Dong/firstName/”
- Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Incorporate Association Labels in the Inverted List
- Query: author “Dong”
- Inverted list rows: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Incorporate Association Labels in the Inverted List
- Query: author “Dong”, looked up as “Dong/author/”
- Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/author/, Luna/name/, Semex/title/, Xin/name/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
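A minimal sketch of the labeled index described on the last four slides, assuming that an association row such as “Dong/author/” points at the source instance of the association (here the publication). The data, instance ids, and function name are hypothetical, and hierarchy paths are left out to keep the example small.

```python
from collections import defaultdict

# Hypothetical toy data: attribute triples and association triples.
ATTRIBUTES = [  # (instance_id, attribute_label, text_value)
    ("a1", "name", "Alon Halevy"),
    ("a2", "name", "Luna Dong"),
    ("p1", "title", "Semex"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]
ASSOCIATIONS = [  # (source_instance, association_label, target_instance)
    ("p1", "author", "a1"),
    ("p1", "author", "a2"),
]


def build_extended_index(attributes, associations):
    """Index keys have the form 'keyword/label/'."""
    index = defaultdict(set)
    values_by_instance = defaultdict(list)
    for instance_id, label, value in attributes:
        values_by_instance[instance_id].append(value)
        for keyword in value.lower().split():
            index[f"{keyword}/{label}/"].add(instance_id)
    # Assumption: keywords of the associated instance are indexed under the
    # association label and point back to the source instance.
    for source, label, target in associations:
        for value in values_by_instance[target]:
            for keyword in value.lower().split():
                index[f"{keyword}/{label}/"].add(source)
    return index


EXTENDED_INDEX = build_extended_index(ATTRIBUTES, ASSOCIATIONS)
print(sorted(EXTENDED_INDEX["dong/firstName/"]))  # predicate (firstName, 'Dong')
print(sorted(EXTENDED_INDEX["dong/author/"]))     # predicate (author, 'Dong')
```

Each predicate now maps to a single key lookup, at the cost of extra rows for the association labels.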
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (current)
  - Indexing structure (done)
  - Indexing hierarchies (current)
- Experimental results
- Conclusions
Hierarchies of Attributes and Associations
<publication>
  <title>Semex: Toward on-the-fly personal information integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>
stuID   | lastName | firstName | entryYear
5001438 | Xin      | Dong      | 2001
…       | …        |           |
- Example II: (name, ‘Dong’)
- Attribute hierarchy: name, with sub-attributes firstName and lastName
Incorporate Attribute Hierarchy in the Inverted List
- Query: name “Dong”
- Attribute hierarchy: name, with sub-attributes firstName and lastName
- Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Naïve Approach: Expand Queries with Sub-Attributes
- Query: name “Dong”, expanded to “Dong/name/ OR Dong/firstName/ OR …”
- Attribute hierarchy: name, with sub-attributes firstName and lastName
- Inverted list rows: Alon/name/, Dong/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
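The query expansion on this slide can be sketched as follows. The hierarchy dictionary, the toy index, and the function names are hypothetical; the point is only that one lookup is issued per attribute in the subtree, so the cost grows with the size of the hierarchy.

```python
# Hypothetical attribute hierarchy: parent -> direct sub-attributes.
HIERARCHY = {"name": ["firstName", "lastName"]}

# Hypothetical index with keys of the form 'keyword/label/'.
NAIVE_INDEX = {
    "alon/name/": {"a1"},
    "halevy/name/": {"a1"},
    "luna/name/": {"a2"},
    "dong/name/": {"a2"},
    "dong/firstName/": {"s1"},
    "xin/lastName/": {"s1"},
    "semex/title/": {"p1"},
}


def expand(attribute, hierarchy):
    """Return the attribute followed by all of its descendants."""
    labels = [attribute]
    for child in hierarchy.get(attribute, []):
        labels.extend(expand(child, hierarchy))
    return labels


def naive_lookup(index, attribute, keyword, hierarchy):
    """OR together one index lookup per expanded attribute label."""
    hits = set()
    for label in expand(attribute, hierarchy):
        hits |= index.get(f"{keyword.lower()}/{label}/", set())
    return hits


print(expand("name", HIERARCHY))
print(sorted(naive_lookup(NAIVE_INDEX, "name", "Dong", HIERARCHY)))
```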
Approach I: Duplicate Entries for Parent Attributes
- Query: name “Dong”, looked up as “Dong/name/”
- Attribute hierarchy: name, with sub-attributes firstName and lastName
- Inverted list rows: Alon/name/, Dong/firstName/, Dong/name/, Halevy/name/, Luna/name/, Semex/title/, Xin/lastName/, Xin/name/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
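A small sketch of Approach I, assuming the duplication happens at indexing time: every keyword is indexed under its own attribute label and under each ancestor label. The `PARENT` map, the data, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy, stored as child -> parent.
PARENT = {"firstName": "name", "lastName": "name"}

DEPT_ATTRIBUTES = [  # (instance_id, attribute_label, text_value)
    ("a2", "name", "Luna Dong"),
    ("s1", "lastName", "Xin"),
    ("s1", "firstName", "Dong"),
]


def label_and_ancestors(label):
    """Return the label followed by its ancestors up to the hierarchy root."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_duplicated_index(attributes):
    """Index each keyword once per label on the path to the root."""
    index = defaultdict(set)
    for instance_id, label, value in attributes:
        for keyword in value.lower().split():
            for indexed_label in label_and_ancestors(label):
                index[f"{keyword}/{indexed_label}/"].add(instance_id)
    return index


DUPLICATED_INDEX = build_duplicated_index(DEPT_ATTRIBUTES)
# A single lookup now answers the predicate (name, 'Dong').
print(sorted(DUPLICATED_INDEX["dong/name/"]))
```

The lookup stays a single key access, but the index gains one extra row per ancestor, which is the trade-off the next two approaches try to soften.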
Approach II. Concatenate a Keyword with a Hierarchy Path
- Query: name “Dong”, answered by the prefix search “Dong/name/*”
- Attribute hierarchy: name, with sub-attributes firstName and lastName
- Inverted list rows: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
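Since the keys are kept sorted, a prefix search such as “Dong/name/*” reduces to a binary search for the first matching key followed by a sequential scan. The sketch below assumes a sorted in-memory key list; `PATH_POSTINGS`, the instance ids, and the function name are hypothetical.

```python
import bisect

# Hypothetical postings keyed by 'keyword/hierarchy-path/'.
PATH_POSTINGS = {
    "alon/name/": {"a1"},
    "dong/name/": {"a2"},
    "dong/name/firstName/": {"s1"},
    "halevy/name/": {"a1"},
    "luna/name/": {"a2"},
    "semex/title/": {"p1"},
    "xin/name/lastName/": {"s1"},
}
PATH_KEYS = sorted(PATH_POSTINGS)  # sorted array of row keys


def path_prefix_search(prefix):
    """Return (matching instances, number of rows read) for a prefix."""
    start = bisect.bisect_left(PATH_KEYS, prefix)
    hits, rows_read = set(), 0
    for key in PATH_KEYS[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        hits |= PATH_POSTINGS[key]
        rows_read += 1
    return hits, rows_read


hits, rows_read = path_prefix_search("dong/name/")
print(sorted(hits), rows_read)
```

The number of rows read grows with the number of sub-attributes under the queried label, which motivates the summary rows of Approach III.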
Approach III. Hierarchy Path + Summary Rows
- Query: name “Dong”, answered by the prefix search “Dong/name/*”
- Attribute hierarchy: name, with sub-attributes firstName and lastName
- Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
[Figure: the triple base and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
Summary Rows
- Goal: given a threshold t, answer any prefix search by reading no more than t rows.
- Definition:
  - The indexed keyword of a summary row is p//, e.g. “Dong/name//”
  - Rows starting with p/ are shadowed by the summary row p//, e.g. “Dong/name/lastName/” is shadowed by “Dong/name//”
Answering Prefix Search with Summary Rows
- Once a summary row has been read, ignore the rows shadowed by it
- Example (t=1): Query: name “Dong”, answered by the prefix search “Dong/name/*”
- Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Answering Prefix Search with Summary Rows
- Once a summary row has been read, ignore the rows shadowed by it
- Example (t=1): Query: name “Xin”, answered by the prefix search “Xin/name/*”
- Inverted list rows: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Adding Summary Rows
- Step 1. Create a summary row for a prefix p if
  - searching prefix p needs to read more than t rows, and
  - there is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows
- Step 2. Remove row p if summary row p/ exists
- Example (t=1), inverted list rows: Alon/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Adding Summary Rows (continued, same two steps)
- Example (t=1), inverted list rows: Alon/name/, Dong/name/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
Adding Summary Rows (continued, same two steps)
- Example (t=1), inverted list rows after adding the summary row: Alon/name/, Dong/name//, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/title/, Xin/name/lastName/
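One way to read the two steps is as a bottom-up pass over the prefixes: deeper prefixes are handled first, so that a shallower prefix only counts rows that are not already shadowed. The sketch below follows that reading under several assumptions that are not taken from the slides: prefixes are cut at "/" boundaries, a summary row stores the union of the rows it shadows, and a linear scan stands in for the sorted-array lookup. All names and data are hypothetical.

```python
# Hypothetical postings keyed by 'keyword/hierarchy-path/'.
BASE_POSTINGS = {
    "alon/name/": {"a1"},
    "dong/name/": {"a2"},
    "dong/name/firstName/": {"s1"},
    "halevy/name/": {"a1"},
    "luna/name/": {"a2"},
    "semex/title/": {"p1"},
    "xin/name/lastName/": {"s1"},
}


def rows_to_read(keys, prefix):
    """Rows matching the prefix, minus rows shadowed by a matching summary row."""
    matching = sorted(k for k in keys if k.startswith(prefix))
    summaries = [k for k in matching if k.endswith("//")]

    def shadowed(key):
        # Summary row 'p//' shadows every other row that starts with 'p/'.
        return any(key != s and key.startswith(s[:-1]) for s in summaries)

    return [k for k in matching if not shadowed(k)]


def add_summary_rows(postings, t):
    """Add a summary row wherever a prefix search would read more than t rows."""
    index = dict(postings)
    prefixes = set()
    for key in postings:
        parts = key.rstrip("/").split("/")
        for depth in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:depth]))
    # Deepest prefixes first, so shallower ones see the rows already shadowed.
    for p in sorted(prefixes, key=lambda s: (-s.count("/"), s)):
        needed = rows_to_read(index, p + "/")
        if len(needed) > t:
            index[p + "//"] = set().union(*(index[k] for k in needed))
            index.pop(p + "/", None)  # the summary row replaces the exact row
    return index


def summary_prefix_search(index, prefix):
    """Return (matching instances, number of rows read)."""
    rows = rows_to_read(index, prefix)
    return set().union(*(index[k] for k in rows)), len(rows)


SUMMARY_INDEX = add_summary_rows(BASE_POSTINGS, t=1)
print(sorted(SUMMARY_INDEX))
print(summary_prefix_search(SUMMARY_INDEX, "dong/name/"))
```

With t=1 the only prefix that needs a summary row in this toy data is “dong/name”, which mirrors the example on the slides; the shadowed row “dong/name/firstName/” stays in the index so that the more specific predicate can still be answered.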
Answering Neighborhood Keyword Queries
- Query: Semex, answered by the prefix search “Semex/*”
- Inverted list rows: Alon/author/, Alon/name/, Dong/author/, Dong/name/firstName/, Halevy/name/, Luna/name/, Semex/~author/, Semex/title/, Xin/name/lastName/
[Figure: the triple base with author and inverse ~author associations, and the Departmental Database (StuID, lastName, firstName, …; 1000001, Xin, Dong, …) next to the inverted list]
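A neighborhood keyword query can be sketched as a prefix search on the keyword alone, with the matched rows split by label type: attribute rows give the relevant items and association rows give the associated items. The label set, postings, and function name below are hypothetical, and the split into two result sets is an illustrative choice rather than something stated on the slide.

```python
import bisect

# Hypothetical set of labels that denote associations (including inverses).
ASSOCIATION_LABELS = {"author", "~author"}

# Hypothetical postings keyed by 'keyword/label/'.
NEIGHBOR_POSTINGS = {
    "alon/author/": {"p1"},
    "alon/name/": {"a1"},
    "dong/author/": {"p1"},
    "dong/name/": {"a2"},
    "semex/title/": {"p1"},
    "semex/~author/": {"a1", "a2"},
}
NEIGHBOR_KEYS = sorted(NEIGHBOR_POSTINGS)


def neighborhood_query(keyword):
    """Return (relevant instances, associated instances) for one keyword."""
    prefix = keyword.lower() + "/"
    relevant, associated = set(), set()
    start = bisect.bisect_left(NEIGHBOR_KEYS, prefix)
    for key in NEIGHBOR_KEYS[start:]:
        if not key.startswith(prefix):
            break
        label = key.split("/")[1]  # first label after the keyword
        target = associated if label in ASSOCIATION_LABELS else relevant
        target.update(NEIGHBOR_POSTINGS[key])
    return relevant, associated


relevant, associated = neighborhood_query("Semex")
print(sorted(relevant), sorted(associated))
```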
Outline
- Motivation (done)
- Overview of our approach (done)
- Our algorithm (done)
  - Indexing structure (done)
  - Indexing hierarchies (done)
- Experimental results (current)
- Conclusions
Implementation Details
- Our index extends the Lucene indexing tool
  - Lucene stores an inverted list as a sorted array
- Implemented in Java
- Run on a machine with four 3.2 GHz CPUs (1024 KB cache) and 1 GB of memory
Experimental Setting
- Data sets
  - A 50 MB personal data set
  - Two 10 GB XML data sets: Wikipedia, XMark Benchmark
- Queries: with one predicate or keyword
  - Predicate query with leaf attributes
  - Predicate query with branch attributes
  - Predicate query with associations
  - Neighborhood keyword query
- Measures (in milliseconds)
  - Index-lookup time
  - Query-answering time
Our Indexing Method Significantly Improves Query Answering
Times in ms. Plain = Plain Inverted List (10.6 MB); Extended = Extended Inverted List (15.2 MB).
Query Type                       | Plain: Index Lookup | Plain: Query Answer | Extended: Index Lookup | Extended: Query Answer
Pred Query with leaf attributes   | 2                   | 22                  | 4                      | 6
Pred Query with branch attributes | 3                   | 43                  | 4                      | 6
Pred Query with associations      | 3                   | 88                  | 6                      | 17
Neighborhood Keyword Query        | 18                  | 4174                | 48                     | 97
XML Index [Kaushik et al., SIGMOD’05]
- Three indexes
  - Inverted list: index each attribute value on its text
  - Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  - Relationship index: index each instance on its associated instances
Our Indexing Method Performs Better Than XML Indexes
Times in ms. XML = XML Index (28.1 MB); Extended = Extended Inverted List (15.2 MB).
Query Type                       | XML: Index Lookup | XML: Query Answer | Extended: Index Lookup | Extended: Query Answer
Pred Query with leaf attributes   | 7                 | 9                 | 4                      | 6
Pred Query with branch attributes | 7                 | 11                | 4                      | 6
Pred Query with associations      | 301               | 415               | 6                      | 17
Neighborhood Keyword Query        | 365               | 488               | 48                     | 97
Our Indexing Method Scales Well
Query times in ms; "-" marks a query type with no reported value.
                                  | Wikipedia         | XMark w/o asso    | XMark with asso
Index                             | 4.15 hr (1.13 GB) | 6.64 hr (3.04 GB) | 12.72 hr (4.08 GB)
Pred Query with leaf attributes   | 156               | 94                | 116
Pred Query with branch attributes | -                 | 67                | 93
Pred Query with associations      | -                 | -                 | 217
Neighborhood Keyword Query        | 1646              | 1838              | 13468
Conclusions
- Contributions: an index for heterogeneous data
  - Index heterogeneous data from multiple sources through a (virtual) central triple base
  - Extend inverted lists to capture both text and structure of the data
- Future work
  - Support value heterogeneity
  - Incorporate approximate matching of schema terms and object instances