Structure • Query Processing – Data models – Query models – Approaches – Challenges • Keyword query processing on RDF • Structured query processing on the Web – Routing needs to linked data sources – Linked data query processing

Query Processing

Query Processing Matching Query Data

Data / Data Models
• Textual
  – Bag-of-words
  – Represent documents, text in structured data, …, real-world objects (captured as structured data)
  – Miss "structured information"
    • in text, e.g. linguistic structure, hyperlinks, (positional information)
    • in structured data
Example text: "In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using Greenplum ……"
Example terms (with statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, ……
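
The bag-of-words model on this slide can be illustrated with a minimal Python sketch. The tokenizer (lower-casing, splitting on non-alphanumeric characters) and the function name `bag_of_words` are assumptions made for illustration, not part of the slides.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lower-case the text, split it into word tokens and count them.

    Word order and positions are discarded, which is exactly the
    "structured information" the bag-of-words model misses.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Two sentences taken from the example text on the slide.
text = ("promising solutions for the management of big data have emerged. "
        "Existing industry solutions are able to support complex queries")
bag = bag_of_words(text)
print(bag["solutions"], bag["management"])
```

Two texts that contain the same words in a different order produce the same bag, so phrase and positional information cannot be recovered from it.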

Data / Data Models
• Textual
• Structured
  – Resource Description Framework (RDF)
  – Represent real-world objects, services, applications, …, documents
  – Resource attribute values and relationships between resources
  – Schema

Data / Data Models • Textual • Structured • Hybrid – Textual and structured data

Query / Query Models • Unstructured • Fully-structured • Hybrid: unstructured + structured

Query / Query Models • Unstructured – Natural language (NL) – Keywords, e.g. book price 30

Query / Query Models • Unstructured • Fully-structured – SQL: select, from, where • SELECT title, price FROM Books WHERE Price < 30

Query / Query Models
• Unstructured
• Fully-structured
  – SQL: select, from, where
  – SPARQL: BGP, filter, optional, union, select, construct, ask, describe
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX ns: <http://example.org/ns#>
    SELECT ?title ?price
    WHERE {
      { ?x dc:title ?title .
        OPTIONAL { ?x ns:price ?price . FILTER (?price < 30) } }
      UNION
      { ?book dc:title ?title .
        ?book dc:creator ?author }
    }

Query / Query Models • Unstructured • Fully-structured – SQL – SPARQL – Conjunctive queries, e.g., graph patterns (BGP)

Query / Query Models • Fully-structured • Unstructured • Hybrid: content and structure constraints

Query Processing • Matching queries against data

Approaches – Taxonomy (1)
Matching queries against data
• Complete
• Sound
• Approximate
  – Not complete
  – Not sound
• Ranked
  – Best effort
  – Top-k
Query processing focuses on efficiency whereas ranking deals with result quality!

Approaches – Taxonomy (2)
• Unstructured query, textual data: keyword query on textual data (standard IR)
• Structured query, textual data: structured query on textual data
• Unstructured query, structured data: keyword query on structured data
• Structured query, structured data: structured query on structured data (standard DB)
• In between: hybrid query (XML IR)

Keyword Query / Textual Data
• Retrieve documents
• Inverted list (inverted index): keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
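
A minimal sketch of keyword retrieval with an inverted index and AND-semantics follows. The scoring function (total number of keyword occurrences) and the names `build_inverted_index` and `and_query` are assumptions for illustration. For brevity the sketch intersects complete posting lists and cuts off afterwards; a real top-k join would stop reading the lists early.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a posting list {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def and_query(index, keywords, k):
    """Return the k best documents that contain every keyword."""
    postings = [index.get(kw.lower(), {}) for kw in keywords]
    if not postings:
        return []
    # AND-semantics: intersect the document ids of all posting lists.
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)
    # Illustrative score: total number of keyword occurrences.
    scored = [(sum(len(p[d]) for p in postings), d) for d in common]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:k]]

docs = {
    "doc1": "book price book",
    "doc2": "book store",
    "doc3": "cheap book price",
}
index = build_inverted_index(docs)
print(and_query(index, ["book", "price"], 2))
```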

Structured Query / Structured Data
• Retrieve data for triple patterns
  – Index on tables
  – Multiple "redundant" indexes to cover different access patterns (e.g. SP-index, PO-index)
• Join (conjunction of triples)
  – Blocking, e.g. linear merge join (requires sorted input)
  – Non-blocking, e.g. symmetric hash join
  – Materialized join indexes
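
The role of redundant triple indexes can be sketched as follows. The data, the predicate name `worksAt` and the function `people_working_at` are hypothetical; the join shown is a simple index nested-loop join, chosen for brevity rather than taken from the slides.

```python
from collections import defaultdict

triples = [
    ("per1", "name", "John McCarthy"),
    ("per1", "worksAt", "uni1"),
    ("per2", "worksAt", "uni1"),
    ("uni1", "name", "Stanford University"),
]

# Two redundant indexes covering different access patterns.
sp_index = defaultdict(set)  # (subject, predicate) -> objects
po_index = defaultdict(set)  # (predicate, object) -> subjects
for s, p, o in triples:
    sp_index[(s, p)].add(o)
    po_index[(p, o)].add(s)

def people_working_at(university_name):
    """Evaluate: ?u name <university_name> . ?x worksAt ?u . ?x name ?n"""
    results = []
    for u in po_index.get(("name", university_name), ()):   # PO lookup
        for x in po_index.get(("worksAt", u), ()):          # PO lookup
            for n in sp_index.get((x, "name"), ()):         # SP lookup
                results.append((x, n))
    return sorted(results)

print(people_working_at("Stanford University"))
```

Each triple pattern is answered by the index whose key matches its bound positions, which is why several overlapping indexes are kept.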

Keyword Query / Structured Data
• Retrieve keyword elements
  – Using inverted index: keyword → {<el1, score, ...>, <el2, score, ...>, …}
• Exploration / "Join"
  – Data indexes for triple lookup
  – Materialized index (paths up to graphs)
  – Top-k Steiner tree search, top-k subgraph exploration

References
• Günter Ladwig, Thanh Tran: Combining Query Translation with Query Answering for Efficient Keyword Search. ESWC 2010: 288-303
• Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. ICDE 2009: 405-416
• Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. SIGMOD 2008: 903-914
• Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer: Ontology-Based Interpretation of Keywords for Semantic Search. ISWC/ASWC 2007: 523-536
• Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD 2007: 305-316
• Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar: Bidirectional Expansion For Keyword Search on Graph Databases. VLDB 2005: 505-516

Structured Query / Textual Data
• Based on offline IE (see Peter's slides)
• Based on online IE, i.e., "retrieve" is as follows
  – Derive keywords to retrieve relevant documents
  – On-the-fly information extraction, i.e., phrase pattern matching "X title Y"
  – Retrieve extracted data for structured part
  – Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
• Index
  – Inverted index for document retrieval and pattern matching
  – Join index: inverted index for storing materialized joins between keywords
  – Neighborhood indexes for phrase patterns
• Hybrid case

References
• Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni: Structured Querying of Web Text Data: A Technical Challenge. CIDR 2007: 225-234
• S. Chakrabarti, K. Puniyani, S. Das: Optimizing scoring functions and indexes for proximity search in type-annotated corpora. WWW 2006: 717-726
• S. Agrawal, K. Chakrabarti, S. Chaudhuri, V. Ganti: Scalable ad-hoc entity extraction from text collections. PVLDB 1(1), 2008
• M. J. Cafarella: Extracting and querying a comprehensive web database. CIDR 2009
• G. Ramakrishnan, S. Balakrishnan, S. Joshi: Entity annotation using inverse index operations. EMNLP 2006
• M. Cafarella, O. Etzioni: A search engine for natural language applications. WWW 2006

Query Processing – Main Tasks
• Retrieval
  – Documents, data elements, triples, paths, graphs
  – Inverted index, …, but also other indexes (B+ tree)
  – Index documents, triples, materialized join paths
• Join
  – Different join implementations; efficiency depends on availability of indexes
  – Non-blocking join is good for early result reporting and for the "unpredictable" linked data scenario

Query Processing – More Tasks
• Disjunction, aggregation, grouping
• Join order optimization
• Approximate
  – Approximate the search space
  – Retrieve only some results
  – Approximate the join
• Parallelization
• Top-k
  – Use only some entries in the input streams to produce k results
• Multiple sources
  – On-the-fly mapping, similarity join
  – Federation, routing
• Hybrid
  – Join text and data
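
The top-k idea (read only a prefix of each input stream) can be sketched with a threshold-style algorithm in the spirit of Fagin's threshold algorithm. This is one well-known technique chosen for illustration; the slides do not say which top-k method is meant. The sketch assumes non-negative scores, summation as the aggregation function, and a score of 0 for items missing from a list.

```python
import heapq

def top_k_threshold(lists, k):
    """lists: one dict item -> score per input. Returns (top-k, depth read)."""
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}   # item -> aggregated score
    depth = 0
    max_depth = max(len(sl) for sl in sorted_lists)
    while depth < max_depth:
        last_scores = []
        for sl in sorted_lists:
            if depth < len(sl):
                item, score = sl[depth]          # sorted access
                last_scores.append(score)
                if item not in seen:
                    # Random access to the other lists for the full score.
                    seen[item] = sum(l.get(item, 0.0) for l in lists)
            else:
                last_scores.append(0.0)          # list exhausted
        depth += 1
        # No unseen item can score higher than this threshold.
        threshold = sum(last_scores)
        best = heapq.nlargest(k, seen.values())
        if len(best) == k and best[-1] >= threshold:
            break
    ranked = sorted(seen.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k], depth

a = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.05}
b = {"d2": 0.9, "d1": 0.7, "d4": 0.2, "d3": 0.1}
result, depth = top_k_threshold([a, b], 1)
print(result[0][0], depth)
```

In this example the best item is confirmed after reading two of the four entries in each list.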

Query Processing on the Web: Research Challenges and Opportunities
• Large amount of semantic data
• Data inconsistent, redundant, and low quality
• Large amount of data embedded in text
• Large amount of sources
• Large amount of links between sources
• Optimization, parallelization
• Approximation
• Hybrid querying and data management
• Federation, routing
• Online schema mappings
• Similarity join

Approaches
• Unstructured query, textual data: keyword query on textual data (standard IR)
• Structured query, textual data: structured query on textual data (DB-IR)
• Unstructured query, structured data: keyword query on structured data (IR-DB). Covered here: search space approximation
• Structured query, structured data: structured query on structured data (standard DB). Covered here: routing, approximation, adaptive optimization

Keyword Query Processing on Graph-Structured RDF Data

Keyword Search in DBs / Keyword Translation (Kacholia et al., VLDB 05)
User information need: „stanford article turing award“
• Keywords might produce a large number of matching elements in the data graphs
• The data graphs might be large in size
• Search complexity increases substantially with the size of the data graphs
• Large number of results
[Figure labels: Specification, Translation]

Query Space (Tran et al., ICDE 2009)
• Main idea
  § Exploration on a much reduced summary model called query space
  § Substantially decrease complexity
  § Top-k procedure for graph exploration to compute only the top-k most relevant results
• Query space = connecting keyword elements with schema elements
  – Schema graph derived from data graph
  – Query space: more compact representation of the data graph
• Online construction of query space out of schema graph
  – Match keywords against labels of resources to find keyword elements
  – Connect keyword elements with elements of schema graph to obtain query space
• Online top-k query graph exploration

Top-k Query Graph Exploration on Query Space
[Figure: query space, three paths from keyword matching elements, and costs of elements]
• Cost-directed exploration of minimal Steiner graphs
• Explore all possible distinct paths starting from keyword elements
• At each exploration, take the current path with the lowest cost
• When a connecting element is found, merge paths to obtain a candidate
• Top-k terminates when the highest cost in the candidate list (the cost of the k-ranked query graph) < lowest possible cost that can be achieved with paths in the queues
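
The exploration loop above can be sketched as follows. This is a simplified illustration under several assumptions that are not in the slides: costs sit on edges rather than elements, only the cheapest path per keyword and element is kept (the slide explores all distinct paths), a single priority queue is used, and a candidate's cost is the sum of its path costs. The graph, the element names and `top_k_connections` are hypothetical.

```python
import heapq

def undirected(edges):
    """Build an adjacency map node -> {neighbour: cost} from (a, b, cost)."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def top_k_connections(graph, keyword_elements, k):
    """Return up to k (cost, connecting element, paths) candidates."""
    n = len(keyword_elements)
    queue = [(0.0, i, start, (start,)) for i, start in enumerate(keyword_elements)]
    heapq.heapify(queue)
    reached = {}      # element -> {keyword index: (cost, path)}
    candidates = []   # (total cost, connecting element, paths)
    while queue:
        candidates.sort(key=lambda c: c[0])
        # Terminate when the k-th candidate beats every path still queued.
        if len(candidates) >= k and candidates[k - 1][0] < queue[0][0]:
            break
        cost, i, node, path = heapq.heappop(queue)   # cheapest path first
        by_keyword = reached.setdefault(node, {})
        if i in by_keyword:
            continue  # keyword i already reached this element more cheaply
        by_keyword[i] = (cost, path)
        if len(by_keyword) == n:
            # Connecting element: merge one path per keyword into a candidate.
            total = sum(c for c, _ in by_keyword.values())
            candidates.append((total, node, [by_keyword[j][1] for j in range(n)]))
        for neighbour, edge_cost in graph[node].items():
            if neighbour not in path:
                heapq.heappush(queue, (cost + edge_cost, i, neighbour, path + (neighbour,)))
    candidates.sort(key=lambda c: c[0])
    return candidates[:k]

graph = undirected([
    ("uni1", "per2", 1), ("per2", "per1", 1), ("per1", "pub1", 1),
    ("per1", "per3", 1), ("per3", "prize1", 1),
])
best = top_k_connections(graph, ["uni1", "pub1", "prize1"], 1)
print(best[0][0], best[0][1])
```

Any candidate found later must use at least one path that is still in the queue, so its cost cannot be lower than the cheapest queued path; that is what makes the termination test safe.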

Structured Query Processing on Graph-Structured RDF Data

Query Processing
• Structured query: conjunctive queries
  – Evaluating conjunctive queries on graph-structured data amounts to the task of graph-pattern matching
  § A solution for determining matches requires exponential time
  § Search complexity increases substantially with the size of the graph
  § The size of the graph is very large on the Web of linked data
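
Graph-pattern matching for a conjunctive query can be sketched with a small backtracking evaluator. The triples, the `?`-prefix convention for variables and the function names are assumptions for illustration; the nested scan over all triples also shows why the cost grows quickly with the number of patterns and the size of the graph.

```python
def is_var(term):
    """Variables are written with a leading question mark, e.g. ?x."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding

def evaluate(patterns, triples, binding=None):
    """Yield every binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match_pattern(first, triple, binding)
        if extended is not None:
            yield from evaluate(rest, triples, extended)

triples = [
    ("alice", "knows", "bob"),
    ("carol", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("alice", "worksAt", "dbpedia:KIT"),
]
query = [
    ("?x", "worksAt", "dbpedia:KIT"),
    ("?x", "knows", "?y"),
    ("?y", "name", "?n"),
]
answers = list(evaluate(query, triples))
print(answers)
```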

Answer Space (Tran et al., SemData@VLDB 2010)
§ Summary model for general graph data
§ Structure-based data partitioning to store data that share structures
§ Structure-aware processing to filter candidates and prune queries using a smaller answer space
• Construction of answer space is based on bisimulation
• Answer space
  – Comprises classes (extensions) and relations between them
  – Resources in an extension exhibit the same structure, i.e., they have the same (incoming and outgoing) paths
  – Is a structural description more fine-granular than a schema
[Figure: an extended example of the data graph and the resulting answer space]
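
How bisimulation groups resources with the same incoming and outgoing structure can be sketched with naive partition refinement. This is a generic forward-and-backward bisimulation sketch, not necessarily the exact variant used for the answer space; the graph and the function name are hypothetical.

```python
def bisimulation_partition(nodes, edges):
    """edges: list of (source, label, target). Returns node -> block id.

    Start with one block and repeatedly split blocks whose members differ
    in the (label, neighbour block) pairs of their outgoing or incoming edges.
    """
    block = {n: 0 for n in nodes}
    while True:
        signature = {}
        for n in nodes:
            out_sig = frozenset((l, block[t]) for s, l, t in edges if s == n)
            in_sig = frozenset((l, block[s]) for s, l, t in edges if t == n)
            signature[n] = (block[n], out_sig, in_sig)
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        # Refinement only ever splits blocks, so an unchanged count means stable.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

nodes = ["per1", "per2", "per3", "pub1", "pub2", "pub3", "uni1"]
edges = [
    ("pub1", "author", "per1"),
    ("pub2", "author", "per2"),
    ("pub3", "author", "per3"),
    ("uni1", "employ", "per2"),
]
blocks = bisimulation_partition(nodes, edges)
print(blocks)
```

Here `per1` and `per3` fall into one extension, while `per2` is separated because it additionally has an incoming `employ` edge; this in turn separates `pub2` from `pub1` and `pub3`, which is the sense in which the summary is finer than a schema.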

Structure-aware Matching Using Answer Space
• Match query against answer space
  – Answer space matches contain elements satisfying the query structure
• Focus on answer space matches to compute final answers
  – Prune query parts containing non-distinguished variables only
  – Match remaining query against data graph (i.e., focus on elements in the answer space matches identified and loaded before)
• Advantages: reduction in I/O cost and in the number of unions & joins
[Figure: the answer space and an example query]

Query Processing on the Linked Data Web

Query Processing on the Web
• Routing
  – Find combinations of sources
• Federation
  – Query parts to sources
  – Combining results from different sources
• Online schema mappings
• Similarity join

Linked Data
• More data, more links
• 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links (as of 09-2010)
• Plus other linked data not covered by the LOD cloud

Challenges
Example need: "Articles from awarded researchers at Stanford"
• Large number of unknown, unprocessed & irrelevant sources!
  – What is in there?
  – What is out there?
  – What is relevant?
• Formulating queries is a hard task! (USABILITY)
  – Which data sources?
  – Which schema elements?
• Processing queries is expensive! (SCALABILITY)
  – Process against all data sources?

Searching Linked Data
• Given the needs (expressed as sets of keywords),
  – are there answers in processed linked data?
  – what combinations of data sources produce them?
  – how to incorporate related unprocessed linked sources?
• Keyword Query Routing
  § Identify valid combinations of sources
  § Identify schema elements
  § Let user choose combination of sources
• Linked Data Query Processing
  § Focus on this combination of sources and related linked sources

Keyword Query Routing (Tran et al., ISWC 2010)

Keyword Query Routing • Linked data (schema and data are linked) • Routing based on keywords • Find combinations of sources

LOD Data Graph
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph
[Figure: interlinked data graphs of Freebase, DBLP and DBPedia. Nodes include uni1, pub1 to pub3, per1 to per4, prize1 and prize2, with literals such as "Stanford University", "John McCarthy", "John Mccarthy", "John Smith", "Turing Award" and "Music Award"; edge labels are employ, author, title, name, label, prizes and sameAs.]

LOD Schema Graph
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph
[Figure: schema graphs. Freebase: University, Written Work, Person (edges employ, author); DBLP: Article, Author (edge author); DBPedia: Person, Prize (edge prizes); sameAs edges link the sources.]

LOD Source Graph
• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph
[Figure: source graph with the nodes Freebase, DBLP and DBPedia; edge labels author and sameAs.]

Keyword Query Answers
User information need: „stanford … award“
[Figure: the LOD data graph of Freebase, DBLP and DBPedia (uni1, pub1 to pub3, per1 to per4, prize1, prize2, type Article) with the answers to the keyword query marked.]

Problem Definition
• A keyword query result (also called Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called keyword element), and these elements are pairwise connected over a path.
§ A d-max Steiner graph is a Steiner graph where every path between keyword elements has length d-max or less.
§ Keyword query routing: compute a valid set of data sources, called keyword routing plan. A plan is valid if its sources can be combined to produce non-empty keyword query results.

A Valid Keyword Routing Plan
User information need: „stanford … award“
[Figure: the LOD data graph of Freebase, DBLP and DBPedia (uni1, pub1 to pub3, per1 to per4, prize1, prize2, type Article) with the sources of a valid routing plan marked.]

The Search Space
• Multi-level inter-relationship graphs capture the entire search space
• Relationships between elements and between different levels
§ A solution: apply existing approaches to keyword search for computing Steiner graphs
§ Steiner graphs might span several linked sources
§ Search space grows exponentially with the number of sources and their associated links
§ Search space is too large!

Keyword Sets
• One keyword set for every data source
• Elements stand for distinct keywords mentioned in a source
[Figure: the LOD data graph of Freebase, DBLP and DBPedia annotated with the keywords of each source, e.g. Stanford, John, McCarthy, Award.]

Element-level Keyword-Element Relationship Graph (E-KERG)
• A keyword-element captures a keyword k and the data element mentioning k
• A relationship between two keyword-elements exists iff there is a path between their associated data elements
• In d-max KERG, the paths to be considered have length d-max or less
[Figure: the LOD data graph of Freebase, DBLP and DBPedia with keyword-elements such as (Stanford, uni1), (John, per1), (McCarthy, per3), (Award, prize1), (Award, prize2) and (Turing, prize1).]

Schema-level Keyword-Element Relationship Graph (S-KERG)
• A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
• Groups elements (relationships) when they capture the same keyword (relationships between the same classes)
[Figure: the LOD data graph of Freebase, DBLP and DBPedia with data elements grouped by their classes: University, Person, Article, Author, Prize.]

Data-Source-level Keyword-Element Relationship Graph (D-KERG)
• A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated sources
• Groups elements (relationships) when they capture the same keyword (relationships between the same sources)
[Figure: the LOD data graph of Freebase, DBLP and DBPedia with data elements grouped by their sources.]

Routing Plan Computation
• Keyword sets
  – Retrieve elements for every keyword k in K
  – Retrieve associated sources and put them into SK
  – Compute all |K|-combinations of SK (KRPs)
§ KERG models
  § Compute all 2-combinations of K to get all keyword pairs
  § Retrieve matching KERG relationships for each pair and join them to produce matching subgraphs (KRPs)
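
Both computations can be sketched on a toy source-level model. The keyword sets, the relationship set `kerg` and the function names are hypothetical, and the sketch reads "|K|-combinations" as choosing one source per keyword; it also treats a plan as valid only if every keyword pair is related, following the pairwise-connected definition of a keyword query result.

```python
from itertools import combinations, product

# One keyword set per source (hypothetical content).
keyword_sets = {
    "Freebase": {"stanford", "john"},
    "DBLP": {"john"},
    "DBPedia": {"john", "award"},
}

# Source-level relationships between keyword-elements (keyword, source).
kerg = {
    frozenset({("stanford", "Freebase"), ("john", "Freebase")}),
    frozenset({("stanford", "Freebase"), ("john", "DBLP")}),
    frozenset({("john", "DBLP"), ("award", "DBPedia")}),
    frozenset({("stanford", "Freebase"), ("award", "DBPedia")}),
}

def candidate_plans(keywords, keyword_sets):
    """Keyword-set approach: pick one source per keyword."""
    per_keyword = [
        sorted(src for src, kws in keyword_sets.items() if kw in kws)
        for kw in keywords
    ]
    return [dict(zip(keywords, choice)) for choice in product(*per_keyword)]

def valid_plans(keywords, keyword_sets, kerg):
    """KERG approach: keep plans whose keyword pairs are all related."""
    plans = []
    for plan in candidate_plans(keywords, keyword_sets):
        pairs = combinations(keywords, 2)   # all 2-combinations of K
        if all(frozenset({(a, plan[a]), (b, plan[b])}) in kerg for a, b in pairs):
            plans.append(plan)
    return plans

keywords = ["stanford", "john", "award"]
print(len(candidate_plans(keywords, keyword_sets)))
print(valid_plans(keywords, keyword_sets, kerg))
```

The keyword sets alone yield three candidate plans, but only one of them survives the relationship check, which illustrates why the KERG models produce fewer invalid plans.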

Mixed, Corrective and Stream-based Linked Data Query Processing

Structured Query Processing on the Web
• Linked data (schema and data are linked)
• Federation
  – Query parts to sources
  – Combining results from different sources
• Exploration
• Mixed

Top-down Query Evaluation (Harth et al., WWW 2010)
• Local index of sources, assumed to be complete
  – Used for source selection
  – Maps triple and join patterns to source URIs
• Statistics for ranking of sources and query optimization
  – Performed once at compile-time
  – Only a fixed number of top-ranked sources is considered
• No run-time discovery
• Fast, only relevant sources are retrieved
• Not up-to-date
• Index size may become very large

Bottom-up Query Evaluation (Hartig et al., ISWC 2009)
• Sources discovered at run-time through links from other, already retrieved sources
• No local index of sources
• Slower, as unnecessary sources are retrieved
• Always up-to-date
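
Run-time source discovery can be sketched with a breadth-first traversal over a simulated Web. The dictionary `web` stands in for dereferencing URIs over HTTP, and all URIs, predicates and the function `traverse` are hypothetical; real link-traversal engines interleave this with join processing.

```python
from collections import deque

def traverse(web, seed_uris, predicate):
    """Follow links from the seeds; collect triples with `predicate`."""
    queue = deque(seed_uris)
    seen = set(seed_uris)
    retrieved, matches = [], []
    while queue:
        uri = queue.popleft()
        retrieved.append(uri)                    # simulated source retrieval
        for s, p, o in web.get(uri, []):
            if p == predicate:
                matches.append((s, p, o))
            for term in (s, o):
                # Any URI that names another source is a link to follow.
                if term in web and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return matches, retrieved

web = {
    "ex:alice": [("ex:alice", "knows", "ex:bob"), ("ex:alice", "name", "Alice")],
    "ex:bob": [("ex:bob", "name", "Bob"), ("ex:bob", "seeAlso", "ex:news")],
    "ex:news": [("ex:news", "title", "Unrelated")],
}
matches, retrieved = traverse(web, ["ex:alice"], "name")
print(matches)
print(retrieved)
```

No index is needed and the data is always fresh, but `ex:news` is retrieved although it contributes nothing, which is the cost named on the slide.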

Mixed Strategy (Ladwig et al., ISWC 2010)
• Combination of top-down and bottom-up strategies
  – Partial local index of sources, not assumed to be complete
  – New sources are discovered at run-time
• Addresses volume and dynamics of Linked Data
• Corrective Source Ranking
  – Deal with heterogeneous source descriptions
• Stream-based Query Processing
  – Deal with unpredictable nature of Linked Data access

Stream-based Query Processing
• Compile-time
  – Construct query plan
  – Probe local index for sources
• Run-time
  – Retrieve sources
  – Push data into query plan
  – Discover new sources
  – Rank sources
• Network latency
  – Do not block!
  – Evaluation driven by incoming data
[Figure: query plan joining name(?y, ?n), worksAt(?x, dbpedia:KIT) and knows(?x, ?y) and producing results; source retrievers (Source Retriever 1, Source Retriever 2, ...) retrieve sources from Linked Data and push data and samples into the plan; a source ranker orders the discovered sources (Source 1, score 1.0; Source 2, score 0.7; ...) using the local source index.]

Push-based Symmetric Hash Join
• Operation
  – Maintains a hash table for each input
  – Tuples are inserted into one hash table and then the other is probed for join combinations
• Results reported as soon as input tuples arrive
• Tuples can arrive on all inputs in any order
• Push-based
  – Tuples are pushed into operators from the leaves to the root of the query plan
  – Execution driven by incoming tuples instead of results
[Figure: tuple t7 with key b is pushed on the left input, inserted into the left hash table and probed against the right one; the outputs (t7, t4) and (t7, t5) are pushed onwards. Hash tables shown: a → t1, t3; b → t4, t5; c → t6; and b → t2, t7.]
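
The operator can be sketched in a few lines. The class name, the callback `on_result` and the way keys are passed in are assumptions for illustration; the tuples follow the example in the figure, with t7 arriving on the left input.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Push-based symmetric hash join over two inputs, 'left' and 'right'."""

    def __init__(self, on_result):
        # One hash table per input, mapping join key -> tuples seen so far.
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.on_result = on_result   # parent operator or result sink

    def push(self, side, key, tup):
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(tup)            # insert
        for match in self.tables[other].get(key, []):  # probe the other table
            pair = (tup, match) if side == "left" else (match, tup)
            self.on_result(pair)                       # push result upwards

results = []
join = SymmetricHashJoin(results.append)
for key, tup in [("a", "t1"), ("a", "t3"), ("b", "t4"), ("b", "t5"), ("c", "t6")]:
    join.push("right", key, tup)
join.push("left", "b", "t2")
join.push("left", "b", "t7")
print(results[-2:])
```

Because every arriving tuple is both inserted and probed, results appear as soon as both partners have arrived, regardless of which input delivers first.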

Corrective Source Ranking • Prefer more relevant sources • Relevancy of a source is based on – Current query – Any available intermediate results – Overall optimization goal • Define a set of source features and derive concrete source metrics – Not all metrics are available for all sources (heterogeneity) • Refine previously computed metrics using newly discovered information

Source Features and Metrics – Source is more relevant if it contains data that contributes to answers of the query – Triple Pattern Cardinality – Join Pattern Cardinality

Metric Correction and Refinement
• During query processing new information becomes available: intermediate join results, links
  – Refine and correct previously computed metrics
  – Important in the case of non-discriminative patterns
• Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates
• Example
[Figure: sample intermediate results in the SHJ operator; perform triple pattern cardinality lookups.]
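
The sampling step can be sketched as a simple scale-up estimate. The estimator (average cardinality of the instantiated pattern times the number of intermediate results), the cardinality table and the function name are assumptions for illustration; the slides do not give the exact formula.

```python
def estimate_join_size(sample_bindings, total_bindings, cardinality_lookup):
    """Estimate a join's size from a sample of intermediate results.

    Each sampled binding instantiates the join variable of the triple
    pattern; the looked-up cardinalities are averaged and scaled to the
    total number of intermediate results.
    """
    if not sample_bindings:
        return 0.0
    total = sum(cardinality_lookup(b) for b in sample_bindings)
    return total / len(sample_bindings) * total_bindings

# Hypothetical cardinalities of name(?y, ?n) in one source, per value of ?y.
name_cardinality = {"ex:bob": 1, "ex:carol": 2}

def lookup(y):
    return name_cardinality.get(y, 0)

# Three sampled ?y bindings out of 30 intermediate results of knows(?x, ?y).
estimate = estimate_join_size(["ex:bob", "ex:carol", "ex:dave"], 30, lookup)
print(estimate)
```

The uninstantiated pattern name(?y, ?n) matches almost every source, so its raw cardinality says little; instantiating it with sampled bindings is what makes the metric discriminative.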

Conclusions • Query processing: which kinds of data, queries? – Focus: textual & structured queries and semantic data • Web of linked data creates opportunities and challenges – Optimization – Approximation – Routing – Top-k … and ranking • Web is linked data + a large amount of text – Hybrid management & integrated search

References
• Thanh Tran, Günter Ladwig: Structure Index for RDF. SemData@VLDB 2010
• Thanh Tran, Lei Zhang, Rudi Studer: Routing Keywords to Linked Data Sources. ISWC 2010
• Günter Ladwig, Thanh Tran: Linked Data Query Processing Strategies. ISWC 2010
• Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, Jürgen Umbrich: Data summaries for on-demand queries over linked data. WWW 2010: 411-420
• Olaf Hartig, Christian Bizer, Johann Christoph Freytag: Executing SPARQL Queries over the Web of Linked Data. ISWC 2009: 293-309