Integrating Keyword Search into XML Query Processing

  • Slides: 44
Presentation topics:
  • XML Query Language (XML-QL)
  • Extending XML-QL with Keyword Search
  • Extended XML-QL Implementation Using RDBMS
By: Alex Kremer, Ariel Rosenblatt

Bibliography (well-formed, but invalid)
  • Bibliography
  • Article elements are from different sources
  • Same information, but using different XML Schemas / DTDs (Document Type Definitions)

XML Queries
  • XML is becoming the data storage and exchange format of choice in many applications
  • Handling XML data requires a rich and powerful query language
      • It must allow querying both the content and the structure of an XML document
      • Varying or unknown structures can make formulating queries very difficult

XML Queries: Why not SQL/OQL
  • XML is not rigidly structured
  • In XML the schema can exist with the data as tag names
      • If a DTD is not available, the schema is built while the document is parsed
      • Missing elements or multiple occurrences of the same element
  • This flexibility is crucial for EDI (Electronic Data Interchange)

XML Query Requirements: W3C Working Group
  • Goals:
      • Support different usage scenarios
      • Define data model + query operators
      • Define query language syntax
      • Interoperate with other XML working groups

XML Query Requirements: Usage Scenarios
  • Human-readable documents
      • Manuals, books, articles
  • Data-oriented documents
      • XML representation of database data, object data, …
      • The XML representation might be either physical or virtual

XML Query Requirements: Usage Scenarios Contd.
  • Mixed-model documents:
      • Hybrid of document-oriented and data-oriented
      • Catalogues, patient health records, …
  • Administrative data:
      • Configuration files, user profiles, administrative logs

XML Query Requirements: Usage Scenarios Contd.
  • Filtering streams:
      • On-line filtering / extracting / transforming / routing of XML data streams
      • Logs of email messages, network packets, stock market data, newswire feeds
  • Document Object Model (DOM)
      • Perform queries on DOM structures to return sets of nodes that meet the specified criteria

XML Query Requirements: Usage Scenarios Contd.
  • Multiple syntactic environments for queries embedded in:
      • URL, XML, JSP or ASP pages, a string in a general-purpose programming language, …

XML Query Requirements: Interoperability
  • Results must be returned in a DOM-compatible manner
  • XPath (used in XPointer and XSLT)
      • XPath expressibility and search facilities should be used in query syntax
  • Usage of XML Schema (XSDL) and/or DTD

XML Query Languages: Proposals to W3C
  • XQL (heavily based on XPath)
  • XML-QL

XML-QL
  • It is declarative
  • It is “relational complete”; in particular it can express joins
  • Simple enough to enable optimizations
  • It can extract data from existing XML documents and construct new documents (transformations)

XML-QL: Syntax
    WHERE ( xml-pattern [ ELEMENT_AS $elem_var ] )* IN url, ( predicate )*
    CONSTRUCT xml-pattern | $variable
  • The WHERE clause specifies how to filter data from the input XML dataset
  • The CONSTRUCT clause specifies how to assemble the query results in XML

XML-QL: Example #1
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like *Florescu*
    CONSTRUCT <result> $E </result>
  • Yields the following result

XML-QL Explained: The Data Model
  • A set of XML documents must be represented (XML data set)
  • XML elements in a dataset can be partitioned according to their types
  • Information must be represented in a lossless manner (the original data set must be recreatable from the representation)

XML-QL Explained: Data Model Representation
[Figure: graph representation of the bibliography. Element nodes ID00 to ID14 are connected by edges labeled with tag and attribute names (article, id, link, date, title, author, name, source) to leaf values such as “Daniela Florescu”.]

XML-QL Explained: Data Model Representation
  • Dataset D is represented as a graph GD:
  • Nodes:
      • Element e: node Ne, uniquely labeled IDe
      • Data value v: leaf Lv, uniquely labeled v
  • Edges:
      • (Ne, Ne’) labeled with the tag of e’, if e’ is directly nested within e (<e><e’>…</e’></e>)
      • (Ne, Lv) labeled with “”, if v is directly contained within e (<e>v</e>)
      • (Ne, Lv) labeled with attribute name a, if v is the value of attribute a of element e (<e a=“v”>…</e>)
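
The node and edge rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `build_graph`, the (source, label, target) triple layout, and the choice to number nodes in document order are all assumptions made here.

```python
import itertools
import xml.etree.ElementTree as ET


def build_graph(root):
    """Return (edges, ids) where edges are (source, label, target) triples.

    Assumption: element nodes are numbered ID00, ID01, ... in document
    order, and leaf nodes are represented directly by their string value.
    """
    counter = itertools.count()
    ids = {}    # element object -> node label such as "ID00"
    edges = []

    def visit(elem):
        ids[elem] = f"ID{next(counter):02d}"
        for attr, value in elem.attrib.items():
            edges.append((ids[elem], attr, value))            # attribute edge
        text = (elem.text or "").strip()
        if text:
            edges.append((ids[elem], "", text))               # content edge, empty label
        for child in elem:
            visit(child)
            edges.append((ids[elem], child.tag, ids[child]))  # nesting edge

    visit(root)
    return edges, ids


# Hypothetical sample document, not the presenters' bibliography.xml.
sample = ET.fromstring(
    '<bibliography><article id="4"><title>Integrating</title></article></bibliography>'
)
edges, ids = build_graph(sample)
for edge in edges:
    print(edge)
```

Storing the edges as plain triples keeps the sketch close to the slide's definition; a real system would more likely keep separate node and edge tables.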

XML-QL Explained: Query Processing
  • An XML pattern can also be modeled by a graph
      • Some labels in the graph are now variables
  • The result of evaluating query q on the input D is:
      • Each mapping from the graph Gq to the graph GD which preserves the constant labels
      • This mapping induces a substitution of the variables in the query on the set of constant values
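
As a rough illustration of "mappings that preserve the constant labels", the sketch below matches a tree-shaped pattern against an XML tree and yields one variable binding per mapping. The pattern encoding (nested `(tag, body)` tuples with `$`-prefixed variables), the helper names, and the sample data are assumptions made here; XML-QL itself also supports attributes and path expressions, which this sketch ignores.

```python
import xml.etree.ElementTree as ET


def match(elem, pattern, binding):
    """Yield every extension of `binding` that maps `pattern` onto `elem`."""
    tag, body = pattern
    if elem.tag != tag:                  # constant labels must be preserved
        return
    if isinstance(body, str):            # leaf: a variable or a constant value
        text = (elem.text or "").strip()
        if body.startswith("$"):
            if binding.get(body, text) == text:
                yield {**binding, body: text}
        elif body == text:
            yield binding
        return
    yield from match_children(elem, body, binding)


def match_children(elem, subpatterns, binding):
    """Map each sub-pattern onto some child of `elem`, threading the binding."""
    if not subpatterns:
        yield binding
        return
    first, rest = subpatterns[0], subpatterns[1:]
    for child in elem:
        for extended in match(child, first, binding):
            yield from match_children(elem, rest, extended)


# Pattern of Example #1: <article><author><name>$N</name></author><title>$T</title></article>
pattern = ("article", [("author", [("name", "$N")]), ("title", "$T")])

# Hypothetical sample data.
doc = ET.fromstring(
    "<bibliography>"
    "<article><title>First sample title</title></article>"
    "<article>"
    "<author><name>Daniela Florescu</name></author>"
    "<author><name>A. N. Other</name></author>"
    "<title>Second sample title</title>"
    "</article>"
    "</bibliography>"
)
bindings = [b for article in doc for b in match(article, pattern, {})]
florescu = [b for b in bindings if "Florescu" in b["$N"]]   # the "$N like *Florescu*" predicate
print(florescu)
```

The first article produces no binding because it has no author, which mirrors how a mapping fails when a constant label in the pattern has no counterpart in the data graph.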

XML-QL Explained: A Query Graph for Example #1
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like *Florescu*
    CONSTRUCT <result> $E </result>
[Figure: query graph with an article root, a title edge leading to $T, and an author edge followed by a name edge leading to “*Florescu*”.]

XML-QL Explained: Query Processing, Example #1
[Figure: the query graph matched against the data graph of the bibliography. One article fails because it has no <author>; another fails because it has no <name> element (“name” is an attribute there); the article ID08 matches and is added to the results, with $E = ID08 and $T = “Integrating Keyword Search…”.]

XML-QL: Advanced Queries Example #2 (More Florescu)
    WHERE <article>
            <*><author><name>$N</name></author></*>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like *Florescu*
    CONSTRUCT <result> $E </result>
    union
    WHERE <article>
            <*><author><_ name=$N></_></author></*>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like *Florescu*
    CONSTRUCT <result> $E </result>
  • We now look for articles where the author name can also be an attribute!

XML-QL: Disadvantages
  • We need to know the XML structure in order to query
  • We can still perform more efficient queries, where we get all the information available, but
  • These queries can easily grow very complex, as seen previously

XML-QL: Keyword Search Extension
  • Addition of a special predicate called contains to XML-QL
      • Tests the existence of a given word within an XML element
      • Works on partially known or unknown XML structure
      • Allows querying several XML documents with different structures

Extended XML-QL: The contains Predicate
  • The contains predicate has 4 arguments, ($E, word, depth, location):
      • $E is an XML element variable
      • word is the word we are searching for
      • depth is an integer expression limiting the depth at which the word is found within the element
      • location is a boolean expression over the set of constants {tag_name, attribute_name, content, attribute_value}
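
To make the four arguments concrete, here is a minimal in-memory sketch of such a predicate in Python. It is not the authors' implementation (which uses inverted files in an RDBMS). Two things are assumptions: the depth convention (an element's own tag at depth 0, its attribute names and text one level below, attribute values two levels below), and modeling the location expression as a set of allowed constants, which only covers disjunctions such as "content or attribute_value".

```python
import xml.etree.ElementTree as ET

# "any" is modeled as the full set of location constants.
ANY = frozenset({"tag_name", "attribute_name", "content", "attribute_value"})


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for every word at or below `elem`."""
    yield elem.tag, depth, "tag_name"
    for attr, value in elem.attrib.items():
        yield attr, depth + 1, "attribute_name"
        for word in value.split():
            yield word, depth + 2, "attribute_value"   # assumed depth convention
    for word in (elem.text or "").split():
        yield word, depth + 1, "content"
    for child in elem:
        yield from occurrences(child, depth + 1)
        for word in (child.tail or "").split():        # text following a child element
            yield word, depth + 1, "content"


def contains(elem, word, max_depth, locations=ANY):
    """True if `word` occurs within `elem` at depth <= max_depth in an allowed location."""
    return any(
        w == word and d <= max_depth and loc in locations
        for w, d, loc in occurrences(elem)
    )


# Hypothetical author elements: the name as a sub-element and as an attribute.
nested = ET.fromstring("<author><name>Daniela Florescu</name></author>")
flat = ET.fromstring('<author name="Daniela Florescu"/>')
for author in (nested, flat):
    print(contains(author, "Florescu", 3, {"content", "attribute_value"}))
```

Both calls succeed even though the two authors are structured differently, which is the point of the extension: the query no longer has to spell out each structural variant.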

Extended XML-QL: Example #3
  • We can use the extended XML-QL to formulate a query which yields the same result as Example #2
    WHERE <article>
            <author></author> ELEMENT_AS $A
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          contains($A, "Florescu", 3, content or attribute_value)
    CONSTRUCT <result> $E </result>

Extended XML-QL: Example #4
  • We are able to query unstructured data (full-text search) within a set of articles:
    WHERE <article></article> ELEMENT_AS $E IN "bibliography.xml",
          contains($E, "Florescu", 3, any)
    CONSTRUCT <result> $E </result>
  • Yielding the result

Implementing the contains predicate
  • The authors suggest an implementation of the XML-QL extension on top of a commercial RDBMS:
      • Oracle 8, IBM DB2, MS-SQL, …

Implementation Using RDBMS
  • Reasons:
      • Easy to implement an extended XML query processor
      • Universally available
      • An RDBMS allows mixing XML data with other (relational) data
      • Very good performance over large volumes of data

Relational Support for Full-text Indexing
  • Use of extended inverted files to implement:
      • The contains predicate
      • Finding of relevant XML data sources (URLs) in a distributed environment
  • We will use an RDBMS to implement inverted files

Inverting Files
  • For our needs the inverted file will contain tuples of the following format:
      • <word, elID, depth, location>
  • Examples from bibliography.xml:
      • <“article”, elID01, 0, tag>
      • <“id”, elID01, 1, attr>
      • <“Requirements”, elID01, 2, value>
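
A small Python sketch can show how tuples of this shape might be produced from a document. The depth rule used here (the element's own tag at depth 0, its attribute names and text at depth 1, and so on for nested elements) is inferred from the three examples above and is an assumption, as are the function names, the elID numbering in document order, and the sample document. Attribute values are left out because the examples do not show how they are recorded.

```python
import xml.etree.ElementTree as ET


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for words at or below `elem`."""
    yield elem.tag, depth, "tag"
    for attr in elem.attrib:
        yield attr, depth + 1, "attr"
    for word in (elem.text or "").split():
        yield word, depth + 1, "value"
    for child in elem:
        yield from occurrences(child, depth + 1)


def inverted_file(root):
    """Return sorted (word, elID, depth, location) tuples for every element."""
    tuples = set()
    for number, elem in enumerate(root.iter()):   # document order: 0, 1, 2, ...
        el_id = f"elID{number:02d}"
        for word, depth, location in occurrences(elem):
            tuples.add((word, el_id, depth, location))
    return sorted(tuples)


# Hypothetical sample document, not the presenters' bibliography.xml.
sample = ET.fromstring(
    '<bibliography><article id="1"><title>XML Requirements</title></article></bibliography>'
)
index = inverted_file(sample)
for entry in index:
    print(entry)
```

Note that the same word is recorded once per enclosing element, each time with the depth relative to that element, which is why such a file becomes large.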

Storing Inverted Files in RDBMS: Unique Internal elIDs
  • Unique element IDs are modeled as records containing:
      • Document locators (URLs)
      • Element locators within the document
          • Using absolute positions (start, end)
          • Using unique identifiers specified by the DTD (explicit id attribute)
  • Why not XPointer?

Storing Inverted Files in RDBMS: Unique elID Schemes
  • After normalization the authors propose the following schema:
      • Elements(elID, docid, start_pos, end_pos, type, id_val)
      • Documents(docid, URL)
  • From this point on, elID can be used as an internal key for faster processing

Storing Inverted Files in RDBMS
  • Natural way, using the schema:
      • contains(elID, word, depth, location)
  • Huge! We partition it into word tables, one for each keyword <word> in the dataset:
      • <word>(elID, depth, location)
  • Virtually all IR (Information Retrieval) systems use partitioning by word

Storing Inverted Files in RDBMS: Further Partitioning
  • We use further partitioning to optimize query processing:
      • The type (tag) of the element is usually known at predicate evaluation time, by looking at the XML pattern of the query
  • We further partition the individual <word> tables by the type of the element they are in:
      • <word>-<type>(elID, depth, location)
      • Table examples: Name-author, Florescu-name

Implementation: Extended XML-QL Query Processing
  • Two ways:
      • Replicating the whole XML data in an RDBMS
          • XML-QL processing is entirely performed in an RDBMS
      • Distributed XML query processing
          • Only the index (contains) is stored in an RDBMS

Replicating the XML Data in an RDBMS
  • The binary table approach:
      • For each type (tag name or attribute name), a table is built with the following schema: <type>(parent, element, value)
      • The parent element contains the element of type <type>
      • element is null if a <type> has no subelements or if <type> is an attribute name (in that case we are usually interested in the value)
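
The binary table approach can be tried out with SQLite from Python's standard library. The sketch below is an illustration under assumptions, not the authors' loader: the function name `load_binary_tables`, the integer element IDs assigned in document order, and the untyped columns are choices made here.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET


def load_binary_tables(conn, root):
    """Create one <type>(parent, element, value) table per tag or attribute name."""
    counter = itertools.count()

    def insert(type_name, parent, element, value):
        table = '"' + type_name.replace('"', '""') + '"'   # quote the identifier
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (parent, element, value)")
        conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (parent, element, value))

    def visit(elem, parent_id):
        el_id = next(counter)
        has_subelements = len(elem) > 0
        text = (elem.text or "").strip() or None
        # element is NULL when the element has no subelements
        insert(elem.tag, parent_id, el_id if has_subelements else None, text)
        for attr, value in elem.attrib.items():
            insert(attr, el_id, None, value)                # attributes: element is NULL
        for child in elem:
            visit(child, el_id)

    visit(root, None)


# Hypothetical sample document.
conn = sqlite3.connect(":memory:")
sample = ET.fromstring(
    '<bibliography><article id="4"><title>Integrating</title></article></bibliography>'
)
load_binary_tables(conn, sample)
print(conn.execute("SELECT parent, element, value FROM article").fetchall())
print(conn.execute("SELECT parent, element, value FROM title").fetchall())
```

Because tag names and attribute names share one namespace of tables, an author name stored as a `name` element and one stored as a `name` attribute end up in the same `name` table.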

Replicating the XML Data in an RDBMS: XML-QL Queries
  • Every XML-QL query can be translated into an equivalent SQL query
  • The SQL query will process the binary tables of the replicated XML data

XML-QL to SQL: Example #5 (from Example #1)
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $N like *Florescu*
    CONSTRUCT <result> $E </result>

    SELECT article.element
    FROM article, author, name, title
    WHERE article.element = author.parent AND
          author.element = name.parent AND
          article.element = title.parent AND /* title exists */
          name.value LIKE '%Florescu%'
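
The translated query can be run against a toy set of binary tables using SQLite from Python. The rows below are made-up sample data and the element IDs are arbitrary; the only purpose is to show the three joins and the name filter at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent, element, value);
CREATE TABLE author  (parent, element, value);
CREATE TABLE name    (parent, element, value);
CREATE TABLE title   (parent, element, value);

-- Hypothetical data: article 1 has no author, article 8 has two authors.
INSERT INTO article VALUES (0, 1, NULL), (0, 8, NULL);
INSERT INTO title   VALUES (1, NULL, 'First sample title'),
                           (8, NULL, 'Second sample title');
INSERT INTO author  VALUES (8, 10, NULL), (8, 12, NULL);
INSERT INTO name    VALUES (10, NULL, 'Daniela Florescu'),
                           (12, NULL, 'A. N. Other');
""")

rows = conn.execute("""
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element  = name.parent
  AND article.element = title.parent   -- title exists
  AND name.value LIKE '%Florescu%'
""").fetchall()
print(rows)
```

Each nesting step in the XML pattern becomes one equality join between a table's element column and the next table's parent column.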

Extended XML-QL to SQL: Keyword Search
  • Processing the contains predicate involves the use of inverted file tables
  • The word-type table has to be joined with the previous result
  • The word-type table is the table that results from partitioning by word and by type

Extended XML-QL to SQL: Example #6
    WHERE <article>
            <author></author> ELEMENT_AS $A
            <title>$Ttext</title> ELEMENT_AS $T
          </article> ELEMENT_AS $E IN "bibliography.xml",
          contains($A, "Florescu", 3, any),
          contains($T, "Integrating", 3, any)
    CONSTRUCT <result> $Ttext </result>

    SELECT title.value
    FROM article, author, title, "Florescu-author", "Integrating-title"
    WHERE article.element = author.parent AND
          author.element = "Florescu-author".elID AND
          article.element = title.parent AND
          title.element = "Integrating-title".elID
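
A runnable version of this join, using SQLite from Python, might look as follows. Several things are assumptions of this sketch rather than part of the slides: the sample rows, the convention that every element row (including title) stores its own ID in the element column so that it can be joined with the word-type tables, and the explicit depth conditions, which are added here to reflect the depth argument 3 of the two contains predicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent, element, value);
CREATE TABLE author  (parent, element, value);
CREATE TABLE title   (parent, element, value);
-- Word-type tables (inverted file partitioned by word and element type).
CREATE TABLE "Florescu-author"   (elID, depth, location);
CREATE TABLE "Integrating-title" (elID, depth, location);

-- Hypothetical data: both articles have a Florescu author,
-- but only article 8 has "Integrating" in its title.
INSERT INTO article VALUES (0, 4, NULL), (0, 8, NULL);
INSERT INTO author  VALUES (4, 6, NULL), (8, 10, NULL);
INSERT INTO title   VALUES (4, 5, 'A sample title'),
                           (8, 9, 'Integrating Keyword Search');
INSERT INTO "Florescu-author"   VALUES (6, 2, 'attribute_value'), (10, 2, 'content');
INSERT INTO "Integrating-title" VALUES (9, 1, 'content');
""")

rows = conn.execute("""
SELECT title.value
FROM article, author, title,
     "Florescu-author" AS fa, "Integrating-title" AS it
WHERE article.element = author.parent
  AND author.element  = fa.elID
  AND article.element = title.parent
  AND title.element   = it.elID
  AND fa.depth <= 3 AND it.depth <= 3   -- depth limits, added in this sketch
""").fetchall()
print(rows)
```

The table names contain a hyphen, so they have to be written as quoted identifiers in SQL.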

Distributed XML Query Processing
  • XML data can be indexed in an RDBMS, but
  • The XML data cannot be stored in the RDBMS
      • Reasons: volume (entire WWW) or legal
  • The mediator (query interface):
      • Uses inverted files in the RDBMS, but
      • Accesses the data sources to compute the full query result (expensive!)
      • Loads relevant documents/elements into the RDBMS and processes the query as described before (XML-QL to SQL)

Distributed XML Query Processing: Elements Retrieval
  • Use of inverted files for the retrieval of relevant documents/elements:
      • Evaluate contains predicates to disqualify irrelevant elements
      • Further reduce the dataset needed to process the remaining basic XML-QL query
          • This is an optimization, since retrieval of remote data is expensive
      • Load the relevant documents/elements

Distributed XML Query Processing: Reducing Retrieval
    WHERE <article>
            <author><name>$N</name></author>
            <title>$T</title>
          </article> ELEMENT_AS $E IN "bibliography.xml",
          $T like *XML*
    CONSTRUCT <result> $N </result>
  • Get the intersection of elID sets from:
      • author-article
      • name-article
      • title-article
      • XML-article
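
The intersection step can be sketched with plain Python sets. The dictionary of word-type tables, the element IDs in it, and the function name `candidate_elements` are hypothetical; in the approach described here the sets would come from the word-type tables stored in the RDBMS.

```python
# Hypothetical word-type tables, reduced to their elID sets:
# (word, type) -> IDs of elements of that type containing the word.
word_type_tables = {
    ("author", "article"): {4, 8},
    ("name", "article"): {4, 8},
    ("title", "article"): {1, 4, 8},
    ("XML", "article"): {1, 8},
}


def candidate_elements(tables, words, element_type):
    """Intersect the elID sets of all words for one element type."""
    sets = [tables.get((word, element_type), set()) for word in words]
    return set.intersection(*sets) if sets else set()


candidates = candidate_elements(
    word_type_tables, ["author", "name", "title", "XML"], "article"
)
print(candidates)
```

Only the elements left after the intersection need to be fetched from the remote sources; a word with no table at all yields an empty candidate set, so nothing is fetched.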

Conclusions
  • XML-QL can be extended to support keyword search
  • Use of RDBMS:
      • Inverted files can be stored and queried using an RDBMS
      • XML data itself can be replicated and queried in the RDBMS
      • Keyword search and overall XML query processing can be carried out very efficiently
  • Data structure influence:
      • The more structure is known, the faster a query will be executed
      • Totally unstructured queries can be executed very fast
      • The more structure is known, the higher the quality of the query results