
Using Semantics in XML Data Management

Tok Wang Ling
Department of Computer Science
National University of Singapore

Gillian Dobbie
Department of Computer Science
University of Auckland

April 9, 2007
SWIIS, Bangkok

Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [6]
3. The applications of ORA-SS
   • Semantic query optimization in XML
4. Conclusion

[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc. 2005.


1. XML – Brief introduction
• XML (eXtensible Markup Language) is
  – Released by W3C
  – An application of SGML
  – A promising standard for publishing, integrating and exchanging data on the web
• XML schemas
  – DTD (Document Type Definition) [4]
  – XSD (XML Schema Definition), W3C recommended standard [8, 9, 10]

[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

1. XML – A motivating example
• Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where
  – The document has a root element psj;
  – Under psj, there is a sequence of part elements;
  – Under part, there is a sequence of supplier elements;
  – Under supplier, there is a sequence of project elements.

Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>60</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>650</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S002</sno> <sname>Beta</sname> <city>Atlanta</city> <city>New York</city> <price>5.5</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>70</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
  … …
  <part>
    <pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>4.6</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>60</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S003</sno> <sname>Beta</sname> <city>New York</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>20</qty>
      </project>
      <project>
        <jno>J004</jno> <jname>Blue fireworks</jname> <budget>20000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
</psj>

Figure 1. Example XML document

1. XML – The DTD of “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(a) “psj.dtd”, the DTD of “psj.xml”

psj
  part
    pno
    pname
    color
    supplier
      sno
      sname
      city
      price
      project
        jno
        jname
        budget
        qty

(b) psj.dtd as a DataGuide

Figure 2. DTD and DataGuide of the example XML document

1. XML – What the DTD says
• DTD is a simple definition of an XML document, where users can define
  – Element/attribute types
  – Occurrence constraints (e.g. ?, +, *)
  – Containment among different element types (the structure)
• DTD cannot express
  – Numeric occurrence constraints (e.g. 2 to 8)
  – Uniqueness/key constraints on a combination of attributes/elements (in a DTD, ID can be declared on only one attribute at a time)
  – Relationship types among elements and their degrees
  – The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type. Simple elements are element types that contain only PCDATA and have no attributes.

1. XML – XSD
“psj.xsd”, the XSD schema of the motivating example data.

<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

Notes on the schema:
• maxOccurs="unbounded" is the XSD definition of an element occurrence constraint.
• The xs:key element is the XSD definition of a key constraint. It requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.

Figure 3. XML Schema of the example XML document

1. XML – What XSD can tell
• XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors, which
  – has an extensible XML syntax,
  – supports more data types (user-defined types and 37 built-in types),
  – is able to represent uniqueness/key constraints for both attribute types and element types,
  – and has many other improvements in comparison with DTD.

1. XML – XSD still has flaws
XSD is not sufficient for expressing the relational semantics in XML data. For example:
1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD, and it is quite different from the key constraint in relational databases.
  – E.g. in the previous XSD, the values of the key attribute pno of part must be unique within the set of part elements in the whole document.
  – Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively.

1. XML – XSD still has flaws (cont.)
  – The key element must contain the following (in order):
    a) One and only one selector element, which contains an XPath expression that specifies the set of elements across which the values specified by the fields must be unique.
    b) One or more field elements, each of which contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element.
  – The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.
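The scope issue with document-wide keys can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the data is a cut-down copy of psj.xml, the helper name `values_unique` is hypothetical, and the check mimics a selector/field pair in plain Python rather than running a real XSD validator. Because supplier S001 is repeated under two parts, a document-wide uniqueness check on sno fails even though sno identifies a supplier.

```python
import xml.etree.ElementTree as ET

# Cut-down version of psj.xml (assumption): supplier S001 supplies two parts.
PSJ = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><price>5</price></supplier>
    <supplier><sno>S002</sno><price>5.5</price></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno><price>4.6</price></supplier>
  </part>
</psj>
"""

def values_unique(root, selector, field):
    """Return True if `field` is unique across all elements matched by `selector`.

    Hypothetical helper that imitates a selector/field pair; it is not an
    XSD validator.
    """
    values = [elem.findtext(field) for elem in root.findall(selector)]
    return len(values) == len(set(values))

root = ET.fromstring(PSJ)
pno_is_key = values_unique(root, "part", "pno")          # parts are not repeated
sno_is_key = values_unique(root, ".//supplier", "sno")   # S001 occurs twice
```

In this sketch `pno_is_key` is true and `sno_is_key` is false, which is one way to read the limitation stated on the slides.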

1. XML – XSD still has flaws (cont.)
2. XSD does not support relationship types and other relational semantic constraints.
  – E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
  – E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.


2. ORA-SS in a nutshell
• ORA-SS is a semantically rich data model for semistructured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.
• In contrast to a traditional ER diagram, an ORA-SS schema diagram represents the hierarchical structure of XML data.

2. ORA-SS in a nutshell
• A complete ORA-SS model has 4 diagrams
  – Schema diagram
    • Represents the structure of and constraints (business rules) on XML documents
  – Instance diagram
    • Visually represents the graphical structure of XML data
  – Functional dependency diagram
    • Represents FDs in relationship types
  – Inheritance diagram
    • Represents the specialization/generalization relationships among different object classes in ORA-SS

2. ORA-SS data models
• Object class
  – attributes of object class
  – ordering on object class
• Relationship type
  – degree of relationship type
  – participating object classes in relationship type
  – attributes of relationship type
  – disjunctive relationship type
  – recursive relationship type
  – ID dependent relationship type

2. ORA-SS data models (cont.)
• Attribute
  – attributes of object class or relationship type
  – key attribute (OID)
  – foreign key / referential constraint (IDREF/IDREFS)
  – composite attribute
  – disjunctive attribute
  – attribute with unknown structure
  – ordering on attributes
  – fixed or default value of attribute
  – derived attribute

The ORA-SS schema diagram of Example 1
• Part, supplier and project are modeled as object classes.
• PS is a binary relationship type between part and supplier.
• PSJ is a ternary relationship type defined among part, supplier and project.
• pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• Price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

[Diagram: part (pno, pname, color), linked by the edge "PS, 2, +, +" to supplier (sno, sname, city +, price labeled PS), linked by the edge "PSJ, 3, +, +" to project (jno, jname, budget, qty labeled PSJ)]

Figure 4. ORA-SS schema diagram of the example XML document

ORA-SS – Semantic advantages
• ORA-SS can represent the following semantics that DTD and XML Schema cannot:
  – Attribute vs. object class
  – Multi-valued attribute vs. object class
  – Identifier (ID)
  – IDREF or foreign key
  – n-ary relationship type
  – Attribute of object class vs. attribute of relationship type
  – View of XML document


3. ORA-SS applications
• Due to the rich semantics in ORA-SS, the model can be widely used in
  – Normal form XML schema
  – Relational/object-relational storage of XML data
  – XML schema/data integration
  – XML query optimization [12]
  – XML aggregates evaluation
  – XML view creation and validation [2]
  – XML graphical query language and output [7]
  – XML keyword search [13]
  – etc.
We will illustrate some of these in detail.

[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland. Oct 7-11, 2002.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.

3. ORA-SS applications – Semantic query optimization
• The semantic information represented in ORA-SS is helpful in optimizing XML queries.
  – Many algorithms have been proposed for XML query optimization, e.g. TwigStack [1] and its variants.
  – When the ORA-SS semantics of the data are known, they can be taken into account for query optimization.

[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference, 2002.

3. ORA-SS applications – Semantic query optimization
Example: Consider the following simple query.
(Query 1) Display the budget of project “J001”.

  //project[jno = "J001"]/budget

• Traditional processing has to scan the whole XML document, checking every project for jno = “J001” and collecting all corresponding budget values.
• However, in ORA-SS, jno is the object ID of project, so we have the functional dependency jno → budget. The optimized processing therefore only needs to find the first project instance with jno = “J001” and return the corresponding budget value.
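The difference between the two strategies for Query 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the document is a cut-down, hypothetical copy of psj.xml, the function names are made up, and a sequential walk with `xml.etree.ElementTree` stands in for a real XML query processor. The early exit is only valid when the functional dependency jno → budget is known to hold.

```python
import xml.etree.ElementTree as ET

# Cut-down data (assumption): project J001 occurs under two suppliers.
DOC = """<psj>
  <part><supplier>
    <project><jno>J001</jno><budget>20000</budget></project>
    <project><jno>J003</jno><budget>250000</budget></project>
  </supplier></part>
  <part><supplier>
    <project><jno>J001</jno><budget>20000</budget></project>
  </supplier></part>
</psj>"""

def budgets_full_scan(root, jno):
    """Visit every project element and collect all matching budgets."""
    visited, budgets = 0, []
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            budgets.append(project.findtext("budget"))
    return budgets, visited

def budget_first_match(root, jno):
    """Stop at the first matching project; relies on jno -> budget."""
    visited = 0
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            return project.findtext("budget"), visited
    return None, visited

root = ET.fromstring(DOC)
all_budgets, scanned = budgets_full_scan(root, "J001")
first_budget, scanned_early = budget_first_match(root, "J001")
```

On this toy document the full scan visits all three project elements and returns the same budget twice, while the early-exit version stops after the first one.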

3. ORA-SS applications – Semantic query optimization – Content search
• Most existing algorithms focus on the structural search of twig pattern queries.
• Few of them pay much attention to content search for the values of elements.
• They treat content nodes (or values) the same as element nodes.
• Disadvantages:
  – Too many label streams for contents
  – Difficult to find the actual values of labels as output solutions
• We propose VERT (Value Extraction with Relational Table).

3. ORA-SS applications – Semantic query optimization – Content search
• Idea of VERT:
  1. Introduce relational tables to store document values instead of treating them as nodes and labeling them.
  2. Rewrite and optimize XML twig queries based on the underlying relational tables.
  3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).

3. ORA-SS applications – Semantic query optimization – Content search
1. Introduce relational tables to store document values instead of treating them as nodes and labeling them.
E.g. the values for price (title, etc.) of the XML tree in Figure 5 can be stored with the labels of the price (title, etc.) elements in Figure 6.

Figure 5. Example XML document 2
Figure 6. Example VERT tables

3. ORA-SS applications – Semantic query optimization – Content search
2. Rewrite and optimize XML twig queries based on the underlying relational tables, e.g.
  – Rewrite the twig query in Figure 7(a) to the twig in Figure 7(b).
  – Execute SQL on table Rprice of Figure 6 to get all labels of price elements with value greater than 15 and form the stream Tprice>15.
  – Perform structural joins based on these labels for price elements (i.e. Tprice>15) with book and ISBN elements.
Benefits:
  • Saves the stream merging of all price elements with values > 15
  • Saves the structural join between price elements and their values

Figure 7. Example twig query: (a) twig query, (b) rewritten query
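The rewriting step can be sketched in Python with the standard `sqlite3` module. Everything concrete here is an assumption made for illustration: the region labels (start, end), the column names `start_pos`/`end_pos`, the sample values, and the naive nested-loop join, which merely stands in for a holistic twig join such as TwigStack. The sketch only shows the division of labor: the value predicate is answered by SQL on the price table, and the structural part works on labels alone.

```python
import sqlite3

# Hypothetical region labels (start, end): a is an ancestor of b
# when a.start < b.start and b.end < a.end.
books = [(1, 10), (11, 20), (21, 30)]      # labels of book elements
isbns = [(2, 3), (12, 13), (22, 23)]       # labels of ISBN elements
price_rows = [(4, 5, 12.0), (14, 15, 18.5), (24, 25, 30.0)]  # label + value

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Rprice (start_pos INTEGER, end_pos INTEGER, value REAL)"
)
conn.executemany("INSERT INTO Rprice VALUES (?, ?, ?)", price_rows)

def contains(ancestor, descendant):
    """Region-containment test on (start, end) labels."""
    return ancestor[0] < descendant[0] and descendant[1] < ancestor[1]

# Step 1: content search in the relational table yields the stream T(price>15).
t_price = conn.execute(
    "SELECT start_pos, end_pos FROM Rprice WHERE value > 15 ORDER BY start_pos"
).fetchall()

# Step 2: structural join of the filtered price labels with book and ISBN
# labels (naive nested loops, used here only to keep the sketch short).
matches = [
    (book, isbn)
    for book in books
    if any(contains(book, price) for price in t_price)
    for isbn in isbns
    if contains(book, isbn)
]
```

Only the two books whose price label survives the SQL filter take part in the join, so no value nodes have to be labeled or merged as streams.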

3. ORA-SS applications – Semantic query optimization – Content search
3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).
Optimization 1 (VERT-1): store the values of price (title, etc.) with the labels of book objects, since price (title) is a property of the book object class according to the semantics captured in ORA-SS (shown in Figure 8).
Benefit: further saves the structural joins between price and book and between ISBN and book for the query in Figure 7.

Figure 8. VERT tables with optimization 1

3. ORA-SS applications – Semantic query optimization – Content search
3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).
Optimization 2 (VERT-2): pre-merge the tables of title, price, etc. in Figure 8 if we further know that they are single-valued attributes of the book object class according to the semantics in ORA-SS (shown in Figure 9). (Note: the multi-valued attribute author should not be merged.)
Benefit: replaces expensive structural joins with an efficient selection on the table for the query in Figure 7.

Figure 9. VERT tables with optimization 2
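A pre-merged table in the spirit of VERT-2 can be sketched with `sqlite3`. The table names `Rbook` and `Rauthor`, the column names and all sample rows are hypothetical, since the contents of Figure 9 are not reproduced in this text. The point of the sketch is that single-valued attributes of book share one row keyed by the book label, so the query for the ISBNs of books with price > 15 becomes a plain selection, while the multi-valued author stays in its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per book object: book label plus its single-valued attributes.
conn.execute(
    "CREATE TABLE Rbook (book_label INTEGER, isbn TEXT, title TEXT, price REAL)"
)
# The multi-valued attribute author is kept separate (one row per value).
conn.execute("CREATE TABLE Rauthor (book_label INTEGER, author TEXT)")

conn.executemany(
    "INSERT INTO Rbook VALUES (?, ?, ?, ?)",
    [
        (1, "isbn-A", "XML Basics", 12.0),
        (11, "isbn-B", "Twig Joins", 18.5),
        (21, "isbn-C", "Keyword Search", 30.0),
    ],
)
conn.executemany(
    "INSERT INTO Rauthor VALUES (?, ?)",
    [(1, "Ann"), (11, "Bob"), (11, "Cy"), (21, "Dee")],
)

# With the merged table the whole query is a single selection;
# no structural join between book, price and ISBN labels is needed.
isbns = [
    row[0]
    for row in conn.execute(
        "SELECT isbn FROM Rbook WHERE price > 15 ORDER BY book_label"
    )
]
```

Merging author into `Rbook` would duplicate the book row for every author, which is why the slides exclude multi-valued attributes from the merge.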

3. ORA-SS applications – Semantic query optimization – Content search
Experimental results on three datasets, i.e. NASA, DBLP and XMark (Figure 10):
• VERT outperforms TwigStack in query processing time.
• VERT-2 is superior to VERT-1, which is in turn better than the original VERT.

Figure 10. Experimental results of VERT

3. ORA-SS applications – XML query with aggregates
• XML semantics captured in ORA-SS are crucial for correctly writing queries with aggregates.
Example. Consider the query:
(Query 3) Find the average budget of all the projects.
Two potential XQuery expressions are:

XQ.3a
  for $pid in distinct-values(//project/jno)
  let $bgts := //project[jno = $pid]/budget
  return <avg_bgt>{avg($bgts)}</avg_bgt>

XQ.3b
  let $bgts := //project/budget
  return <avg_bgt>{avg($bgts)}</avg_bgt>

3. ORA-SS applications – XML query with aggregates
Example (cont.)
• If we know from ORA-SS that jno is the OID or key of the project object class, i.e. jno → budget, then we can easily judge that XQ.3a is a correct XQuery expression while XQ.3b is incorrect, because some projects may appear more often than other projects in the XML document.
• If we do not know these semantics, it is difficult to say which XQuery expression is correct.
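The effect of repeated project elements on an average can be shown with a short Python sketch. It is not a translation of XQ.3a or XQ.3b; it only contrasts averaging over every project element with averaging over one budget per distinct jno, which is justified when jno → budget holds. The data and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: project J001 appears three times, J002 once.
DOC = """<psj>
  <part><supplier>
    <project><jno>J001</jno><budget>100</budget></project>
    <project><jno>J002</jno><budget>400</budget></project>
  </supplier></part>
  <part><supplier>
    <project><jno>J001</jno><budget>100</budget></project>
  </supplier>
  <supplier>
    <project><jno>J001</jno><budget>100</budget></project>
  </supplier></part>
</psj>"""

def average_over_elements(root):
    """Average over every project element; repeated projects count repeatedly."""
    budgets = [float(p.findtext("budget")) for p in root.iter("project")]
    return sum(budgets) / len(budgets)

def average_over_distinct_projects(root):
    """Average over one budget per distinct jno; relies on jno -> budget."""
    budget_by_jno = {}
    for p in root.iter("project"):
        budget_by_jno.setdefault(p.findtext("jno"), float(p.findtext("budget")))
    return sum(budget_by_jno.values()) / len(budget_by_jno)

root = ET.fromstring(DOC)
naive_avg = average_over_elements(root)              # (100+400+100+100) / 4
distinct_avg = average_over_distinct_projects(root)  # (100+400) / 2
```

The two results differ because J001 is counted three times in the first computation, which is exactly the kind of error the key semantics help to detect.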

3. ORA-SS applications – Define and validate XML views
• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join
For example, consider the following swapping operation, which exchanges the positions of supplier and part in the hierarchy:
Because price is a relationship attribute, it cannot be moved up with the supplier elements; doing so would be semantically meaningless in the resulting view.

Figure 11. Example view definition 1 (a valid view and an invalid view)

3. ORA-SS applications – Define and validate XML views
As another example, consider the following projection operation, which drops supplier from the structure:
Dropping supplier makes price and qty become multi-valued attributes, so we should apply aggregation functions to get a meaningful view.

Figure 12. Example view definition 2 (an invalid view with part, project, price and qty; and a valid view)

3. ORA-SS applications – Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS.
Query 1: Select and display the projects that do not have any suppliers located in Atlanta.
• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.

Figure 13. Screenshot of the user interface of our graphical query language

3. ORA-SS applications – XML keyword search with semantics
• Keyword search is a user-friendly way to query XML documents.
• Most existing algorithms are based on either the tree data model or the graph (digraph) data model of XML, without the semantics.

3. ORA-SS applications – XML keyword search with semantics
• Tree data model (LCA [11])
  – Lowest Common Ancestor (LCA)
    • Contains all the keywords
    • Has no descendant node containing all the keywords
• Graph (digraph) data model (BANKS [5])
  – Reduced sub-tree
    • A tree T in the graph (digraph) containing all keywords
    • No proper sub-tree of T contains all keywords
• Limitations of keyword search without semantics
  – May have difficulty in representing results
  – May return many irrelevant results

[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference, pages 505-516, 2005.
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, pages 537-538, 2005.
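The LCA condition above can be sketched in Python on path-style node labels such as 1.1.1, written here as tuples. This is a brute-force illustration under assumptions: the labels and keyword positions are made up, the function names are hypothetical, and the enumeration of all combinations is far less efficient than the algorithm of [11].

```python
from itertools import product

def common_prefix(labels):
    """Lowest common ancestor of path labels = their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def is_proper_ancestor(a, b):
    """True if label a is a proper prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a

def smallest_lcas(keyword_nodes):
    """keyword_nodes: one list of node labels per keyword.

    Returns the nodes that contain all keywords and have no descendant
    that also contains all keywords (brute force over all combinations).
    """
    candidates = {common_prefix(combo) for combo in product(*keyword_nodes)}
    return sorted(
        c for c in candidates
        if not any(is_proper_ancestor(c, other) for other in candidates)
    )

# Hypothetical positions: "xml" occurs at 1.1.2 and 1.2.1,
# "search" occurs at 1.1.3 and 1.3.1.
result = smallest_lcas([[(1, 1, 2), (1, 2, 1)], [(1, 1, 3), (1, 3, 1)]])
single = smallest_lcas([[(1, 1, 1)]])
```

Here node 1.1 is returned and the root 1 is discarded because it has a descendant that already contains both keywords; for a single keyword the matching node itself is returned.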

3. ORA-SS applications – XML keyword search with semantics
Example:
• Q1 = {Widom}
  – LCA and reduced sub-tree give node 1.1.1
  – Not enough information
• Q2 = {semistructured query processing}
  – LCA(Q2) = dblp (i.e. the whole XML database): overwhelming information
  – The reduced sub-tree results include all papers with either “semistructured” or “query processing”. However, not all “query processing” papers are about “semistructured”.

Figure 14. Example XML document 3

3. ORA-SS applications – XML keyword search with semantics
• Therefore, we propose ICA (Interested Common Ancestor) and IRA (Interested Related Ancestors) to exploit the semantics for ranked keyword search.
• Ideas:
  1. The DBA defines the set of interested object classes and the conceptual connections between objects.
     E.g. in DBLP, publication and author can be the interested object classes; references/citations can be one type of conceptual connection between publications.
     Note: we can group all publications for each author object.

3. ORA-SS applications – XML keyword search with semantics
• Ideas:
  2. The results of a keyword query include interested objects based on ICA and IRA semantics.
    – The results of ICA (Interested Common Ancestor) include all objects that each contain all query keywords.
    – The results of IRA (Interested Related Ancestors) include all object pairs (o, o’) such that
      – the pair together contains all keywords, AND
      – o and o’ are conceptually connected.
      Note: we output a list of IRA objects instead of IRA pairs.
Intuitive meaning of IRA: for the query “semistructured query processing”, if a paper P with title “query processing” cites or is cited by a paper with title “semistructured”, then P is considered related to the query; at least it is a better result than “query processing” papers that neither cite nor are cited by “semistructured” papers.
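The two result definitions can be sketched in Python on a toy set of paper objects. The paper titles, the citation link and the function names are hypothetical, and the sketch follows only the two conditions stated above (the pair covers all keywords, and the two objects are connected); ranking and any further refinements from [13] are left out.

```python
# Hypothetical interested objects: paper id -> title words.
papers = {
    "P1": "semistructured data model",
    "P2": "query processing in relational systems",
    "P3": "query processing for streams",
    "P4": "semistructured query processing",
}
# Hypothetical conceptual connection: P2 cites P1.
cites = {("P2", "P1")}

def ica(objects, keywords):
    """Objects that each contain all query keywords."""
    return sorted(
        oid for oid, text in objects.items()
        if all(k in text.split() for k in keywords)
    )

def ira(objects, links, keywords):
    """Objects from connected pairs that together contain all keywords."""
    connected = links | {(b, a) for a, b in links}  # cites or is cited by
    result = set()
    for a, b in connected:
        words = set(objects[a].split()) | set(objects[b].split())
        if all(k in words for k in keywords):
            result.update((a, b))  # output objects rather than pairs
    return sorted(result)

query = ["semistructured", "query", "processing"]
ica_result = ica(papers, query)
ira_result = ira(papers, cites, query)
```

P4 is an ICA result because it contains every keyword on its own; P1 and P2 are IRA results because together they cover the query and are linked by a citation; P3 is not returned since it has no connection to a “semistructured” paper.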

3. ORA-SS applications – XML keyword search with semantics
• Ideas:
  3. The system automatically ranks result objects for output based on the following metrics.
    – RelevanceRank
      – Intuitive meaning: for the query “semistructured query processing”, given two papers P1 and P2 containing “query processing”, if P1 cites or is cited by many “semistructured” papers whereas P2 cites or is cited by few “semistructured” papers, then P1 is considered more relevant to the query.
    – Keyword Proximity Rank (ProxRank)
      – Intuition: the smaller the number of elements in an object that directly contain all the keywords, the better the object is as a result.

3. ORA-SS applications – XML keyword search with semantics
Experimental evaluation based on DBLP
• Our approach outperforms most existing academic demos in both execution time and result quality.

Figure 15. Execution time
Figure 16. Comparison of relevant results in the top-10, 20, 30 answers among academic demos

3. ORA-SS applications – XML keyword search with semantics
Experimental evaluation based on DBLP
• Our approach is comparable or superior to the commercial systems Google Scholar and Microsoft Libra in terms of result quality, even though they can search much more web data.

Figure 17. Comparison of relevant results in the top-10, 20, 30 answers with commercial systems

3. ORA-SS applications – XML keyword search with semantics
A demo prototype of our keyword search system on DBLP data is available at http://xmldb.ddns.comp.nus.edu.sg

Figure 18. User interface of the demo system


4. Conclusion
1. We presented a data-centric XML document and showed the limitations of the current XML schema standards in representing relational semantics and constraints.

4. Conclusion
2. We have shown that semantics in XML data are crucial in many applications, such as
  • XML query optimization for content search
  • XML aggregate computation
  • XML view creation and validation
  • XML graphical query language and output
  • XML keyword search
  • etc.

4. Conclusion
3. Much of the semantic information in XML data can be expressed in ORA-SS, which is a semantically rich data model, but not in DTD or XML Schema.

References:
[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference, 2002.
[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland. Oct 7-11, 2002.
[3] C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company (1981).
[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference, pages 505-516, 2005.
[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc. 2005.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, pages 537-538, 2005.
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.

Q&A

The End