Flexible Queries over Semistructured Data. Yaron Kanza, Yehoshua Sagiv, The Hebrew University. PODS 2001
Overview of the Talk • New semantics for queries over semistructured data • New results for – Query evaluation – Query equivalence – Database equivalence (databases could be equivalent even if they are not identical!) – Transforming a database into a tree
Why is it Difficult to Formulate Queries over Semistructured Data? • The structure of the database changes frequently – Queries should be rewritten frequently • The description of the schema is large (e.g., a DTD of XML) – It is difficult to use the schema for formulating queries • Data is contributed by many users in a variety of designs – The query should deal with different structures of data • Data does not conform to a rigid schema – It is difficult to design queries
A University Scenario • Prof. C. Katz teaches the Compilers Course and the Databases Course • The T.A. of the Logic Course; the T.A. of the OS Course • Information about the Logic Course; information about the OS Course • Personal information + courses he teaches • University Website • Database
Database • Following OEM, the database is represented as a rooted labeled directed graph
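As a rough illustration of the representation above, the following Python sketch stores a rooted labeled directed graph as a root plus a list of labeled edges. The class name `LabeledGraph`, its methods, and the node ids in the toy data are assumptions made for this example; they are not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """A rooted, edge-labeled directed graph (hypothetical helper class)."""
    root: int
    edges: list = field(default_factory=list)  # triples (source, label, target)

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def nodes(self):
        found = {self.root}
        for source, _, target in self.edges:
            found.update((source, target))
        return found

    def is_rooted(self):
        """True if every node is reachable from the root."""
        seen, stack = {self.root}, [self.root]
        while stack:
            current = stack.pop()
            for source, _, target in self.edges:
                if source == current and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen == self.nodes()


# Toy data loosely modeled on a university database; the node ids are assumed.
db = LabeledGraph(root=1)
db.add_edge(1, "Course", 2)
db.add_edge(2, "Title", 6)
db.add_edge(2, "Teacher", 5)
db.add_edge(1, "Teacher", 4)
db.add_edge(4, "Course", 10)
```

Labels sit on edges rather than on nodes, so "a node with an incoming label l" simply means a node that is the target of some edge labeled l.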
[Figure: the university database graph, rooted at node 1 (University), with edges labeled Course, Teacher, Title and Name; the values include Logic, OS, Compilers, Databases, A. Cohen, B. Levi and C. Katz. Regions of the graph are marked as data about the Logic Course, data about the OS Course, and data about Prof. Katz and the two courses she teaches.] • A teacher node can be either below or above a course node • Thus, it is difficult to write a query that looks for all the teachers and their courses
Queries • Queries are represented as rooted labeled directed graphs • The nodes of the graph are considered as variables
[Figure: a query graph with root r (University), variables u, v, w and y, and edge labels Course, Teacher and Name.] • A query that finds all pairs of courses taught by the same teacher • However, if in the database, courses are descendants of teachers, the query has to be reformulated • Instead, we propose new ways of matching queries to databases
A Rigid Matching • The query root is mapped to the db root • A query edge with label l is mapped to a db edge with label l (and, hence, a path is mapped to a path) • It is the usual semantics for queries (e.g., Lorel, XML-QL, XQL, etc.) [Figure: the query root r is mapped to the database root 1; a query edge with label l between x and y is mapped to a database edge with label l between nodes 9 and 11.]
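The two conditions of a rigid matching can be checked one edge at a time. The sketch below is a minimal illustration, assuming graphs are given as lists of `(source, label, target)` triples and a mapping is a dictionary from query variables to database nodes; the function name and the toy data are hypothetical.

```python
def is_rigid_matching(query_root, query_edges, db_root, db_edges, mapping):
    """Check the rigid conditions: root to root, labeled edge to labeled edge."""
    if mapping.get(query_root) != db_root:
        return False
    db_edge_set = set(db_edges)
    return all(
        (mapping[x], label, mapping[y]) in db_edge_set
        for x, label, y in query_edges
    )


# Assumed toy database: a course with a teacher, and a teacher with a course.
db_edges = [(1, "Course", 2), (2, "Teacher", 5), (1, "Teacher", 4), (4, "Course", 10)]
# Assumed toy query: a course u and its teacher w.
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w")]

rigid_ok = is_rigid_matching("r", query_edges, 1, db_edges, {"r": 1, "u": 2, "w": 5})
rigid_bad = is_rigid_matching("r", query_edges, 1, db_edges, {"r": 1, "u": 10, "w": 4})
```

In this toy data the second mapping pairs course 10 with teacher 4, but it is rejected because the database stores that course below its teacher rather than above it.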
A Rigid Matching Example • Another Rigid Matching • This is not a rigid matching! [Figure: the university database graph annotated with the query variables u, v and w.]
A Semiflexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • The image of every query path is embedded in some database path • SCC is mapped to SCC • The last two conditions cannot be verified locally, i.e., by considering one query edge at a time [Figure: a query with root r and nodes x and y matched against a database with root 1 and nodes 9 and 11; edges are labeled l1, l2 and l3.]
A Semiflexible Matching Example • We get all the teacher-course pairs [Figure: the university database graph annotated with the query variables u, v and w.]
Another Example of a Semiflexible Matching • The SF matching gives a pair of courses taught by the same teacher • Impossible to get this pair by means of a rigid matching, since the query is a dag and the db is a tree [Figure: the university database graph annotated with the query variables u, v, w and x.]
A Flexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • An edge is mapped to two nodes on one path • Notice that a path in the query is not necessarily mapped to a path in the db [Figure: a query with root r and nodes x and y matched against a database with root 1 and nodes 9 and 11; edges are labeled l1, l2 and l3.]
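The three flexible conditions can be sketched as a per-edge test. This is only an illustration under stated assumptions: graphs are lists of `(source, label, target)` triples, and "two nodes on one path" is read as "one node is reachable from the other". All function names and the toy data are hypothetical.

```python
def reachable(edges, start):
    """All nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for source, _, target in edges:
            if source == current and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def on_one_path(edges, a, b):
    """Assumed reading of 'on one path': one node reaches the other."""
    return b in reachable(edges, a) or a in reachable(edges, b)


def incoming_labels(edges, node):
    return {label for _, label, target in edges if target == node}


def is_flexible_matching(query_root, query_edges, db_root, db_edges, mapping):
    if mapping.get(query_root) != db_root:
        return False
    for x, label, y in query_edges:
        # The image of y must have an incoming edge with the same label.
        if label not in incoming_labels(db_edges, mapping[y]):
            return False
        # The images of the two endpoints must lie on one database path.
        if not on_one_path(db_edges, mapping[x], mapping[y]):
            return False
    return True


db_edges = [(1, "Course", 2), (2, "Teacher", 5), (1, "Teacher", 4), (4, "Course", 10)]
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w")]

flexible_ok = is_flexible_matching("r", query_edges, 1, db_edges, {"r": 1, "u": 10, "w": 4})
flexible_bad = is_flexible_matching("r", query_edges, 1, db_edges, {"r": 1, "u": 2, "w": 4})
```

The first mapping is accepted even though the teacher node sits above the course node in the database; the second is rejected because nodes 2 and 4 lie on different branches.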
A Flexible Matching Example • A query edge is mapped to two db nodes on one path • This flexible matching is neither a rigid matching nor a semiflexible matching [Figure: the university database graph annotated with the query variables u, v, w, x and y.]
Differences Between the Semiflexible and Flexible Semantics • On a technical level, in flexible matchings – Query paths are not necessarily embedded in database paths – SCC’s are not necessarily mapped to SCC’s • On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence – Query paths are embedded in database paths • In the flexible semantics, this condition is relaxed: – Query edges are embedded in database paths
Inclusion • Proposition: R-MATQ(D) ⊆ SF-MATQ(D) ⊆ F-MATQ(D), where • R-MATQ(D) is the set of rigid matchings of Q w.r.t. D • SF-MATQ(D) is the set of semiflexible matchings of Q w.r.t. D • F-MATQ(D) is the set of flexible matchings of Q w.r.t. D
Verifying that Mappings are Semiflexible Matchings • Is a given mapping of query nodes to database nodes a semiflexible matching? – Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges) • In a dag query, the number of paths may be exponential – Yet, verifying is in polynomial time • In a cyclic query, the number of paths may be infinite – Yet, verifying is in exponential time
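To make the path condition concrete, here is one possible polynomial check for the restricted case where both the query and the database are acyclic. It rests on assumptions that are not stated in the talk: "embedded in a database path" is read as "all images lie on one database path", and in an acyclic database that holds exactly when the images are pairwise related by reachability. The check therefore looks at every pair of query nodes where one reaches the other, instead of enumerating paths. Function names and data are hypothetical, and the mapping is assumed to cover every query node.

```python
def reachable(edges, start):
    """All nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for source, _, target in edges:
            if source == current and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def is_semiflexible_matching_dag(query_root, query_edges, db_root, db_edges, mapping):
    """Assumes both graphs are acyclic, so the SCC condition holds trivially."""
    if mapping.get(query_root) != db_root:
        return False
    for _, label, y in query_edges:
        incoming = {l for _, l, target in db_edges if target == mapping[y]}
        if label not in incoming:
            return False
    db_reach = {node: reachable(db_edges, node) for node in set(mapping.values())}
    for x in mapping:
        for y in reachable(query_edges, x):
            a, b = mapping[x], mapping[y]
            # Images of two nodes on a common query path must be on one db path.
            if b not in db_reach[a] and a not in db_reach[b]:
                return False
    return True


# Assumed toy tree database: teacher 4 with two courses, 10 and 11.
db_edges = [(1, "Teacher", 4), (4, "Course", 10), (4, "Course", 11)]

# A dag query: two courses u and v that share a teacher w.
dag_query = [("r", "Course", "u"), ("r", "Course", "v"),
             ("u", "Teacher", "w"), ("v", "Teacher", "w")]
same_teacher = is_semiflexible_matching_dag(
    "r", dag_query, 1, db_edges, {"r": 1, "u": 10, "v": 11, "w": 4})

# A path query whose edges each fit on a db path, but whose whole path does not.
path_query = [("r", "Course", "x"), ("x", "Teacher", "y"), ("y", "Course", "z")]
split_path = is_semiflexible_matching_dag(
    "r", path_query, 1, db_edges, {"r": 1, "x": 10, "y": 4, "z": 11})
```

The second mapping satisfies the edge-by-edge test of the flexible semantics, yet it fails here because the images of x and z (nodes 10 and 11) are on different branches. Cyclic queries and databases need the SCC condition as well and are outside this sketch.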
Verifying that a Mapping is a Semiflexible Matching • Path, tree or DAG query over any database (path, tree, DAG or cyclic): PTIME • Cyclic query over a path, tree or DAG database: no matchings • Cyclic query over a cyclic database: coNP
Complexity of Query Evaluation • Not surprisingly, for both the semiflexible and flexible semantics – Data complexity is polynomial – Query complexity is exponential • But is it exponential because the result is large or because the result is hard to compute?
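A naive evaluator makes the two bounds tangible: it tries every assignment of database nodes to query variables, which is |D|^|Q| candidates, polynomial in the database for a fixed query and exponential in the query. The sketch below does this for the flexible semantics, under the same assumptions as before (edge triples, "on one path" read as reachability in either direction). It is a brute-force illustration only, not the evaluation algorithm of the talk, and all names and data are hypothetical.

```python
from itertools import product


def reachable(edges, start):
    """All nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for source, _, target in edges:
            if source == current and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def evaluate_flexible(query_root, query_edges, db_root, db_edges):
    """Enumerate all candidate mappings and keep the flexible matchings."""
    variables = sorted({query_root} | {n for x, _, y in query_edges for n in (x, y)})
    db_nodes = sorted({db_root} | {n for s, _, t in db_edges for n in (s, t)})
    reach = {node: reachable(db_edges, node) for node in db_nodes}
    incoming = {node: {l for _, l, t in db_edges if t == node} for node in db_nodes}
    candidates, matchings = 0, []
    for images in product(db_nodes, repeat=len(variables)):
        candidates += 1
        m = dict(zip(variables, images))
        if m[query_root] != db_root:
            continue
        if all(
            label in incoming[m[y]]
            and (m[y] in reach[m[x]] or m[x] in reach[m[y]])
            for x, label, y in query_edges
        ):
            matchings.append(m)
    return candidates, matchings


# Assumed toy database: teachers above courses and a course above a teacher.
db_edges = [(1, "Teacher", 4), (4, "Course", 10), (4, "Course", 11),
            (1, "Course", 2), (2, "Teacher", 5)]
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w")]

candidates, matchings = evaluate_flexible("r", query_edges, 1, db_edges)
teacher_course_pairs = sorted((m["w"], m["u"]) for m in matchings)
```

With 6 database nodes and 3 query variables the loop inspects 6^3 = 216 candidates and keeps the three teacher-course pairs (4, 10), (4, 11) and (5, 2), regardless of whether the teacher is stored above or below the course.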
Input-Output Complexity of Query Evaluation for the Semiflexible Semantics • The input consists of both the query and the database • The input-output complexity is a function of the query, the database and the result • Next slide summarizes results about the input-output complexity – Polynomial for a dag query and a tree database (or simpler cases) • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings – Exponential lower bounds for other cases
I/O Complexity for SF Semantics (lower bounds are for non-emptiness) • Path, tree or DAG query over a path or tree database: PTIME • Path, tree or DAG query over a DAG database: NP-Complete • Cyclic query over a path, tree or DAG database: result is empty • Cyclic database: NP-Complete, or NP-Hard (in P 2), depending on the query
I/O Complexity of Query Evaluation for the Flexible Semantics • Results follow from a reduction to query evaluation under the rigid semantics • Tree query – Input-Output complexity is polynomial • DAG query – Testing for non-emptiness is NP-Complete
Query Containment • Q1 ⊆ Q2 if for every database D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D • We assume that – Both queries have the same set of variables, and – All variables are “distinguished”
Query Equivalence • Useful for optimization • Given a query, equivalent queries can be created by transformations [Figure: two queries over the variables u, v and w with edge labels Teacher and Course.] • These two queries are equivalent under both the flexible and semiflexible semantics
Database Equivalence • D1 and D2 are equivalent if for all queries Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2 • Both databases must have the same set of objects and the same root
Database Transformation [Figure: two university databases over the same objects (1, 2, 3, 4, 6 and 8), with edges labeled Course and Teacher and the values Logic, Compilers, Databases, A. Cohen and C. Katz; one is a DAG and the other is a tree.] • A DAG has become a TREE! • The databases are equivalent under both the flexible and semiflexible semantics
Transforming a Database into a Tree • Reasons for transforming a database into an equivalent tree database: – Evaluation of queries over a tree database is more efficient – In a graphical user interface, it is easier to present trees than dags or cyclic graphs – Storing the data in a serial form (e.g., XML) requires no references
Transformation into a Tree • There are algorithms for – Testing if a database can be transformed into an equivalent tree database, and – Performing the transformation • For the semiflexible semantics – The algorithms are polynomial • For the flexible semantics – The algorithms are exponential
[Figure: a database graph over the objects o0 to o6 with edge labels l1 to l6, shown together with a tree over the same objects and the object sets {o0, o5}, {o0, o1, o4, o5}, {o0, o1, o2, o4, o5}, {o0, o1, o3, o4, o5} and {o0, o5, o6}.]
Complexity Analysis for Query Containment and Database Equivalence
Complexity of Query Containment • Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping is a semiflexible matching of Q1 w.r.t. Q2 • Thus, containment is – in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph – in polynomial time in all other cases • Under the flexible semantics, query containment is always in polynomial time
Complexity of Database Equivalence • For the semiflexible semantics, deciding equivalence of databases is – in polynomial time if both databases are dags – in coNP if one of the databases has cycles • For the flexible semantics, deciding equivalence of databases is polynomial in all cases
Conclusion • Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases – Querying the database even when the user is oblivious to the structure of the database – Queries are insensitive to variations in the structure of the database
Conclusion (cont’d) • Compared to languages that use regular path expressions, – Less expressive power, but – Easier to formulate queries, and – More favorable complexities for • Query evaluation, and • Query optimization
Thank You • Questions?