Oblivious Querying of Data with Irregular Structure 1
Oblivious Querying of Data with Irregular Structure 1
Based on Several Works • Queries with Incomplete Answers – Yaron Kanza, Werner Nutt, Shuky Sagiv • Flexible Queries – Yaron Kanza, Shuky Sagiv • SQL 4 X – Sara Cohen, Yaron Kanza, Shuky Sagiv • Computing Full Disjunctions – Yaron Kanza, Shuky Sagiv 2
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 3
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 4
The Semistructured Data Model • Data is described as a rooted labeled directed graph • Nodes represent objects • Edges represent relationships between objects • Atomic values are attached to atomic nodes 5
Movie Database 1 Movie 11 Actor 21 Name 30 Mark Hamill 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Name 32 Title 26 Movie Name T. V. Series 27 Léon Natalie Portman Actor 14 13 Actor Title Year 23 24 31 Film Kyle Title Mac. Lachlan 33 Magnolia A Movie Database Example 6 28 29 Title Year 34 35 36 Twin Peaks Dune 1984
<? xml version=“ 1. 0”? > <MDB> <Movie> <Title>Star Wars</Title> <Year>1977</Year> <Actor> <Name>Mark Hamill</Name> </Actor> <Name>Harrison Ford</Name> </Actor> </Movie> … </MDB> XML that Encodes the Semistructured Data 7
What Should be the form of the Query? Movie Database 1 Movie 11 Actor 21 Name 30 Mark Hamill 8 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Name 32 Title 26 Movie Name T. V. Series 27 Léon Natalie Portman Actor 14 13 Actor Title Year 23 24 31 Film 28 Kyle Title Mac. Lachlan 33 Magnolia 29 Title Year 34 35 36 Twin Peaks Dune 1984 Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release
The movie has a year attribute Movie Database 1 Movie 11 Actor 21 Name 30 Mark Hamill 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Name 32 Title 26 Movie Name T. V. Series 27 Léon Natalie Portman Actor 14 13 Actor Title Year 23 24 31 Film 28 Kyle Title Mac. Lachlan 33 Magnolia Incomplete Data 9 The year of the movie is missing 29 Title Year 34 35 36 Twin Peaks Dune 1984
Movie Database Actor below movie Movie 11 Actor 21 Name 30 Mark Hamill 22 Name Title Harrison Ford 25 Name 32 Title 26 Movie Name T. V. Series 27 Léon Natalie Portman Actor 14 13 Actor Title Year Star 1977 Wars 31 Film 12 23 24 Movie below actor 1 Kyle Title Mac. Lachlan 33 Magnolia Variations in Structure 10 28 29 Title Year 34 35 36 Twin Peaks Dune 1984
Movie Database 1 A movie label 11 Actor 21 Name 30 Mark Hamill 11 22 Name Title 25 Name 32 Title 26 Natalie Portman Movie Name T. V. Series 27 Léon A film label 13 13 Actor Title Year Star 1977 Wars Harrison Ford Film 12 23 24 31 Actor Movie 28 Kyle Title Mac. Lachlan 34 Magnolia 29 Title Year 33 34 35 Twin Peaks Dune 1984 Dealing with ontology variations is Ontology Variations beyond the scope of this talk
Irregular Data • Data is incomplete – Missing values of attributes in objects • Data has structural variations – Relationships between objects are represented differently in different parts of the database • Data has ontology variations – Different labels are used to describe objects of the same type 12
Irregular data does not conform to a strict schema Queries over irregular data should not be rigid patterns The schema cannot guide a user in formulating a query 13
Data is contributed by many users in a variety of designs The query should deal with different structures of data The structure of the database is changed frequently Queries should be rewritten frequently The description of the schema is large (e. g. , a DTD of XML) It is difficult to use the schema when formulating queries In Which Cases is it Difficult to Formulate 14 Queries over Semistructured Data?
Can Regular Expressions Help in Querying Irregular Data? • In many cases, regular expressions can be used to query irregular data • Yet, regular expressions are – Not efficient – it is difficult to evaluate regular expressions – Not intuitive – it is difficult for a naïve user to formulate regular expressions 15
More on Using Regular Expressions • When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database – For n types of objects, there are n! possible hierarchies – For an object with n attributes, there are 2 n subsets of missing attributes 16
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 17
Queries with Incomplete Answers • We have developed queries that deal with incomplete data in a novel way and return incomplete answers • The queries return maximal answers rather than complete answers • Different query semantics admit different levels of incompleteness 18
Queries with complete answers Queries with AND Semantics Queries with Weak Semantics Queries with OR Semantics Queries with Incomplete Answers 19 Increasing level of incompleteness
Queries and Matchings • The queries are labeled rooted directed graphs • Query nodes are variables • Matchings are assignments of database objects to the query variables according to – the constraints specified in the query, and – the semantics of the query 20
Constraints On Complete Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root Query Root r 1 Database Root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l x l l y 21 12 25
Movie Database A Complete Matching 1 r Movie Producer Movie 11 Uncredited Director Actor Title Actor Year 21 22 23 24 12 Actor 27 25 x Director Title 26 29 Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 22 Movie Producer y Director Uncredited Actor z Date of birth Name u The root constraint and All the nodes are mapped all the edge constraints to non-null values are satisfied v
Movie Database 1 r Movie Producer Movie 11 Uncredited Director Actor Title Actor Year 21 22 23 24 12 Actor 27 25 x Director Title 26 29 Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 Consider the case where Node 35 is removed from the database 23 Movie Producer y Director Uncredited Actor z Date of birth Name u v No Complete Matching Exists!
Movie Database 1 r Movie Producer Movie 11 Uncredited Director Actor Title Actor Year 21 22 23 24 Star 1977 Name Wars 30 31 Mark Harrison Hamill Ford 24 Actor 27 25 NULL 12 Director Title 26 29 Name Hook 32 33 George Dustin Lucas Hoffman 34 Steven Spielberg x Movie NULL Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL This is not a matching, since the sequence of labels Every from the Not database root. Partial to Node. Assignment 31 is different from is an Incomplete Matching any sequence of labels that starts at the query root and ends in variable v
The Reachability Constraint on Partial Matchings • A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path Database are satisfied 11 r l 2 l 1 x w l 4 l 3 y 25 Query z l 5 vv l 6 5 l 1 l 2 7 l 4 l 3 8 9 l 5 l 6 55
“And” Matchings • A partial matching is an AND matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node – If a query node is mapped to a database node, all the incoming edge constraints are satisfied r x Producer y Actor Director 26 z
Movie Database 1 r Movie Producer Movie 11 Uncredited Director Actor Title Actor Year 21 22 23 24 Star 1977 Name Wars 30 31 Mark Harrison Hamill Ford 12 Actor 27 25 x Director Title 29 26 Name Hook 32 33 George Dustin Lucas Hoffman 34 Steven Spielberg An AND Matching 27 Movie Producer y Director Uncredited Actor z Date of birth Name u NULL v
Movie Database 1 r Movie Producer Movie 11 Uncredited Director Actor Title Actor Year 21 22 23 24 Star 1977 Name Wars 30 31 Mark Harrison Hamill Ford 12 Actor 27 25 x Director Title 26 29 Name Hook 32 33 George Dustin Lucas Hoffman Suppose that we remove the edges that are labeled with Uncredited Actor 28 34 Movie Producer y Director Uncredited Actor z Date of birth Name u v Steven Spielberg In an AND matching, Node z must be null!
Weak Satisfaction of Edge Constraints • Edge Constraint: • Is Weakly Satisfied if it is either • Satisfied (as defined earlier), or • One (or more) of its nodes is mapped to a null value null x 12 l l y 25 x 12 l m y 25 null 29 x l null 12 m y 25 x l y null
Weak Matchings • A partial matching is a weak matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node – Every edge constraint is weakly satisfied 30
Movie Database 1 r Movie Producer Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Wars 30 27 Mark Harrison Hamill Ford 25 Director Title 26 33 George Dustin Lucas Hoffman A Weak Matching 31 29 Name Hook 32 31 Actor x 34 Steven Spielberg Movie Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL Edges that are weakly satisfied
In a weak matching, all four options are permitted In an AND matching, only the first three options are permitted null x 12 l l y 25 x 12 l m y 25 null 32 null x x l l y m y null 12 25
Movie Database 1 r Movie Producer Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Wars 30 31 Mark Harrison Hamill Ford 27 Actor 25 Director Title 26 29 Name Hook 32 33 George Dustin Lucas Hoffman Consider the case where edges labeled with Producer are removed 33 x 34 Movie Producer y Director Uncredited Actor z Date of birth Name u v Steven Spielberg In a weak matching, Node z must be null!
“OR” Matchings • A partial matching is an OR matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node 34
Movie Database 1 r Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Wars 30 31 Mark Harrison Hamill Ford 27 Actor 25 x Director Title 29 26 Name Hook 32 33 George Dustin Lucas Hoffman 34 Steven Spielberg An OR Matching 35 Movie Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL An edge which is not weakly satisfied
Increasing Level of Incompleteness • A complete matching is an AND matching • An AND matching is a weak matching • A weak matching is an OR matching 36
Maximal Matchings • A tuple t 1 subsumes a tuple t 2 if t 1 is the result of replacing some null values in t 2 by non-null values: t 1=(1, 5, 2, null) t 2=(1, null, 2, null) Matchings are represented as tuples of oid’s and null values • A matching is maximal if no other matching subsumes it • A query result consists of maximal matchings only 37
On the Complexity of Computing Queries with Incomplete Answers • The size of the result can be exponential in the size of the input (database and query) – Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) • Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity 38
Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result. 39
The Motivation for Using I/O Complexity • Measuring the time complexity with respect to the size of the input does not separate between the following two cases: – An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input – An algorithm that does an exponential amount of work even when the query result is small • Either the algorithm is naïve (e. g. , it unnecessarily computes subsumed matchings) or the problem is hard 40
I/O Complexity of Query Evaluation (lower bounds are for non-emptiness) Query / Semantics Complete 41 Path Query PTIME Tree Query DAG Query Cyclic Query PTIME NPComplete AND PTIME NPComplete Weak PTIME OR PTIME Recent Results (PODS’ 03)
Filter Constraints • Constraints that filter the results (i. e. , the maximal matchings) • There are – Weak filter constraints (the constraint is satisfied if a variable in the constraint is null) – Strong filter constraints (all variables must be non-null for satisfaction) • Existence constraint: !x is true if x is not null 42
I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness) Query / Semantics Complete AND Weak OR 43 Path Query PTIME Tree Query DAG Query Cyclic Query PTIME NPComplete NPComplete
I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness) Query / Semantics Strong AND Weak OR 44 Path Query Tree Query DAG Query Cyclic Query PTIME NPComplete NPComplete NPComplete
Query Containment • Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers • Q 1 Q 2 if for all database D, every matching of Q 1 w. r. t. to D is subsumed by a matchings of Q 2 w. r. t. to D • Query containment (query equivalence) is useful for the development of optimization techniques 45
Containment in AND Semantics • Homomorphism between the query graphs is necessary and sufficient for containment Q 1 r l 1 l 2 y l 3 u x Q 2 r l 1 l 2 z l 3 l 4 v u p q Q 1 Q 2 l 4 v homomorphism • Deciding whether one query is contained in another is NP-Complete 46
Containment in OR Semantics • The following is a necessary and sufficient condition for query containment in OR semantics • For every spanning tree T 1 of the contained query, there a spanning tree T 2 of the containing query, such that there is a homomorphism from T 2 to T 1 – is in ΠP 2 – NP-Complete if the containee is a tree – polynomial if the container is a tree 47
Containment in Weak Semantics • Similar to containment in OR Semantics, with the following difference • Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments – A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root 48
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 49
Flexible Queries • To deal with structural variations in the data, we have developed flexible queries 50
Rigid Queries Semiflexible Queries Flexible Queries 51 Increasing level of flexibility
Example Movie Database r Actor x y Movie z A query that finds all pairs of actors that acted in the same movie However, if in the database, actors are descendents of movies, the query has to be reformulated Instead, we propose new ways of matching queries to databases 52
Rigid matchings and complete matchings are the same Returning rigid matchings is the usual semantics for queries (e. g. , XQuery, Lorel, XML-QL, etc. ) 53
Constraints On Rigid Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root Query Root r 1 Database Root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l x l l y 54 12 25
Movie Database 1 Movie Actor 21 Name 30 22 Name Title Star 1977 Wars 31 Mark Harrison Hamill Ford 25 Movie T. V. Series Name 28 Léon Title Name 32 26 34 Natalie Portman A Rigid Matching 55 14 Actor Title Year 23 24 Actor 12 11 Actor r 27 x Movie y 29 Title Year 35 36 Twin 1984 Peaks Dune Kyle Mac. Lachlan This is not a Rigid Matching
A Semiflexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • The image of every query path is embedded in some database path • SCC is mapped to SCC 56 Query Root DB Root 1 r x l y × 9 l 11
A Semiflexible Matching Query • The query root is Root The lasttotwo conditions mapped the db root r cannot be verified locally, • A query node with an i. e. , by label considering incoming l is one query at a time mapped to a dbedge node with an incoming label l • The image of every query path is embedded in some database path • SCC is mapped to SCC 57 DB Root 1 x l 9 l y 11
Movie Database 1 Movie 11 Actor 21 Name 30 22 Name Title Star 1977 Wars Mark Harrison Hamill Ford 14 Movie T. V. Series Actor Title Year 25 Name 28 Léon Title Name 32 26 Natalie Portman 34 27 x Movie y 29 Title Year 35 36 Twin 1984 Peaks Dune Kyle Mac. Lachlan The Semiflexible Matchings 58 Actor 12 23 24 31 r We get all the actor-movie pairs
r r Actor Movie y Actor x x Movie y Under semiflexible semantics, these two queries are equivalent The user does not have to know if movies are above or below actors in the database 59
Movie Database 1 Movie 11 Actor 21 Name 30 22 Name Title Star 1977 Wars Mark Harrison Hamill Ford x 14 Name 28 Léon Title Name 32 26 Natalie Portman Movie T. V. Series Actor Title Year 25 Actor 12 23 24 31 r 34 27 Actor y Movie z 29 Title Year 35 36 Twin 1984 Peaks Dune Kyle Mac. Lachlan Impossible to get this pair by means We get of apairs rigid of Another since Example a is a dag matching, the of query actors andthat theacted db in 60 Matching is Semiflexible a tree the same movie
A Flexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • An edge is mapped to two nodes on one path • Notice that a path in the query is not necessarily mapped to a path in the db 61 Query Root r DB Root 1 x l 9 l y 11
An Example of a Flexible Query r Director A director x Movie A movie of the director The director name Name y z Title Actor An actor in the movie The name of the actor 62 u v Name w The title of the movie
Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 Movie y z 29 Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 Name Title Actor flexible neither a rigid AThis query edge matching is mappedisto matching nor a semiflexible matching two db nodes on one path 63 u v Name w
Movie Database 1 r Movie Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 29 Movie y Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 In this. Why flexible matching, a producer are semiflexible matchingsis given with that he directed but did not produce preferred sometimes to flexible matchings? 64 a movie
Movie Database 1 r Movie Producer Movie 11 Producer Actor 99 Title Director Actor Year 21 22 23 24 27 x 12 Actor 25 Director Title 26 29 Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 Movie y In semiflexible semantics, the problem is solved since the image of a query path is embedded in 65 a database path
Differences Between the Semiflexible and Flexible Semantics • On a technical level, in flexible matchings – Query paths are not necessarily embedded in database paths – SCC’s are not necessarily mapped to SCC’s • On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence – Query paths are embedded in database paths • In the flexible semantics, this condition is relaxed: – Query edges are embedded in database paths 66
Increasing Level of Flexibility • A rigid matching is a semiflexible matching • A semiflexible matching is a flexible matching 67
Verifying that Mappings are Semiflexible Matchings • Is a given mapping of query nodes to database nodes a semiflexible matching? – Not as simple as for rigid matchings (no local test, i. e. , need to consider paths rather than edges) • In a dag query, the number of paths may be exponential – Yet, verifying is in polynomial time • In a cyclic query, the number of paths may be infinite – Yet, verifying is in exponential time 68
Verifying that a Mapping is a Semiflexible Matching Query / Database Path Query Tree Query DAG Query Cyclic Query No Path Database PTIME Tree Database PTIME DAG Database PTIME matchings Cyclic Database PTIME co. NP 69 matchings No
Input-Output Complexity of Query Evaluation for the Semiflexible Semantics • Next slide summarizes results about the input-output complexity – Polynomial for a dag query and a tree database (or simpler cases) • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings – Exponential lower bounds for other cases 70
I/O Complexity for SF Semantics (lower bounds are for non-emptiness) Query / Database Path Query PTIME Tree Query PTIME DAG Query Cyclic Query PTIME Result is empty Tree Database PTIME Result is empty DAG Database NPComplete Result is empty Cyclic Database NPComplete NP-Hard (in P 2) 71 Data Complexity is Polynomial in all Cases
Query Evaluation for the Flexible Semantics • The database is replaced with a relationship graph which is a graph, such that – The nodes are the nodes of the database – Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant) • The query is evaluated under rigid semantics w. r. t. the relationship graph 72
I/O Complexity of Query Evaluation for the Flexible Semantics • Results follow from a reduction to query evaluation under the rigid semantics • Tree query – Input-Output complexity is polynomial • DAG query – Testing for non-emptiness is NP-Complete 73
Query Containment • Q 1 Q 2 if for all database D, the set of matchings of Q 1 w. r. t. to D is contained in the set of matchings of Q 2 w. r. t. to D • We assume that – Both queries have the same set of variables 74
Complexity of Query Containment • Under the semiflexible semantics, Q 1 Q 2 iff the identity mapping from the variables of Q 2 to the variables of Q 1 is a semiflexible matching of Q 2 w. r. t. Q 1 • Thus, containment is – in co. NP when Q 1 is a cyclic graph and Q 2 is either a dag or a cyclic graph – in polynomial time in all other cases • Under the flexible semantics, query containment is always in polynomial time 75
Database Equivalence • D 1 and D 2 are equivalent if for all queries Q, the set of matchings of Q w. r. t. to D 1 is equal to the set of matchings of Q w. r. t. to D 2 • Both databases must have the same set of objects and the same root 76
Complexity of Database Equivalence • For the semiflexible semantics, deciding equivalence of databases is – in polynomial time if both databases are dags – in co. NP if one of the databases has cycles • For the flexible semantics, deciding equivalence of databases is polynomial in all cases 77
Database Transformation MDB 1 Actor 2 Actor 3 Actor 4 Dustin Harrison Movie Hoffman Ford Movie 6 Hook 8 Star Wars 1 Movie Hook Mark Hamill Actor 2 Dustin Hoffman Movie 6 8 Star Wars Actor 3 Harrison Ford Actor 4 Mark Hamill The databases are equivalent under both A DAG has become a TREE! the flexible and semiflexible semantics 78
Transforming a Database into a Tree • Reasons for transforming a database into an equivalent tree database: – Evaluation of queries over a tree database is more efficient – In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs – Storing the data in a serial form (e. g. , XML) requires no references 79
Transformation into a Tree • There algorithms for – Testing if a database can be transformed into an equivalent tree database, and – Performing the transformation • For the semiflexible semantics – The algorithms are polynomial • For the flexible semantics – The algorithms are exponential 80
Implementing Flexible Queries • Flexible queries were implemented in SQL 4 X • In an SQL 4 X query, relations and XML documents are queried simultaneously • A query result can be either a relation or an XML document 81
A query under the Flexible Semantics r Director x Name Movie y z Title v An SQL 4 X Query QUERY AS RELATION SELECT text(y) as director, text(v) as title FROM x Director of ‘MDB. xml’, y Name of x, z Movie of x, v Title of z 82
A query under the Flexible Semantics r Director x Name Movie y z Title v An SQL 4 X Query Constraints can be added QUERY AS RELATION SELECT text(y) as director, text(v) as title FROM x Director of ‘MDB. xml’, y Name of x, z Movie of x, v Title of x WHERE text(v) = ‘Star Wars’ 83
A query under the Flexible Semantics Film. Budgets r Director x Title Name Movie y z Title v … … Budget … … A relation with data about film budgets An SQL 4 X Query QUERY AS RELATION SELECT text(x) as director, text(v) as title, Budget FROM x Director of ‘MDB. xml’, y Name of x, z Movie of x, v Title of x, Film. Budgets WHERE text(v) = Film. Budgets. Title 84 Relations and XML Documents can be queried simultaneously
Agenda Why is is difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 85
Combining the Paradigms • In oblivious querying: – The user does not have to know where data is incomplete – The user does not have to know the exact structure of the data • The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined 86
Flexible Queries with Incomplete Answers • A flexible query w. r. t. a database is actually a rigid query w. r. t. the relationship graph • Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w. r. t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers 87
Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 29 Star 1977 Name Hook Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 88 Consider the case where Node 25 and Node 33 are removed Movie Name y z Title Actor u v Name w
Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Star 1977 Name Wars Date of 30 32 31 birth George Mark Harrison Lucas 35 Hamill Ford 14 May 1944 Director Title 26 Movie y z 29 Title Actor Hook Name 34 Steven Spielberg Name u NULL v Name w NULL A Flexible matching which is also an incomplete (maximal) matching 89
Agenda Why is is difficult to query semistructured data? Queries with incomplete answers (Qw. IA) Flexible queries (FQ) Oblivious querying = Qw. IA + FQ Using Qw. IA and FQ for information integration 90
Full Disjunction • Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples • Dangling tuples are padded with nulls • Only maximal tuples are retained in the full disjunction (as in the case of Qw. IA( 91
Movies Actors m-id title year language a-id name 1 Zelig 1983 English 1 Woody Allen 1/12/1935 2 Antz 1998 English 2 Bruce Willis 19/3/1955 3 Armageddon 1998 English 3 Julia Roberts 28/10/1967 4 Fantasia 1940 English Acted-in a-id m-id role Actors-that-Directed a-id m-id 1 date-of-birth 1 The Full Disjunction of the Given Relations Zelig 1 2 Z 2 3 Harry title year 1 Zelig 1983 English 1 Woody Allen 1/12/1935 Zelig 2 Antz 1998 English 1 Woody Allen 1/12/1935 Z 3 Armageddon 1998 English 2 Bruce Willis 19/3/1955 Harry 4 Fantasia 1940 English 3 Julia Roberts 28/10/1967 92 name 1 m-id language a-id 1 Date-of-birth role
Movies m-id title year language 1 Zelig 1983 English 2 Antz 1998 English 3 Armageddon 1998 English 4 Fantasia This tuple will not be in the full disjunction 1940 English m-id title year language a-id 1 Zelig 1983 English name Date-of-birth role The Full Disjunction of the Given Relations m-id title year language a-id name Date-of-birth role 1 Zelig 1983 English 1 Woody Allen 1/12/1935 Zelig 2 Antz 1998 English 1 Woody Allen 1/12/1935 Z 3 Armageddon 1998 English 2 Bruce Willis 19/3/1955 Harry 4 Fantasia 1940 English The does not subsumed 3 include Julia Roberts 28/10/1967 tuples 93 full disjunction
Movies Actors m-id title year language a-id name 1 Zelig 1983 English 1 Woody Allen 1/12/1935 2 Antz 1998 English 2 Bruce Willis 19/3/1955 3 Armageddon 1998 English 3 Julia Roberts 28/10/1967 4 Fantasia 1940 English Acted-in date-of-birth a-id m-id role a-id m-id 1 1 Zelig 1 1 2 Z The Full Disjunction of the Given Relations 2 3 Harry m-id 1 4 2 title Zelig Fantasia Antz year 1983 1940 1998 3 Armageddon 1998 English Actors-that-Directed 1 language English a-id 1 3 1 name Woody Allen Julia Roberts Woody Allen Date-of-birth 1/12/1935 28/10/1967 1/12/1935 role Zelig Z 2 Bruce Willis 19/3/1955 Harry The full disjunction does not include tuples that are based Fantasia 1940 English on Cartesian Product rather than join 3 Julia Roberts 28/10/1967 94 4
In the Full Disjunction of a Given Set of Relations: Every tuple of the input is a part of at least one tuple of the output Tuples are joined as in a natural join, padded with null values The result includes only “maximal connected portions” 95
Motivation for Full Disjunctions • Full disjunctions have been proposed by Galiando-Legaria as an alternative for outerjoins [SIGMOD’ 94[ • Rajaraman and Ullman suggested to use full disjunctions for information integration [PODS’ 96[ 96
Computing Full Disjunctions for γ-acyclic Relation Schemas • Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic • Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ -acyclic 97
Weak Semantics Generalizes Full Disjunctions • Relations can be converted into a semistructured database • The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under I/O complexity, even when the relation schemas are cyclic 98
Generalizing Full Disjunctions • In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join( • We can generalize full disjunctions to support constraints that are not merely equality among attributes 99
Example Movies (m-id, title, year, language, location) Actors (a-id, name, date-of-birth) Acted-in (a-id, m-id, role) Actors-that-Directed (a-id, m-id) Historical-Events (name, date, description) Historical-Sites (Country, State, City, Site) The date of the historical event is a date in the year when The filming location is near the historical site the movie was released 100
Another Way of Generalizing Full Disjunctions: Use OR-Semantics • OR-semantics is used rather than weak semantics when tuples are joined • This relaxes the requirement that every pair of tuples should be join consistent • Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent 101
Example Employees (e-id, ename, city, dept-no) Departments (dept-no, dname, building) Located-in (building, city, street) Employee: (007, James Bond, London, 6) Department: (6, MI-6, 10) Located-in: (10, Liverpool, King) e-id ename city 007 James Bond London 6 6 6 102 dept dname -no building MI-6 10 10 Liverpool King The Full Disjunction city street
Example Employees (e-id, ename, city, dept-no) Departments (dept-no, dname, building) Located-in (building, city, street) Employee: (007, James Bond, London, 6) Department: (6, MI-6, 10) Located-in: (10, Liverpool, King) e-id ename city 007 James Bond London 103 dept dname -no 6 6 MI-6 building 10 10 city street Liverpool King The Full Disjunction under OR-Semantics
Information Integration from Heterogeneous Sources Integrated Relation Query Data Source 104
We use queries that combine flexible semantics and weak semantics: -The queries are insensitive to changes in the data - Easy to formulate the query Integrated Relation Query Data Source 105
The integration of the relations is done with a full disjunction of the computed relations Integrated Relation Query Data Source 106
Conclusion • Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases – Querying the database even when the user is oblivious to the structure of the database – Queries are insensitive to variations in the structure of the database 107
Conclusion (continued( • Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases – Querying the database even when the user is oblivious to missing data – Queries return maximal answers rather than complete answers 108
Conclusion (continued( • The two paradigms of flexible queries and queries with maximal answers can be combined • The combination of the paradigms can facilitate integration of information from heterogeneous sources 109
Conclusion (continued( • Full disjunctions can be computed using queries in weak semantics • Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints 110
Thank You Questions? 111
- Slides: 111