Representing & Querying XML With Incomplete Information. Serge Abiteboul, Luc Segoufin, Victor Vianu. Sigalit Batiashvili Sima Sigalit Kdoshim. SDBI 2001, 25 Nov 2001
Goals: • Motivation - Data on the Web - Incompleteness problem • Representation System • Refine Algorithm • Querying the Incomplete Information - CWA approach - Answer using sub-queries (CWA+OWA)
Introduction & Motivation
Data On The Web: • Irregular structure - self-describing • Only partial information is known - expiration of data - unavailable sites - modification of data, etc. This is semistructured data.
Data On The Web: Problem: data no longer fits into tables (no rigid structure). We want: to apply database-like functionality to access data on the Web. Focus: the XML-ized portion of the Web.
XML: • eXtensible Markup Language • The lingua franca of the Web • Facilitates the use of database techniques to manage Web data • Brings order - nested tags (similar to record structure) - ordered sub-elements - structure (DTD, XML-Schema). DTD (Document Type Definition): defines constraints on the XML document structure.
XML Example: <person> <name> John Smith </name> <addr> Green Field Park N.Y. </addr> <email> john.smith@infineon.com </email> </person>
View XML As Trees: <person> <name> John Smith </name> <addr> Green Field Park N.Y. </addr> <email> john.smith@infineon.com </email> </person> As a tree: person with children name (John Smith), addr (Green Field Park N.Y.) and email (john.smith@infineon.com). DTD: person → name addr email*
Webhouse: Warehouse: a collection of information from many sources. Webhouse: - a collection of Web site sources - context: XML - holds a DTD that describes the structure of the sources
Webhouse Maintenance: The webhouse is continuously enriched by exploring Web sites. Technique: crawling the Web.
Webhouse: Information held in the webhouse is never complete. Why? • Dynamic nature of Web data • Limited storage capacity • Expiration of data • Modification of data • Etc.
The Problem: Posing a query against the webhouse may yield an incomplete answer: - documents satisfying the query are missing from the webhouse - the relevant data is missing from a document
Solution: Two main approaches: • Closed World Assumption (CWA): if some information does not appear explicitly, it does not hold. - possible method: Best Effort • Open World Assumption (OWA): anything not ruled out is possible. - possible method: Fetch Data
Solution Methods: • Best Effort: answer according to the available information. • Fetch Data: query the sources for additional information to provide a complete answer.
Fetch Data: We use the Fetch Data approach. We would like to be able to define what additional resources we are looking for. How? • Define the missing portion of the data using the available information. • Thus, determine the additional exploration of Web sources.
Example Query 1: Given the DTD: catalog → product+; product → name price cat picture*; cat → subcat. Query 1: Find the name, price & subcategories of electronics products with price < $200. Query pattern: catalog / product (name, price < 200, cat = elec / subcat).
Answer to Query 1: catalog with three products: (Canon, 120, elec / camera), (Nikton, 199, elec / camera), (Sony, 175, elec / cdplayer).
Query 2: Given the DTD: catalog → product+; product → name price cat picture*; cat → subcat. Query 2: Find the name & pictures of all cameras with a picture. Query pattern: catalog / product (picture, cat = elec / subcat = camera).
Answer Strategy: We already have: the overlap of Query 1 (elec, price < $200) and Query 2 (cameras with picture), i.e. cameras with a picture and price < $200.
Answer Strategy (cont): We need: cameras with a picture and price >= $200, the part of Query 2 not covered by Query 1. • No need to query the Web for the whole query • Define the missing information • Reduce the search space
Representation System
Framework: • Define the data model - for the webhouse repository (XML data) • Define the constraint model - simplified DTD • Define the query language • Define the representation system for the incomplete information
Data Model: Tdata = <t, λ, ν>, where t is a rooted tree over a set of nodes N, λ: N → Σ is the labeling function (node labels) and ν: N → Q is the data value mapping. Example: catalog / product (name = Canon, price = 120, cat = elec / subcat = camera), product (name = Nikton, price = 199, cat = elec / subcat = camera).
Data Tree Prefix: Tdata = <t, λ, ν> is a prefix of <t', λ', ν'>.
Tree Type: DTD as regular expressions. Ttype = <Σ, r, μ>, where Σ is the set of element names, r is the root label (e.g. catalog), and μ(a) = a1^w1 a2^w2 … an^wn lists the children allowed below a node labeled a: • wi = 1: exactly one child labeled ai • wi = ?: at most one child labeled ai • wi = +: at least one child labeled ai • wi = *: zero or more children labeled ai
Tree Type Satisfaction: Ttype: catalog → product+; product → name price cat picture*; cat → subcat. Tdata: catalog / product (name = nikon, price = 199, cat = elec / subcat = camera, picture = c.jpeg). Tdata satisfies Ttype. rep(Ttype) = {Tdata : Tdata satisfies Ttype}.
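The satisfaction check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `BOUNDS` table and the dictionary encoding of the catalog tree type are hypothetical names introduced here, not notation from the talk.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A data tree node: element label, optional data value, children."""
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


# Multiplicity symbol -> (min, max) number of children; None means unbounded.
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

# Assumed encoding of the catalog tree type: root label, then for each label
# the allowed child labels with their multiplicities.
CATALOG_TYPE = {
    "root": "catalog",
    "rules": {
        "catalog": {"product": "+"},
        "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
        "cat": {"subcat": "1"},
    },
}


def satisfies(node, tree_type, is_root=True):
    """Return True if the data tree rooted at `node` satisfies `tree_type`."""
    if is_root and node.label != tree_type["root"]:
        return False
    rule = tree_type["rules"].get(node.label, {})  # leaves allow no children
    counts = Counter(child.label for child in node.children)
    if any(label not in rule for label in counts):
        return False  # a child label the type does not allow here
    for label, mult in rule.items():
        low, high = BOUNDS[mult]
        if counts[label] < low or (high is not None and counts[label] > high):
            return False
    return all(satisfies(c, tree_type, is_root=False) for c in node.children)


nikon = Node("product", children=[
    Node("name", "nikon"),
    Node("price", "199"),
    Node("cat", "elec", [Node("subcat", "camera")]),
    Node("picture", "c.jpeg"),
])
good = Node("catalog", children=[nikon])
# A product with no price violates the multiplicity 1 on price.
bad = Node("catalog", children=[Node("product", children=[Node("name", "nikon")])])
```

With these definitions, `satisfies(good, CATALOG_TYPE)` holds and `satisfies(bad, CATALOG_TYPE)` does not, so only `good` would belong to rep(Ttype).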
Prefix-Selection Query: • We defined the structure of webhouse data using tree types. • It is natural to define pattern-based queries (in tree form). • Matching is thus done by browsing the input tree. • Such a query is called a PS-query.
PS-Query Example: Tquery = q = <t, λ, cond>, where t is a rooted tree, λ is the labeling function and cond assigns constraints on data values to the nodes. Ttype: catalog → product+; product → name price cat picture*; cat → subcat. Tquery: catalog / product (name, price < 200, cat = elec / subcat).
PS-Query Answer: • Denoted q(t'), where t' is the data tree. • Consists of the prefix of this tree whose nodes match the corresponding query tree nodes.
Answer Example: For the query catalog / product (name, price < 200, cat = elec / subcat), the answer contains the products (Canon, 120, elec / camera) and (Nikton, 199, elec / camera). Note: q1(t') and q2(t') share prefixes of the data tree (the root and maybe more).
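A PS-query answer can be computed by walking the data tree and the query tree together. The sketch below is a simplified reading of the definition, assuming that a data node is kept only when its label matches, its value satisfies the condition, and every query child is matched by at least one data child. `Node`, `QNode`, `answer` and `make_product` are hypothetical names; the Olympus price and the Lamp product are made-up test data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A data tree node."""
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class QNode:
    """A query tree node: label, condition on the data value, children."""
    label: str
    cond: Callable[[Optional[str]], bool] = lambda value: True
    children: List["QNode"] = field(default_factory=list)


def answer(node, q):
    """Return the prefix of `node` matched by `q`, or None if nothing matches."""
    if node.label != q.label or not q.cond(node.value):
        return None
    kept = []
    for qc in q.children:
        matches = [m for c in node.children if (m := answer(c, qc)) is not None]
        if not matches:
            return None  # a required query child has no match below this node
        kept.extend(matches)
    return Node(node.label, node.value, kept)


def make_product(name, price, cat, subcat):
    return Node("product", children=[
        Node("name", name),
        Node("price", str(price)),
        Node("cat", cat, [Node("subcat", subcat)]),
    ])


catalog = Node("catalog", children=[
    make_product("Canon", 120, "elec", "camera"),
    make_product("Nikton", 199, "elec", "camera"),
    make_product("Sony", 175, "elec", "cdplayer"),
    make_product("Olympus", 450, "elec", "camera"),  # price is made up
    make_product("Lamp", 30, "home", "lighting"),    # made-up product
])

# Query 1: name, price and subcategory of electronics products under $200.
q1 = QNode("catalog", children=[
    QNode("product", children=[
        QNode("name"),
        QNode("price", lambda v: float(v) < 200),
        QNode("cat", lambda v: v == "elec", [QNode("subcat")]),
    ]),
])

result = answer(catalog, q1)
names = [c.value for p in result.children for c in p.children if c.label == "name"]
```

Here `names` lists the products kept in the answer prefix; the expensive camera and the non-electronics product are pruned.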
Incomplete Information: Data available: prefixes of the data tree, enriched by previous queries. Missing portion: simply define the missing information using the initial Ttype and the queries. Together they form an incomplete tree.
Conditional Tree Type: Provides extensions to tree types: 1. A tree type with a condition function on the tree nodes. 2. Allows context-dependent structure definition. How? Corresponding DTD: dealer → UsedCars | NewCars; UsedCars → ad*; NewCars → ad*; ad → model year | model. Intended structure: ads under NewCars have a model only, ads under UsedCars have a model and a year.
Specialization: Σ = {dealer, UsedCars, NewCars, ad, model, year}, Σ' = {adNew, adUsed}. Tree: dealer / NewCars (adNew* / model), UsedCars (adUsed* / model, year).
Specialization: CTtype consists of a tree type over the specialized alphabet Σ', a condition function cond and a specialization mapping from Σ' to Σ. The specialization mapping sends both adNew and adUsed to ad.
Incomplete Tree: A tree representing the incomplete information. Query 1: Find the name, price & subcategories of electronics products with price < $200. Query pattern: catalog / product (name, price < 200, cat = elec / subcat).
Incomplete Tree (cont): After Query 1 the known data is: catalog with products (Canon, 120, elec / camera), (Nikton, 199, elec / camera), (Sony, 175, elec / cdplayer). What is missing?
What Is Missing? • product1*: name, price, picture*, cat != elec / subcat: all products whose category is not electronics. • product2*: name, price >= 200, picture*, cat = elec / subcat: all electronics products with price >= 200.
Incomplete Tree T: Available information: a prefix of the full data tree (Tdata): catalog with products (Canon, 120, elec / camera), (Nikton, 199, elec / camera), (Sony, 175, elec / cdplayer). Missing information: a conditional tree type (CTtype): product1* (name, price, picture*, cat != elec / subcat) and product2* (name, price >= 200, picture*, cat = elec / subcat).
Known data after Query 2 (Find the name & pictures of all cameras with picture): catalog with product (Canon, 120, elec / camera, c.jpg), product3 (Nikton, 199, elec / camera), product3 (Sony, 175, elec / cdplayer) and product2a (Olympus, elec / camera, o.jpg). What is missing?
What Is Missing? • product1*: name, price, picture*, cat != elec / subcat: all products whose category is not electronics.
What Is Missing? • product1*: name, price, picture*, cat != elec / subcat: all products whose category is not electronics. • product2b*: name, price >= 200, picture*, cat = elec / subcat != camera: all electronics products with price >= 200 whose subcategory is not camera.
What Is Missing? • product1*: name, price, picture*, cat != elec / subcat. • product2b*: name, price >= 200, picture*, cat = elec / subcat != camera. • product2c*: name, price >= 200, no picture, cat = elec / subcat = camera: all cameras with price >= 200 and no picture.
Incomplete Tree Definition: A tree T which consists of the following: • A data tree Tdata = <t, λ, ν> - represents the known data - uses labels from Σ • A conditional tree type CTtype - represents the missing portion of the data - uses the specialized alphabet Σ' • A data labeling mapping λ' from Tdata nodes to elements of Σ' - e.g. λ'(n) ∈ {product, product3, …} for nodes n ∈ N with λ(n) = product
Rep(T) Definition: • Rep(T) is the set of trees represented by an incomplete tree T. • Each tree in Rep(T) is a possible completion of the prefix given by T, i.e. of the available data tree Tdata.
Rep(T) Definition (cont): [Figure: a Ttype for student (name, id, addr), a query q with name = shlomo, the data tree Tdata (student, name: shlomo) and the resulting Rep(T)]
Acquiring Incomplete Information • Refine Algorithm
Acquiring Incomplete Info.: • How is this done via the Web? - simply by using answers to queries • We now show this can be done against the representation system. Assumption: the input tree is a single document described by a tree type. Several documents can be merged into a single one.
Refine Motivation: • Each query posed against the webhouse defines additional constraints. • Answers to these queries help us refine the partial information. • We describe this partial information using an incomplete tree. • As the webhouse acquires more information, we want to be able to describe the current incomplete information.
Refine Motivation (cont): Missing: product2*: name, price >= 200, picture*, cat = elec / subcat: all electronics products with price >= 200.
Refine Motivation (cont): A stronger constraint refines product2 into: • product2b*: name, price >= 200, picture*, cat = elec / subcat != camera • product2c*: name, price >= 200, no picture, cat = elec / subcat = camera
Refine Algorithm: • Refines the incomplete information. Input: T: an incomplete tree; q: a PS-query; A = q(T): the answer to q. Output: T': an incomplete tree compatible with the answer A to q.
Refine Algorithm: q⁻¹(A) is the set of trees compatible with the answer A = q(T) to q. But we only need the trees that also match the incomplete tree built so far: Rep(T') = Rep(T) ∩ q⁻¹(A).
Refine Output: Defines a new incomplete tree T'. To do so we need to define: 1. CTtype, to represent the missing portion 2. Tdata, to represent the available data. Next: step 1.
Refine Algorithm - step 1: 1. Compute the conditional tree type of the negation of q, i.e. the conditional tree type for the trees on which q returns an empty answer.
Refine - step 1: 1. Compute the conditional tree type of the negation of q. For each node a of the query tree tq, introduce new types t_a, t̂_a and t*_a. Define Σ': the labels of the new types are specializations of the label of a, i.e. the specialization mapping sends t_a, t̂_a and t*_a to the label of a. Define cond': cond'(t*_a) = true, cond'(t_a) = ¬cond_q(a), cond'(t̂_a) = cond_q(a).
Refine - step 1 (cont): We defined the cond' mapping of CT' and the specialization mapping. Let's define the rules: • root' has type (t_r ∨ t̂_r), where r is the root of the query tree. • t*_a → t*_a1 … t*_an: accept everything. • t_a → t*_a1 … t*_an: accept everything below a, because the condition of q is not satisfied at a. • t̂_a: the condition of q holds at a, but one of the children a_i must not satisfy a condition of q.
Refine - step 1 Example: tq: product (picture, cat = elec / subcat = camera). Its negation tq⁻¹ is, in a simplified view, the disjunction of: • product with cat != elec • product with cat = elec / subcat != camera • product with cat = elec / subcat = camera and no picture. Negation computation complexity: O(|q|·|Σ|), where |q| is the size of the query tree and |Σ| bounds the number of children.
Refine - step 1 Example: CT: product1* (name, price, picture*, cat != elec / subcat) and product2* (name, price >= 200, picture*, cat = elec / subcat). Intersecting CT with tq⁻¹ gives CT'. Note: this intersection yields exactly the missing types product1, product2b and product2c. We show this next.
Refine - step 1 Example: CT': the product (cat != elec) branch of tq⁻¹ intersected with CT gives product1*: name, price, picture*, cat != elec / subcat.
Refine - step 1 Example: CT': the product (cat = elec / subcat != camera) branch of tq⁻¹ intersected with product2 gives product2b*: name, price >= 200, picture*, cat = elec / subcat != camera.
Refine - step 1 Example: CT': the product (cat = elec / subcat = camera, no picture) branch of tq⁻¹ intersected with product2 gives product2c*: name, price >= 200, no picture, cat = elec / subcat = camera.
Node ids Assumption: • Persistent node ids: distinct queries against an XML document return nodes with the same id iff the nodes are identical. Example: two answers that both return the product node &231 (canon) can be joined into a single product node (canon, 120, elec / camera, c.jpg).
Node ids Assumption (cont): • A crucial assumption. • Makes it possible to enrich the information about a given node through consecutive queries. • Otherwise, the representation system would be too large to handle: - it would need to be extended to keep track of the various possible ways of matching nodes returned by different queries.
Refine Output: Defines a new incomplete tree T'. To do so we need to define: 1. CTtype, to represent the missing portion 2. Tdata, to represent the available data. Next: step 2.
Refine - step 2: T'data is the join of Tdata and A. To compute it: • Nodes in both A and Tdata: compute the intersection. E.g. product. • Nodes in Tdata but not in A: the node type is specialized using the CT' we just computed. E.g. product3. • Nodes in A but not in Tdata: refinement of an existing type. E.g. product2a.
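With persistent node ids, the join of Tdata and A reduces to a merge keyed by id. The following is a minimal sketch under assumptions: each prefix is encoded as a dictionary from node id to a (parent id, label, value) tuple, and all ids other than &231 are invented for the example. It only merges the data and classifies the nodes into the three cases above; relabeling nodes with specialized types is left out.

```python
def join(tdata, answer):
    """Merge two prefixes of the same document, keyed by persistent node id."""
    merged = dict(tdata)
    for node_id, node in answer.items():
        if node_id in merged and merged[node_id] != node:
            # Same id must mean the same node under the persistent-id assumption.
            raise ValueError(f"node {node_id} differs between the two prefixes")
        merged[node_id] = node
    return merged


# Each node: id -> (parent id, label, value). Ids other than &231 are made up.
tdata = {
    "&1": (None, "catalog", None),
    "&231": ("&1", "product", None),
    "&232": ("&231", "name", "canon"),
    "&233": ("&231", "price", "120"),
}
answer = {
    "&1": (None, "catalog", None),
    "&231": ("&1", "product", None),
    "&232": ("&231", "name", "canon"),
    "&240": ("&231", "picture", "c.jpg"),
}

merged = join(tdata, answer)
in_both = sorted(tdata.keys() & answer.keys())      # nodes in both A and Tdata
only_tdata = sorted(tdata.keys() - answer.keys())   # nodes in Tdata but not in A
only_answer = sorted(answer.keys() - tdata.keys())  # nodes in A but not in Tdata
```

After the join, the product node &231 carries both the price known from the earlier query and the picture returned by the new one.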
Drawback - The Blowup Problem: Given a tree type with root and children a and b, and n queries qi (1 ≤ i ≤ n), qi = root (a = i, b = i), with empty answers. Let's follow the construction of CT^qi, where CT^qi belongs to the incomplete tree based on queries q1 … qi.
The Blowup Problem: Query q1: root (a = 1, b = 1). Incomplete tree T^q1: Tdata is empty; CT^q1 is the disjunction of root (a1, b) and root (a, b1).
The Blowup Problem: Query q2: root (a = 2, b = 2). 1. Compute the negation q2⁻¹ of q2: the disjunction of root (a2, b) and root (a, b2).
The Blowup Problem: Query q2: root (a = 2, b = 2). 2. Compute the intersection of CT^q1 and q2⁻¹ to obtain CT^q2.
The Blowup Problem: 2. Compute the intersection CT^q2 = CT^q1 ∩ q2⁻¹: root (a1,a2 ; b), root (a1 ; b2), root (a2 ; b1), root (a ; b1,b2). Continuing the computation yields: |CT^q1| = 2, |CT^q3| = 4*2 = 2^3 = 8, …, |CT^qn| = 2^n. The Refine algorithm yields a disjunction of 2^n new multiplicity statements: an exponential blowup of the representation system.
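The growth can be reproduced by expanding the intersection mechanically. In this sketch (an illustration, with `expand` a hypothetical helper name) the negation of each q_i is read as "a != i or b != i", and every disjunct of the intersection is recorded as the pair of value sets excluded for a and for b.

```python
from itertools import product


def expand(n):
    """Disjuncts of the intersection of the negations of q_1 .. q_n.

    Each disjunct is a pair (values excluded for a, values excluded for b),
    obtained by picking, for every query, which of its two conditions fails.
    """
    disjuncts = set()
    for choice in product("ab", repeat=n):
        a_excluded = frozenset(i for i, side in enumerate(choice, 1) if side == "a")
        b_excluded = frozenset(i for i, side in enumerate(choice, 1) if side == "b")
        disjuncts.add((a_excluded, b_excluded))
    return disjuncts


sizes = [len(expand(n)) for n in range(1, 6)]
```

All choices lead to distinct disjuncts, so `sizes` doubles at every step, matching the 2^n count on the slide.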
Avoiding The Blowup: We consider two ways of avoiding the exponential blowup of incomplete trees: • Extend the incomplete tree: conjunctive incomplete trees. • Put some restrictions on the tree type and the queries.
Conjunctive Incomplete Tree: So far, types were defined only as disjunctions, i.e. root → a1 b | a b1. Now define a type as a conjunction of disjunctions: root → (a1 b ∨ a b1) ∧ … ∧ (an b ∨ a bn) • ai and bi are specializations of a and b, respectively • cond(ai) = (≠ i), 1 ≤ i ≤ n • cond(bi) = (≠ i), 1 ≤ i ≤ n
Conjunctive Incomplete Tree: Without conjunction: algorithm Refine yields a disjunction of 2^n multiplicity statements. With conjunction: the incomplete information can be represented as a conjunction of only n disjunctions.
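The two representations accept the same trees; only their size differs. The sketch below checks this on a toy encoding that is an assumption of the example: a tree is reduced to the pair of values (a, b) below the root, and `conjunctive_ok` and `dnf_ok` are hypothetical helper names.

```python
from itertools import product


def conjunctive_ok(a, b, n):
    """Conjunctive form: n clauses, each saying 'a != i or b != i'."""
    return all(a != i or b != i for i in range(1, n + 1))


def dnf_ok(a, b, n):
    """Expanded form: a disjunction over all 2**n ways to pick a side per query."""
    return any(
        all((a != i) if side == "a" else (b != i) for i, side in enumerate(choice, 1))
        for choice in product("ab", repeat=n)
    )


n = 4
values = range(0, n + 2)
equivalent = all(
    conjunctive_ok(a, b, n) == dnf_ok(a, b, n) for a in values for b in values
)
```

The conjunctive check evaluates n clauses, while the expanded check may have to try up to 2^n disjuncts before accepting or rejecting a tree.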
Heuristics: To deal with the case when the incomplete tree is already too large to be practical: • Shrink the incomplete tree by asking critical additional queries that help eliminate part of the missing portion. • Lose some information: this allows trading accuracy against the size of the incomplete tree.
Acquiring Partial Information Summary: • The webhouse is acquired using answers to queries. • Each answer refines our partial information. • Partial information is described using incomplete trees. • We compute the new incomplete tree at each stage using the Refine algorithm.
Querying Incomplete Trees
Answering Queries: Remember: the known data has the following types: • product*: name, price, picture, cat / subcat • product2a: name, picture, price >= 200, cat = elec / subcat = camera: cameras with picture • product3: name, price < 200, cat = elec / subcat != camera: elec products (not cameras)
Answering Queries: Given Query 3: Find the name, price & pictures of all cameras with price < $100 that have at least one picture. Query pattern: product (name, price < 100, picture+, cat = elec / subcat = camera). We can provide a complete answer to Query 3 using the available information.
Answering Queries: Given Query 4: Find all cameras. Query pattern: product (name, price, picture*, cat = elec / subcat = camera). No complete answer is available from the known information. We can do the following: 1. Provide the complete list of cameras with price < 200. 2. Provide the complete list of cameras with a picture. 3. Tell the user there may be more cameras (that are expensive and have no pictures).
Answering Queries: • Provides an incomplete answer to the query, given the knowledge available. • No data source is accessed for further information. Next: the mediator approach: provide a complete answer, but seek only the missing information. The incomplete tree is used as a guide for the mediator.
Mediator Approach: Additional queries may have to be generated against the input document to obtain the information needed to fully answer the query. E.g. seek the Web only for cameras with price >= 200 and no picture: product (name, price >= 200, picture with multiplicity 0, cat = elec / subcat = camera).
Mediator Approach (cont): Assumption: the generated queries are local. Local queries: queries that explore the input document starting from the nodes already available. [Figure: a PS-query q, the data tree, and the incomplete tree T with its known part Tdata and a node n]
Local Query: Local PS-query: p@n, where p is a PS-query and n is a node in Tdata. [Figure: the PS-query p rooted at node n of Tdata inside the incomplete tree T]
Local Query: L = { p@n | p@n is a local query }. We want the set of queries L to collect the additional information needed to fully answer a given PS-query q. L completes the incomplete tree T if, for each data tree in rep(T), q returns the same answer on that data tree and on T', where T' is obtained by extending each node n of Tdata for which p@n ∈ L with the answer to p@n on that data tree.
Local Query: Using local queries helps us avoid redoing work already done by previous queries. We want the set of queries L to be non-redundant: 1. No node already in T is returned by a query in L. 2. No new node is returned by two distinct queries of L. 3. Queries in L should always return a non-empty answer.
Mediator Approach Conclusion: The mediator approach combines the CWA and OWA semantics. CWA: describes the missing information, i.e. some facts are not known. OWA: some data, still ignored, may exist.
Assumptions
Order: 1. The original XML documents define an order on elements. 2. The source DTD may describe the order of children at each node type. 3. Queries may use ordering in their selection patterns. Moving to the tree representation loses the original ordering. Assumption: no order is required in our representation system.
Branching: Assumption: PS-query tree patterns allow just one child with a given label. Branching: allows multiple children with the same label, e.g. root with two product children (camera, cdplayer).
Branching: Tdata: root with children a1, a2, …, an. q: a branching PS-query: root with n children labeled a, with b = 1, b = 2, …, b = n. q(T) requires the description of the n! possible ways of assigning the n values of b to a1 … an.
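The n! count comes from matching the n query branches to the n data nodes one-to-one. A short sketch, where `assignments` is a hypothetical helper name and the node names a1 … an follow the slide:

```python
from itertools import permutations
from math import factorial


def assignments(n):
    """All ways to assign the values b = 1 .. n to the nodes a1 .. an."""
    nodes = [f"a{i}" for i in range(1, n + 1)]
    return [dict(zip(nodes, values)) for values in permutations(range(1, n + 1))]


counts = {n: len(assignments(n)) for n in range(1, 7)}
```

Each dictionary in `assignments(n)` is one possibility the representation would have to describe, and `counts[n]` equals `factorial(n)`, which is why the talk restricts PS-queries to non-branching patterns.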
References: • Representing and Querying XML with Incomplete Information. Serge Abiteboul, Luc Segoufin, Victor Vianu. • Incomplete Information and XML, presentation. http://www-rocq.inria.fr/~abiteboul • A Web Odyssey: from Codd to XML. Victor Vianu. • Incomplete Information in Relational Databases. Tomasz Imielinski and Witold Lipski Jr.