Semistructured data -- June 2001
1 Semistructured data: from practice to theory
Serge Abiteboul, INRIA & Xyleme SA
Serge.Abiteboul@inria.fr, http://www-rocq.inria.fr/verso
Serge.Abiteboul@xyleme.com, http://www.xyleme.com

2 Organization
• Motivations
• XML
• Typing XML
• Querying XML
• XML and the Web
• Illustrations: 2 problems
  – Incomplete information
  – Xyleme
• Conclusion

3 Motivations

4 Motivation: Complex data
• Structure is irregular (missing/extra data…)
• Schema does not exist or is unknown
• Schema is rapidly evolving
• Relational and ODB models are too rigid
• Examples: BibTeX, HTML, SGML, XML, ASN.1, STEP/Express…

5 Complex data: mediation
[Figure: User, Mediator (ontology, meta-data), wrappers, sources]
• Many data sources coming and going

6 Motivations: The Web today
• Terabytes of data
• Private web: pages not publicly available
• Deep web: data hidden behind forms
• A lot of public pages
• The standard is a document/hypertext language: HTML

7 The Web today
• Browsing
• Search engines
  – in: list of words
  – out: sorted list of URLs
• Applications: hand-made wrappers
  – Expensive
  – Incomplete
  – Short-lived, not adapted to the Web’s constant changes
[Raghavan ’00]

8 A new standard: XML
• HTML is not appropriate for data exchange on the Web
• Standard database models are too constraining for the Web
• The solution: a semistructured data model, XML
  – Reminder: a data model consists of a type definition language, a query/update language, and more

9 The most successful semistructured data model: XML

10 The origin of XML
• Parents
  – SGML
  – Relational and OO databases
• SGML: markup language for documents
• HTML and the Web: billions of pages
• Not appropriate for data exchange
• XML: eXtensible Markup Language
  – W3C and most industrial companies [B2B]
  – Main idea: separate content and presentation
  – Use tags to represent structure and semantics

11 XML: documents + databases
• HTML
  – comes from SGML
  – hypertext language
  – fixed number of tags
  – content and presentation are mixed
  – very difficult to extract data from a page
  – old standard for the Web
• XML
  – also comes from SGML
  – semistructured data
  – number of tags is not fixed
  – content and presentation are not mixed
  – much easier to extract data
  – new standard for the Web

12 HTML = Hypertext Language
Information System:
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
HTML (text + presentation):
The <b>X23</b> new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $.
• Information System to HTML: hard
• Where is the data?

13 XML = Semistructured Data
Information System:
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299...
XML (data + structure):
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
• Information System to XML: easy
• Semistructured: more flexible

14 XML: example
<dealer>
  <UsedCars>
    <ad> <model>Honda</model> <year>96</year> </ad>
  </UsedCars>
  <NewCars>
    <ad> <model>Acura</model> </ad>
    <ad> <model>R406</model> </ad>
  </NewCars>
</dealer>
The same document as a tree:
dealer
  UsedCars
    ad
      model: Honda
      year: 96
  NewCars
    ad
      model: Acura
    ad
      model: R406
It is just an unranked, tagged, ordered tree
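To make the tree reading of the dealer document concrete, here is a minimal Python sketch that parses it with the standard library and lists every leaf together with its root-to-leaf path, in document order. The helper name `leaf_paths` is an assumption made for this illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

DOC = """<dealer>
  <UsedCars>
    <ad><model>Honda</model><year>96</year></ad>
  </UsedCars>
  <NewCars>
    <ad><model>Acura</model></ad>
    <ad><model>R406</model></ad>
  </NewCars>
</dealer>"""

def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every leaf element, in document order."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    children = list(elem)  # children keep their order: the tree is ordered
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:  # any number of children: the tree is unranked
        yield from leaf_paths(child, path)

for path, text in leaf_paths(ET.fromstring(DOC)):
    print(path, "=", text)
```

The two `ad` elements under `NewCars` produce the same path twice, which is exactly what "unranked" and "ordered" allow: repeated tags among siblings, distinguished only by position.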

15 XML
• Tree or graph
• Data and structure/semantics are mixed
  – Tags contain typing information
• Core constructor is a list of tag/value pairs
• Details
  – Each node may have an arbitrary number of children, whose tags may or may not be distinct
  – Nodes also have attributes, which are unordered and unique per node
  – Standard means to represent cyclic data: ID/IDREFS

16 XML
• Very active/noisy field: standards
  – types (DTD/XML Schema), style sheets (XSL), resource description (RDF...)
  – DOM, SAX…
  – WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)...
• How fast will XML conquer the Web?
  – so far rather slowly (now about 1% of the visible Web; much more in intranets); accelerating (e.g., with Explorer 5.5)

17 Typing XML

18 Typing XML
• This is heresy for the freedom of the Web
• Essential for data management: query optimization, user interfaces, applications
• Differences with standard database typing
  – Collections are sequences instead of sets
  – Types may be very large (e.g., from integration)
  – Data is more irregular, so types should be more permissive
  – Sometimes new issues: you have the data and must extract its type, or an approximate type

19 Intuition: the type is a tree
dealer
  UsedCars
    ad
      model: text
      year: text
  NewCars
    ad
      model: text
• Semantics and structure are in paths
  – dealer/UsedCars/ad/model

20 DTD: a grammar
Catalog → Product*
Product → Name Price? Cat (Part Quantity)*
Part → BasicPart + ComposedPart
BasicPart → Name
ComposedPart → Name (Part Quantity)*
• Nice and simple
• Shortcoming: type of an element is independent of its context
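One way to read the grammar is as a regular expression over the sequence of child tags of each element. The Python sketch below checks a document against the Catalog grammar in that spirit. It is an illustration under assumptions, not a real DTD validator: the table `CONTENT_MODEL`, the set `TEXT_ONLY`, and the function `is_valid` are names invented here, and Name, Price, Cat, and Quantity are assumed to hold text only.

```python
import re
import xml.etree.ElementTree as ET

# One content model per tag, written as a regex over "tag " tokens.
# The "+" of the slide (a choice) becomes "|" in regex syntax.
CONTENT_MODEL = {
    "Catalog": r"(Product )*",
    "Product": r"Name (Price )?Cat (Part Quantity )*",
    "Part": r"(BasicPart |ComposedPart )",
    "BasicPart": r"Name ",
    "ComposedPart": r"Name (Part Quantity )*",
}
# Assumption: these elements carry text and no child elements.
TEXT_ONLY = {"Name", "Price", "Cat", "Quantity"}

def is_valid(elem):
    """Check elem and its whole subtree against the content models."""
    children = "".join(child.tag + " " for child in elem)
    if elem.tag in TEXT_ONLY:
        return children == ""
    model = CONTENT_MODEL.get(elem.tag)
    if model is None or re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)

good = ET.fromstring(
    "<Catalog><Product><Name>bike</Name><Cat>sport</Cat>"
    "<Part><BasicPart><Name>wheel</Name></BasicPart></Part>"
    "<Quantity>2</Quantity></Product></Catalog>"
)
bad = ET.fromstring(
    "<Catalog><Product><Price>3</Price><Name>x</Name><Cat>c</Cat></Product></Catalog>"
)
print(is_valid(good), is_valid(bad))
```

The lookup is keyed by tag alone, which is the shortcoming named on the slide: a tag gets the same content model wherever it occurs.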

21 More complex: specialization
• Type of ad depends on its context
• One way to view it: homomorphism
[Figure: type tree dealer / UsedCars, NewCars / adused (model, year), adnew (model), where adused and adnew both map to the tag ad]

22 Regular tree automata
• Set of accepted trees: regular tree languages
• Definable in monadic second-order logic
[Figure: run of an automaton on the dealer tree, with states q0, p, q, r, s, qf on the nodes dealer, Used, New, ad, m, y]
• Acceptance: there is a computation such that all leaves are labeled qf
• Variants: top-down/bottom-up, nondeterminism, unranked trees

23 DTDs + specialization
• Result: DTDs + specialization = regular tree languages
• Closure (intersection, union, complement)
• Tests for validation, inclusion
• Static analysis
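The link between specialization and tree automata can be sketched in a few lines of Python. Each specialized type plays the role of an automaton state: it has a visible tag and a regular expression over the types of its children. A bottom-up pass computes, for every node, the set of types it may take. The table `TYPES` and the functions `possible_types` and `accepts` are hypothetical names for this illustration, and the brute-force enumeration of child types is only suitable for tiny examples.

```python
import re
from itertools import product
import xml.etree.ElementTree as ET

# type name -> (tag seen in the document, regex over child type names)
TYPES = {
    "dealer": ("dealer", r"UsedCars NewCars "),
    "UsedCars": ("UsedCars", r"(adused )*"),
    "NewCars": ("NewCars", r"(adnew )*"),
    "adused": ("ad", r"model year "),   # used-car ads must give a year
    "adnew": ("ad", r"model "),         # new-car ads give only a model
    "model": ("model", r""),
    "year": ("year", r""),
}

def possible_types(elem):
    """Bottom-up: return the set of types that elem can be assigned."""
    child_sets = [possible_types(child) for child in elem]
    result = set()
    for name, (tag, model) in TYPES.items():
        if tag != elem.tag:
            continue
        # Try every way of picking one type per child (nondeterminism).
        for choice in product(*child_sets):
            word = "".join(t + " " for t in choice)
            if re.fullmatch(model, word):
                result.add(name)
                break
    return result

def accepts(elem, root_type="dealer"):
    return root_type in possible_types(elem)

ok = ET.fromstring(
    "<dealer><UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>"
    "<NewCars><ad><model>Acura</model></ad></NewCars></dealer>"
)
ko = ET.fromstring(
    "<dealer><UsedCars><ad><model>Honda</model></ad></UsedCars>"
    "<NewCars></NewCars></dealer>"
)
print(accepts(ok), accepts(ko))
```

In the second document the used-car ad has no year, so it can only be typed `adnew`, which `UsedCars` does not allow; the same tag `ad` is thus typed differently depending on its context.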

24 Situation today
• Many people are using DTDs
  – Nice and simple in spite of ugly syntax
• New proposal: XML Schema
  – More powerful but too complicated?
• Other proposals: RELAX, TREX
  – Usually based on some kind of regular tree automata
• From experience: one will win, and not necessarily the best

25 Query languages for XML

26 Query Languages for XML
• Extensions of SQL
  – first-order logic
  – Information retrieval: keyword search
  – Navigation via regular expressions + pattern matching
  – Lorel, XML-QL, XMAS…
• Structural recursion
  – UnQL, XSLT…
• No official winner
  – leader is XQuery

27 Pattern matching
• Tree with variables and constraints
• Pattern matching between the query and the data
• Each match provides a valuation for X, Y, Z
[Figure: pattern tree catalog / product with children name (X), price (Y, <200), cat (=elec), subcategory (Z)]
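As a rough illustration of how a match yields a valuation, the Python sketch below hard-codes the pattern of the figure (product with name X, price Y below 200, cat equal to elec, subcategory Z) and evaluates it on a small catalog. The catalog content and the function `match_products` are assumptions made for this example; a real engine would interpret arbitrary pattern trees instead of one fixed pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical data, invented for the example.
CATALOG = """<catalog>
  <product><name>canon</name><price>120</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>nikon</name><price>250</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>chair</name><price>80</price><cat>furniture</cat><subcategory>seat</subcategory></product>
</catalog>"""

def match_products(catalog):
    """Yield one valuation {X, Y, Z} per product that matches the pattern."""
    for prod in catalog.findall("product"):
        name = prod.findtext("name")
        price = prod.findtext("price")
        cat = prod.findtext("cat")
        sub = prod.findtext("subcategory")
        if None in (name, price, cat, sub):
            continue  # the pattern requires all four children
        if cat == "elec" and float(price) < 200:
            yield {"X": name, "Y": float(price), "Z": sub}

valuations = list(match_products(ET.fromstring(CATALOG)))
print(valuations)
```

Only the first product satisfies both constraints, so the query has a single valuation binding X, Y, and Z.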

28 Example in Lorel
select <offer> Z/name, P’/price </offer>
from P in catalog/product, Z in discount_stores/store, Z/storecatalog/product P’
where P/category = “camera” and P/make = “canon” and P’/id = P/id
• Joins like in relational databases
• Construction of complex results
• Regular expressions for paths (e.g., W/*/name = “Gates”)
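For readers more familiar with programming languages than with Lorel, the join and the result construction of this query can be mimicked in Python over nested dictionaries. This is only an analogy under assumptions: the sample data, the store names, and the function `offers` are invented here, and nothing below reflects how a Lorel engine evaluates the query.

```python
# Hypothetical data standing in for catalog/product and discount_stores/store.
catalog = [
    {"id": 1, "category": "camera", "make": "canon"},
    {"id": 2, "category": "camera", "make": "nikon"},
]
discount_stores = [
    {"name": "ShopA", "storecatalog": [{"id": 1, "price": 110.0}, {"id": 2, "price": 180.0}]},
    {"name": "ShopB", "storecatalog": [{"id": 1, "price": 115.0}]},
]

def offers(products, stores):
    """Join catalog products with store products on id, build <offer> results."""
    return [
        {"offer": {"name": z["name"], "price": p2["price"]}}  # constructed result
        for p in products                    # P in catalog/product
        for z in stores                      # Z in discount_stores/store
        for p2 in z["storecatalog"]          # P' ranges over Z/storecatalog/product
        if p["category"] == "camera"
        and p["make"] == "canon"
        and p2["id"] == p["id"]              # the join condition
    ]

print(offers(catalog, discount_stores))
```

The three nested loops correspond to the three range variables of the from clause, and the dictionary built for each match plays the role of the constructed `<offer>` element.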

29 What is new in XML queries
• A bit new: limited recursion (like in deductive databases)
• A bit new but no big deal: constructed answers (like in OODB)
• Very new: ordered data
• Bothersome
  – Theoretical base is a bit messy: FO, tree automata, bisimulation
  – No yardstick like relational calculus/algebra

30 Proposal: k-pebble transducers [Milo, Suciu, Vianu]
[Figure: transducer with a stack]

31 k-pebble transducers: result
[Figure: tree with a node labeled root and nodes labeled a, b, c]

32 XML and the Web

33 Why it is the same old story
• Massive amounts of data
• Providers export data, users access data
• Query languages, indexing, optimization
• Database paradigm: still effective on the Web

34 Why it is not the same old story
Databases
• rigid structure
• transactions, concurrency control
• data independence
• controlled (e.g., known cost model)
• coherent system, very polished artifact
The Web
• flexible, no schema
• flexible protocols
• fuzzy separation
• perfect mess (and that’s why people like it?)
• closer to a natural ecosystem!

35 The principles of the Web
• The uncertainty principle: you can never be sure of anything or that the data is consistent
• The incompleteness principle: they do not give you all the data you want (but some you don’t want :-)
• The chaos principle: you can rarely assume the existence of some global schema
• The instability principle: everything keeps changing
Every piece of data you get is probably wrong, incomplete, does not conform to its expected type, and is probably already stale

36 What can be reused?
• Some technology? Indexes, B-trees, distributed query processing (concurrency control and transactions not yet)
• Database theory? Little
  – Algebra and rewrite rules for optimization
  – Dependency theory
  – First-order and other logics
• It seems that the ordering opens the gates for many more tools, such as regular/tree languages

37 Metaphor [AV]: the Web is infinite
• What are the pages pointing to my homepage?
  – Google solution: milliseconds, stale data
  – Freeze the Web: weeks to get an exact answer
  – Exact answer: no means to get it
• Leads us to reconsider the notion of computation

38 Computability
• Finitely computable: give the answer in finite time
  – All pages reached from my homepage (HP) in less than 3 links
• Eventually computable: each solution is given in finite time; computation may be infinite
  – All pages reached from my HP
• Not computable
  – Can my HP be reached starting from my HP?
• Also: approximate, partial, stale, pipelined answers
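The difference between the first two notions can be sketched with a Python generator doing a breadth-first traversal. The function `links` below models a made-up infinite Web (page n links to pages n + 1 and 2n) and is purely an assumption for the example, as is the name `reachable`. With a depth bound the enumeration stops; without it the loop never ends, yet every reachable page is produced after finitely many steps.

```python
from collections import deque
from itertools import islice

def links(page):
    """Hypothetical infinite Web: page n links to pages n + 1 and 2 * n."""
    return [page + 1, 2 * page]

def reachable(start, max_links=None):
    """Breadth-first enumeration of the pages reachable from start.

    max_links set: terminates (finitely computable).
    max_links None: may run forever, but each reachable page is yielded
    after a finite number of steps (eventually computable).
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        yield page
        if max_links is not None and depth >= max_links:
            continue  # do not follow links beyond the bound
        for target in links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))

# "In less than 3 links" means at most 2 links; the start page is included.
print(sorted(reachable(1, max_links=2)))
# The unbounded query can only be consumed lazily, a few answers at a time.
print(list(islice(reachable(1), 5)))
```

Breadth-first order matters here: a depth-first traversal of an infinite Web could follow one chain of links forever and never output some reachable pages.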

39 Tough life: the Web is huge
• Relational calculus/algebra: logspace data complexity (also AC0)
• What is the data complexity of an XQuery query over the Web?
• Complexity of computing on the Web
  – Logspace in the Web?
  – Need to trade quality for performance

40 The Web keeps changing
• Classical: versions, temporal queries
• Less classical: monitoring of the Web [Xyleme]
  – Smart crawling of the Web: flow of docs
  – Query subscription: query on this flow
  – Continuous queries
• What is the underlying theory?

41 Illustration: incomplete information
Work with Victor Vianu

42 Example
Access to an electronic catalog
• Q1: name, subcat, price of electronic products with price less than $200
• Q2: name, pictures of cameras pictured at least once

43 [Figure: catalog tree returned by Q1, with products canon (120, elec), nikon (199, elec), and sony (175, elec), subcategories camera and cdplayer, and a starred node for the missing products]
Q1: name, subcat, price of electronic products with price less than 200

44 Missing data after Q1
[Figure: type tree for the missing data, with node labels product, name, price (>200), cat (=elec, !=elec), picture, subcategory]

45 [Figure: catalog tree after Q2, with known products (canon 120 elec, nikon 199 elec, sony 175 elec, akai elec), pictures c.jpg and a.jpg, subcategories camera and cdplayer, and starred nodes for the missing products]
Q2: name, pictures of cameras pictured at least once

46 Missing data and known data
[Figure: type tree after both queries, separating known data from missing data for product 1, product 2a, 2b, 2c, and product 3, with constraints such as price >200, cat =elec, cat !=elec, subcategory !=camera, and no picture]

47 After two queries
• Known information
  – Prefix of the real data tree
• Missing information
  – Complex type
• Q3: name, price, pictures of cameras costing less than $100 and pictured at least once
  – can be completely answered using A1, A2
• Q4: list all cameras
  – can be partially answered using A1, A2

48 Illustration: Xyleme

49 A dynamic warehouse of Web data
• Warehouse
  – Xyleme stores huge quantities of data (terabytes)
  – Xyleme is not a search engine (only index) or a mediator (only virtual data)
• XML
  – Xyleme is focused on XML
• Dynamic
  – Xyleme is interested in data evolution/changes

50 Technical Challenges
1. Data Acquisition and Maintenance: discover data of interest and keep it up to date
2. Repository: store and index this data
3. Efficient Query Processing
4. Semantic Integration: provide a simple view of each semantic domain
5. Change Control: monitor the Web and offer services such as Query Subscription

51 Technical challenges
• Scale to the Web
• Size of data: billions of pages
• Size of index: terabytes
• Number of customers
  – thousands of simultaneous queries
  – millions of subscriptions

52 Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic Integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD

53 Discover the Domains
Cluster DTDs sharing similar « tags », using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.), to obtain domains
[Figure: many concrete DTDs (cdtd1 to cdtd10) grouped under fewer abstract DTDs (adtd1, adtd2, adtd4)]
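To give a feel for tag-based clustering, here is a deliberately simple Python sketch. It does not use frequent item sets or linguistic tools as the slide describes; it swaps in a greedy grouping by Jaccard similarity of tag sets, which is only a stand-in. The DTD names mirror the figure, but the tag sets, the threshold, and the functions `jaccard` and `cluster_dtds` are assumptions made for the example.

```python
def jaccard(a, b):
    """Similarity of two tag sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_dtds(dtd_tags, threshold=0.5):
    """Greedy clustering: put each DTD in the first cluster it resembles."""
    clusters = []  # each cluster: {"tags": set of tags, "members": [names]}
    for name, tags in dtd_tags.items():
        for cluster in clusters:
            if jaccard(tags, cluster["tags"]) >= threshold:
                cluster["members"].append(name)
                cluster["tags"] |= tags  # the cluster vocabulary grows
                break
        else:
            clusters.append({"tags": set(tags), "members": [name]})
    return [cluster["members"] for cluster in clusters]

# Hypothetical tag sets for four concrete DTDs.
dtd_tags = {
    "cdtd1": {"movie", "title", "director", "actor"},
    "cdtd2": {"movie", "title", "actor", "year"},
    "cdtd3": {"painting", "title", "artist", "museum"},
    "cdtd4": {"painting", "artist", "museum", "year"},
}
print(cluster_dtds(dtd_tags))
```

Each resulting group would correspond to one candidate domain. Tags such as `title` or `year` that occur in unrelated domains show why raw tag overlap alone is fragile and why the slide also calls for linguistic tools.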

54 Answering queries
• Choose an abstract DTD (ADTD)
  – Automatically, manually, hybrid
• For each concrete DTD in a domain
  – Find how it relates to the abstract DTD
  – Mappings between paths in both
• Distributed query processing (cluster of PCs)
  – Many concrete DTDs; often not possible to compute a static execution plan
  – Dynamic generation of execution plans [Cluet et al]
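The idea of path mappings can be illustrated with a small Python sketch: a query is posed on a path of the abstract DTD and is translated, per concrete DTD, into the corresponding concrete path before being evaluated on each document. The mapping table `MAPPINGS`, the DTD names, the sample documents, and the function `answer` are all hypothetical; this is not a description of Xyleme's actual query processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings: abstract path -> concrete path, for each concrete DTD.
MAPPINGS = {
    "cdtd1": {"movie/title": "film/name", "movie/director": "film/directed_by"},
    "cdtd2": {"movie/title": "picture/title"},
}

def answer(abstract_path, documents):
    """Evaluate an abstract path over documents given as (dtd name, root)."""
    results = []
    for dtd, root in documents:
        concrete = MAPPINGS.get(dtd, {}).get(abstract_path)
        if concrete is None:
            continue  # this concrete DTD has nothing for the abstract path
        root_tag, _, rest = concrete.partition("/")
        if root.tag != root_tag:
            continue
        # Assumes the concrete path has at least one step below the root.
        results.extend(elem.text for elem in root.findall(rest))
    return results

documents = [
    ("cdtd1", ET.fromstring("<film><name>Brazil</name><directed_by>Gilliam</directed_by></film>")),
    ("cdtd2", ET.fromstring("<picture><title>Alien</title></picture>")),
]
print(answer("movie/title", documents))
print(answer("movie/director", documents))
```

Because the translation depends on which concrete DTD each document follows, the work done per document is only known at run time, which echoes the slide's point about dynamic execution plans.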

55 Conclusion

56 One Question Only
• The Web is turning from a large collection of documents into a huge knowledge base
• When will I be able to get the precise knowledge I need?
Database + Knowledge Base + Linguistics + ...