Typing semistructured data Serge Abiteboul 2008 Master Informatique
Master Informatique, Typing semistructured data, 10/9/2007
Organization
• Motivations
• Automata
– Automata on words
– Ranked tree automata
– Unranked tree automata
– Automata and monadic second-order logic
– Automata to compute
• XML typing: DTD, XML Schema
• Graphs and bisimulation
Motivation
XML typing
• Not compulsory
• Simplify writing software for XML
– Improve interoperability between programs
• Improve storage and performance
• Ease querying: data guide
• Simplify data protection
– Reject illegal updates
– Like relational dependencies
Improve storage
[Figure: lower-bound schema graph with nodes Root, Company, Employee and edges company, person, works-for, managed-by, c.e.o., name, address, with string leaves]
• Lower-bound schema
• Store the rest in an overflow graph
Improve performance
[Figure: schema graph for Bib with paper and book entries; labels include year (int), journal, title, author (first, last), address (zip, city, streetname), with string leaves]
• Query: select X.title from Bib._ X where X.*.zip = "12345"
• Using the schema: select X.title from Bib.book X where X.address.zip = "12345"
Type checking
• Who checks
– XML editor: check that the data conforms to its type
– XML exchange, e.g., with a Web service
  • Server: when delivering the data
  • Client/application: when receiving it
• Dynamic verification: after the data is produced
• Static verification: verification of the program that generates the data
Static verification
• Input: input type T and code of function f
– f is XQuery, XPath, XSLT, etc.
• Verification of T’
– Is it true that for all d ⊨ T, f(d) ⊨ T’?
• Type inference
– Find the smallest T’ such that for all d ⊨ T, f(d) ⊨ T’
• Quickly becomes undecidable because of “joins”
Example
for $p in doc("parts.xml")//part[color="red"]
return <part>
         <name>$p/name</name>
         <desc>$p/desc</desc>
       </part>
Result type: (part (name (string) desc (any)))*
If the type of parts.xml//part/desc is string: (part (name (string) desc (string)))*
Difficulty
for $X in Input, $Y in Input do { print(<b/>) }
Input: <a/>
Result: <b/>
Problem: { b^i | i = n^2 for n ≥ 0 } cannot be described in XML schema
There is no « best » result: b* + b 2 + b 4 + b 9 b* …
Why tree automata?
• XML = unranked trees
• No theory for XML
• Rich theory for strings: automata
• Extends to a rich theory for ranked trees: tree automata
– Nice algorithms
– Nice theorems
– Can this carry over to unranked trees and XML?
• Yes!
From strings to trees
[Figure: a word over {a, b}, a binary tree, and an unranked tree]
• Word: finite state automata
• Binary tree: ranked tree automata
• Unranked tree: unranked tree automata, no bound on the number of children
Only unranked tree automata?
• Missing practical gadgets
• Complexity of verification
– Goal: typing at reasonable cost
• Unranked tree automata + …
Automata on words
Finite state automata on words
• Alphabet
• States
• Initial state
• Accepting states
• Transitions
Nondeterministic automaton: Example
[Figure: runs of a nondeterministic automaton with states q0, q1, q2 on words over {a, b}; one run fails (KO), another succeeds (OK)]
Reminder
• Deterministic
– No ε-transition
– No alternative transitions (two transitions for the same state and letter)
• Determinization
– It is possible to obtain an equivalent deterministic automaton
– State of the new automaton = set of states of the original one
– Possible exponential blow-up
• Minimization
• Limitations
– Cannot recognize context-free languages in general
• Essential tool
– e.g., lexical analysis
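
The determinization step recalled above (state of the new automaton = set of states of the original one) can be sketched as follows. This is a minimal illustration, not taken from the slides: the function names and the example automaton (words over {a, b} ending in "ab") are assumptions made for the example.

```python
from itertools import chain

def determinize(alphabet, delta, start, accepting):
    """Subset construction; delta maps (state, letter) to a set of states."""
    start_set = frozenset([start])
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for letter in alphabet:
            # The successor is the union of the successors of each member.
            target = frozenset(chain.from_iterable(
                delta.get((q, letter), ()) for q in current))
            dfa_delta[(current, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepting = {s for s in seen if s & set(accepting)}
    return start_set, dfa_delta, dfa_accepting

def dfa_accepts(dfa, word):
    state, delta, accepting = dfa
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting

# Hypothetical NFA: q0 loops on a and b, guesses the final "ab" via q1, q2.
NFA_DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
DFA = determinize("ab", NFA_DELTA, "q0", {"q2"})
```

Only subsets reachable from the start set are built, so the exponential blow-up shows up only when the language forces it.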
Reminder (2)
• L(A) = set of words accepted by automaton A
• Regular languages
• Can be described by regular expressions, e.g. a(b+c)*d
• Closed under complement
• Closed under union, intersection
– Product automaton with states (s, s’) where s is from A and s’ is from A’
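
The product construction behind closure under union and intersection can be sketched by running two automata in lockstep. The sketch assumes complete deterministic automata given as (start, transitions, accepting) triples; the two example automata are hypothetical.

```python
def product_accepts(dfa1, dfa2, word, mode="intersection"):
    """Run two complete DFAs in lockstep; the product state is the pair (s, t)."""
    (s, delta1, accepting1), (t, delta2, accepting2) = dfa1, dfa2
    for letter in word:
        s, t = delta1[(s, letter)], delta2[(t, letter)]
    if mode == "intersection":
        return s in accepting1 and t in accepting2
    return s in accepting1 or t in accepting2  # union

# Hypothetical examples over {a, b}: even number of a's, and words ending in b.
EVEN_A = ("e", {("e", "a"): "o", ("e", "b"): "e",
                ("o", "a"): "e", ("o", "b"): "o"}, {"e"})
ENDS_B = ("n", {("n", "a"): "n", ("n", "b"): "y",
                ("y", "a"): "n", ("y", "b"): "y"}, {"y"})
```

The only difference between union and intersection is which pairs (s, s’) count as accepting.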
Automata on words versus trees
• Words: left to right or right to left, no difference
• Trees: bottom-up or top-down, differences
Automata on ranked trees
Binary tree automata
• Parallel evaluation, bottom-up
• For leaves: the state is determined by the label
• For other nodes: the state is determined by the label and the states q, q’ of the two children
[Figure: a bottom-up run on a binary tree with labels a, b and states q, q’, q”, q1, q2]
Bottom-up tree automata
• Bottom-up: if a node labeled a has its children in states q, q’, then the node moves nondeterministically to state r or r’
• Accepts if the root is in some state in F
• Not deterministic if there are alternatives or ε-transitions
Example: deterministic bottom-up
Boolean circuit evaluation
[Figure: bottom-up evaluation of a Boolean circuit tree with 0/1 leaves and Boolean connectives at inner nodes; the run ends with OK]
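
Boolean circuit evaluation is a natural deterministic bottom-up automaton. The sketch below assumes two states q0 and q1 (the value of the subcircuit), trees written as nested tuples, and node labels "and" and "or"; these encodings are illustrative choices, not the slide's notation.

```python
# Trees are tuples: ("0",), ("1",), ("and", left, right), ("or", left, right).
STATES = ("q0", "q1")
LEAF_RULES = {"0": "q0", "1": "q1"}

# One transition per (label, left state, right state): deterministic.
NODE_RULES = {}
for left in STATES:
    for right in STATES:
        NODE_RULES[("and", left, right)] = (
            "q1" if (left, right) == ("q1", "q1") else "q0")
        NODE_RULES[("or", left, right)] = (
            "q0" if (left, right) == ("q0", "q0") else "q1")

def run(tree):
    """Return the state reached at the root of the tree."""
    label, *children = tree
    if not children:
        return LEAF_RULES[label]
    left, right = (run(child) for child in children)
    return NODE_RULES[(label, left, right)]

def accepts(tree):
    # F = {q1}: the circuit evaluates to true.
    return run(tree) == "q1"
```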
Regular tree language = set of trees accepted by a bottom-up tree automaton
Regular tree languages
The following are equivalent
– L is a regular tree language
– L is accepted by a nondeterministic bottom-up automaton
– L is accepted by a nondeterministic top-down automaton
Deterministic top-down is weaker
Top-down tree automata
• Top-down: if a node labeled a is in state q”, then its left child moves to state q (right to q’)
• Accepts if all leaves are in states in F
• Not deterministic if there are alternative transitions for the same label and state
Why is deterministic top-down weaker?
• Consider the language
– L = { f(a, b), f(b, a) }
• It can be accepted by a bottom-up TA
– Exercise: write a BUTA A such that L = L(A)
• Suppose that B is a deterministic top-down TA with L = L(B)
– Exercise: show that B also accepts f(a, a)
– A contradiction
Fact: no deterministic top-down tree automaton accepts L
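
One possible answer to the first exercise, as a sketch: a bottom-up automaton that remembers which leaf it saw. The state names qa, qb, ok and the tuple encoding of trees are assumptions made for the illustration.

```python
# Trees are tuples: ("a",), ("b",), ("f", left, right).
LEAF = {"a": "qa", "b": "qb"}
# Only the two mixed combinations lead to the accepting state.
NODE = {("f", "qa", "qb"): "ok", ("f", "qb", "qa"): "ok"}
FINAL = {"ok"}

def state(tree):
    """State reached at the root, or None when no transition applies."""
    label, *children = tree
    if not children:
        return LEAF.get(label)
    return NODE.get((label, *[state(child) for child in children]))

def accepts(tree):
    return state(tree) in FINAL
```

A deterministic top-down automaton must send a fixed state to each child of f before seeing the leaves, which is why it cannot separate f(a, b) and f(b, a) from f(a, a).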
Ranked tree automata: Properties
• Like for words, only with higher complexity
• Determinization
• Minimization
• Closed under
– Complement
– Intersection
– Union
But…
• XML documents are unranked
• The kind of things we want to do: book (intro, section*, conclusion)
Automata on unranked trees
Unranked tree automata
Issue: represent an infinite set of transitions
Solution: a regular language
Unranked tree automata (2)
• Rule: a label a, a regular expression Q over states, and target states r1, …, rm
• Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1, …, rm}
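
A small sketch of such rules, restricted to the deterministic case (one target state per rule) and using Python regular expressions over words of states. The rule set encodes the earlier book (intro, section*, conclusion) example; the state names and the convention of writing each state followed by a space are assumptions made for the illustration.

```python
import re

# Hypothetical rules: (label, regex over child states, target state).
# A word of states is written with each state followed by one space.
RULES = [
    ("intro", r"", "qi"),
    ("section", r"", "qs"),
    ("conclusion", r"", "qc"),
    ("book", r"qi (qs )*qc ", "qbook"),
]
FINAL = {"qbook"}

def state(tree):
    """Trees are (label, [children]); returns None when no rule applies."""
    label, children = tree
    child_states = [state(child) for child in children]
    if None in child_states:
        return None
    word = "".join(s + " " for s in child_states)
    for rule_label, pattern, target in RULES:
        if rule_label == label and re.fullmatch(pattern, word):
            return target
    return None

def accepts(tree):
    return state(tree) in FINAL
```

The regular expression stands for the infinite set of transitions book(qi, qc), book(qi, qs, qc), book(qi, qs, qs, qc), and so on.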
Building on ranked trees
[Figure: an unranked tree with labels a, b and its binary encoding]
• Ranked tree: FirstChild-NextSibling
• F: encoding into a ranked tree
• F is a bijection
• F⁻¹: decoding
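
The FirstChild-NextSibling encoding F and its inverse can be sketched in a few lines. The representations are assumptions for the example: an unranked tree is (label, [children]) and a binary tree is (label, first_child, next_sibling), with None for a missing subtree.

```python
def encode(forest):
    """Encode a list of sibling trees as one binary tree (None if empty)."""
    if not forest:
        return None
    (label, children), *siblings = forest
    # Left subtree: the first child; right subtree: the next sibling.
    return (label, encode(children), encode(siblings))

def decode(binary):
    """Inverse of encode: rebuild the list of sibling trees."""
    if binary is None:
        return []
    label, first_child, next_sibling = binary
    return [(label, decode(first_child))] + decode(next_sibling)

TREE = ("a", [("b", []), ("b", [("a", [])]), ("b", [])])
ENCODED = encode([TREE])
```

Decoding the encoding gives back the original tree, which is the bijection property stated above.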
Building on bottom-up ranked trees (2)
• For each unranked TA A, there is a ranked TA accepting F(L(A))
• For each ranked TA A, there is an unranked TA accepting F⁻¹(L(A))
• Both are easy to construct
Consequence: unranked TA are closed under union, intersection, complement
Determinization
• Determinization is always possible for bottom-up
• Can we use the FirstChild-NextSibling encoding?
– No: it does not preserve determinism
Top-down?
• This is more delicate
• Transition: (a, q) ↦ A(a, q), a word automaton
– The state of the automaton A(a, q) when reading the labels of the children of a node labeled a determines the states of the children of that node
– Accepts if all the leaves are in accepting states
Boolean circuit evaluation
[Figure: top-down run on a Boolean circuit tree with 0/1 leaves]
• The automaton rejects if some leaf is neither 0 in state q0 nor 1 in state q1
• Otherwise the tree is accepted
Automata and monadic second-order logic
Monadic second-order logic
• Representation of a tree as a logical structure
[Figure: a tree with nodes 1 to 9 labeled a1, b2, b3, a4, b5, b6, b7, a8, b9]
E(1, 2), E(1, 3), …, E(3, 9)
S(2, 3), S(3, 4), S(4, 5), …, S(8, 9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
Monadic second-order logic
[Formula: MSO syntax, with set variables and quantification over set variables]
Example of MSO
• Each a node has a b-descendant
• As a formula: for each node x labeled a, every set X that contains x and is closed under descendants contains some y labeled b
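
The sentence above can be written out as an MSO formula. The rendering below is a reconstruction, not the slide's own formula; it assumes the child relation E and the label predicates a, b of the logical structure introduced earlier, and expresses "closed under descendants" as closure under E.

```latex
\forall x \, \Bigl( a(x) \rightarrow
  \forall X \, \bigl(
    ( x \in X \wedge \forall y \, \forall z \, ( ( y \in X \wedge E(y, z) ) \rightarrow z \in X ) )
    \rightarrow \exists y \, ( y \in X \wedge b(y) )
  \bigr)
\Bigr)
```

The smallest set X satisfying the premise is x together with its descendants, so the formula holds exactly when one of these nodes is labeled b; since x itself is labeled a, that node is a proper descendant.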
Bridge
Theorem: for a set L of trees, the following are equivalent
1. L = L(A) for some bottom-up tree automaton A, i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies φ} for some MSO formula φ, i.e. L is definable in MSO
XML typing: DTDs
DTD
• Describe the children of a node with label a by a regular expression
• Bizarre syntax
<!ELEMENT populationdata (continent*)>
<!ELEMENT continent (name, country*)>
<!ELEMENT country (name, province*)>
<!ELEMENT province (name, city*)>
<!ELEMENT city (name, pop)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT pop (#PCDATA)>
DTD and determinism
• Regular expressions in DTD should be deterministic
– Complicated definition
• Intuition: the corresponding automaton should be deterministic
– (a+b)*a is not
– When reading <a>, one cannot tell whether it is an a from (a+b) or the final a
– b*a(b*a)* is an equivalent expression that is deterministic
Very efficient validation
• It suffices to verify for each node labeled a that the word formed by the labels of its children is accepted by the finite state automaton A_a
• Possible to type check the document while scanning it, e.g. with a SAX parser
Very efficient validation (2)
<!ELEMENT a (b, c)>
<!ELEMENT b (d+)>
<a><b><d/></b><c/></a>
[Figure: the word automata A_a (states s, t, u) and A_b (states s’, t’) are run with a stack while the document is scanned; the document is accepted]
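
A minimal sketch of validation during a single SAX scan, using Python's standard xml.sax module. The content models are written as regular expressions over child labels (each label followed by a space); treating c and d as empty elements is an assumption, since the slide does not declare them. For brevity the sketch collects the child labels of each open element on a stack and matches them at the closing tag, rather than stepping the automaton one label at a time.

```python
import re
import xml.sax

# Content models for the two declarations; c and d are assumed to be empty.
CONTENT_MODELS = {"a": r"b c ", "b": r"(d )+", "c": r"", "d": r""}

class DTDValidator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []    # one list of child labels per open element
        self.valid = True

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1].append(name)
        self.stack.append([])

    def endElement(self, name):
        children = "".join(label + " " for label in self.stack.pop())
        model = CONTENT_MODELS.get(name)
        if model is None or not re.fullmatch(model, children):
            self.valid = False

def validate(document):
    handler = DTDValidator()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.valid
```

The stack depth is bounded by the depth of the document, not by its size.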
Warning
• The previous example can be checked with a simple automaton on words
• But not the following one
<!ELEMENT part (part*)>
• A stack is needed to accept <a>…<a></a>…</a>, with n opening tags followed by n closing tags
Some bad news for DTD
• Not closed under union
DTD1: …
<!ELEMENT used (ad*)>
<!ELEMENT ad (year, brand)>
DTD2: …
<!ELEMENT new (ad*)>
<!ELEMENT ad (brand)>
• L(DTD1) ∪ L(DTD2) cannot be described by a DTD but can be described easily by a tree automaton
– Problem: the type of ad depends on its parent
• Also not closed under complement
• Limited expressive power
Car example continued
[Figure: a Car tree with Used and New subtrees; labels Brand "Renault", Year "2008", Brand "BMW"]
• The best DTD we can choose does not distinguish between ads for used and new cars
– <!ELEMENT ad (year?, brand)>
Decoupled types in XML schema
• Each type corresponds to a label, not conversely
car: [car] (used + new)*
used: [used] (ad1*)
new: [new] (ad2*)
ad1: [ad] (year, brand)
ad2: [ad] (brand)
• Tags are in brackets; type names are to the left of the colon
• Nice closure properties
• Many other « gadgets » in XML schemas
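
The decoupled types above can be checked top-down: the type of a child is determined by the type of its parent and the child's tag. The sketch below uses Python's xml.etree.ElementTree; the leaf types year and brand and the CHILD_TYPE table are assumptions added to make the example complete.

```python
import re
import xml.etree.ElementTree as ET

# type name -> (tag, regex over child type names, each followed by a space)
TYPES = {
    "car": ("car", r"((used|new) )*"),
    "used": ("used", r"(ad1 )*"),
    "new": ("new", r"(ad2 )*"),
    "ad1": ("ad", r"year brand "),
    "ad2": ("ad", r"brand "),
    "year": ("year", r""),    # assumed leaf type
    "brand": ("brand", r""),  # assumed leaf type
}
# For each type, the unique type assigned to a child with a given tag.
CHILD_TYPE = {
    "car": {"used": "used", "new": "new"},
    "used": {"ad": "ad1"},
    "new": {"ad": "ad2"},
    "ad1": {"year": "year", "brand": "brand"},
    "ad2": {"brand": "brand"},
    "year": {},
    "brand": {},
}

def conforms(element, type_name):
    tag, model = TYPES[type_name]
    if element.tag != tag:
        return False
    child_types = []
    for child in element:
        child_type = CHILD_TYPE[type_name].get(child.tag)
        if child_type is None or not conforms(child, child_type):
            return False
        child_types.append(child_type)
    word = "".join(t + " " for t in child_types)
    return re.fullmatch(model, word) is not None
```

The same tag ad gets type ad1 under used and type ad2 under new, which is exactly what a DTD cannot express.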
XML typing: XML Schemas
XML Schema
• Often criticized as unnecessarily complicated
• Boosted by Web services
• Richer than DTD
– Decoupled types
– Close to deterministic top-down tree automata
• XML schemas are extensible
• Many other useful functionalities
– Namespaces
– Atomic types
– Integrity constraints, etc.
An XML schema is an XML document
• Since it uses an XML syntax, it can use XML tools
– Editor
– Type checker
– Etc.
• The type of all XML schemas can be described with an XML schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.net-language.com">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="character" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="friend-of" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="since" type="xs:date"/>
              <xs:element name="qualification" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Simple elements and atomic types
Definition: <xs:element name="xxx" type="yyy"/>
with common types: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time
Examples
<xs:element name="lastname" type="xs:string"/>
<xs:element name="age" type="xs:integer"/>
<xs:element name="dateborn" type="xs:date"/>
Instances of such elements
<lastname>Refsnes</lastname>
<age>34</age>
<dateborn>1968-03-27</dateborn>
Attributes
Definition: <xs:attribute name="xxx" type="yyy"/>
Example
<xs:attribute name="lang" type="xs:string"/>
Instance of such an attribute
<lastname lang="EN">Smith</lastname>
Complex elements
• Empty element
<product pid="1345"/>
• Contains only other elements
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
• Contains only text
<food type="dessert">Ice cream</food>
• Contains both elements and text
<description>It happened on <date lang="norwegian">03.99</date>..</description>
Restriction of simple elements
<xs:element name="age">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
Other restrictions: enumerated types, patterns, etc.
Restriction on complex elements
<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Possible to name a type
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Only the "employee" element can use the specified complex type (<xs:sequence> indicates an order on child elements)
Alternative
<xs:element name="employee" type="personinfo"/>
<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
Other gadgets
• Import of types associated to a namespace
– <import namespace="http://..." schemaLocation="http://..."/>
• Possible to include an existing schema
– <include schemaLocation="http://..."/>
• Possible to extend/redefine an existing schema
– <redefine schemaLocation="http://..."> ... Extensions ... </redefine>
Example: a DTD
<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
<!ATTLIST EMAIL
  LANGUAGE (Western|Greek|Latin|Universal) "Western"
  ENCRYPTED CDATA #IMPLIED
  PRIORITY (NORMAL|LOW|HIGH) "NORMAL">
<!ELEMENT TO (#PCDATA)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT CC (#PCDATA)>
<!ELEMENT BCC (#PCDATA)>
<!ATTLIST BCC HIDDEN CDATA #FIXED "TRUE">
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ENTITY SIGNATURE "Bill">
The same in XML schema (more verbose)
Note: this example is written in the Microsoft XML-Data schema dialect (namespace urn:schemas-microsoft-com:xml-data), not in W3C XML Schema.
<?xml version="1.0"?>
<Schema name="email"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <AttributeType name="language" dt:type="enumeration" dt:values="Western Greek Latin Universal"/>
  <AttributeType name="encrypted"/>
  <AttributeType name="priority" dt:type="enumeration" dt:values="NORMAL LOW HIGH"/>
  <AttributeType name="hidden" default="true"/>
  <ElementType name="to" content="textOnly"/>
  <ElementType name="from" content="textOnly"/>
  <ElementType name="cc" content="textOnly"/>
  <ElementType name="bcc" content="mixed">
    <attribute type="hidden" required="yes"/>
  </ElementType>
  <ElementType name="subject" content="textOnly"/>
  <ElementType name="body" content="textOnly"/>
  <ElementType name="email" content="eltOnly">
    <attribute type="language" default="Western"/>
    <attribute type="encrypted"/>
    <attribute type="priority" default="NORMAL"/>
    <element type="to" minOccurs="1" maxOccurs="*"/>
    <element type="from" minOccurs="1" maxOccurs="1"/>
    <element type="cc" minOccurs="0" maxOccurs="*"/>
    <element type="bcc" minOccurs="0" maxOccurs="*"/>
    <element type="subject" minOccurs="0" maxOccurs="1"/>
    <element type="body" minOccurs="0" maxOccurs="1"/>
  </ElementType>
</Schema>
Where to place XML schemas
[Figure: nested classes DTD, XML schema, deterministic top-down tree automata, tree automata]
• Some bizarre restriction
– Inside an element, no two types with the same tag
• Closer to DTDs than to tree automata
• Efficient type validation
Exercise: coupled vs decoupled
• Write a realistic DTD1 for new cars
– With make, model, engine…
• Write a realistic DTD2 for used cars
– Also year, miles, zipcode
• Write an XML schema for L(DTD1) ∪ L(DTD2)
– Using a decoupled schema
Automata to compute
Another use of automata: XPath
$x in //a/b
[Figure sequence, Example: //a/b: the NFA for //a/b and the equivalent DFA with states (0), (01), (02) are run step by step over a document tree with nodes labeled a and b; the b nodes reached in state (02) are bound to $x]
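
The run shown on these slides can be sketched as a top-down traversal that carries a DFA state. The DFA states are named after the NFA state sets of the slides, (0), (01), (02); the tree encoding (label, [children]) and the function names are assumptions made for the example. The recursion's call stack plays the role of the stack of states.

```python
def step(state, label):
    """DFA for //a/b: '01' after reading an a, '02' after a followed by b."""
    if label == "a":
        return "01"
    if label == "b" and state == "01":
        return "02"
    return "0"

def select(tree, state="0", path=()):
    """Yield the paths (child indexes from the root) of nodes matching //a/b."""
    label, children = tree
    state = step(state, label)
    if state == "02":
        yield path  # this node is bound to $x
    for index, child in enumerate(children):
        yield from select(child, state, path + (index,))

DOC = ("b", [("a", [("b", []), ("a", [("b", [])])]), ("b", [])])
MATCHES = list(select(DOC))
```

Each node is visited once, so evaluation is linear in the size of the document once the DFA is available.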
Determinization: exponential blow-up
• Example: //a/*/*/b
Proposal: k-pebble transducers [Milo, Suciu, Vianu]
[Figure label: stack]
k-pebble transducers: result
[Figure: a tree with a root and nodes labeled a, b, c]
• Capture a core aspect of XQuery but not the data management part
Graphs and bisimulation
Graph
• Graph semistructured data
• Graph simulation
• Graph bisimulation
• Data guides
Semistructured data
• With ID-IDREF, XML is a graph model as well
• OEM = Object Exchange Model
Labeled (rooted) graph (E, r)
– Set N of nodes
– A finite ternary relation E ⊆ N × N × Label
  E(s, t, l) = there is an edge from s to t labeled l
– Possibly a root r
[Figure: OEM graph rooted at &r with nodes &p1 to &p8 and &c, and edges labeled employee, company, manages, managedby, worksfor]
Equality revisited
• {1, 2, 2, 1, 5} = {1, 2, 5}
– Ignores the order
• For trees, if we ignore the order of siblings and use a “set” semantics
[Figure: two trees with root a and children labeled b, c, d that are equal under this semantics]
Simulation
A simulation of (E, r) with (E’, r’) is a relation R between the nodes of E and E’ such that
1. R(r, r’)
2. if R(s, s’) and E(s, t, l) for some l, then there exists t’ with R(t, t’) and E’(s’, t’, l)
(we simulate a move in E by a move in E’)
Bisimulation
Given E, E’, a relation R is a bisimulation if R is a simulation of E with E’ and R⁻¹ is a simulation of E’ with E
Examples
[Figure: three graphs G, G’, G” with edges labeled a and d; one pair is related by a bisimulation, the other is not]
• They all have the same paths from the root
Graph bisimulation
[Figure sequence: a data graph (root; c1, c2; employees e1 to e4; projects p1 to p9 with string values such as "exercise", "lecture", "finance"; edges programmer, statistician, employee, project, workson, leads, consults) is related step by step, through a relation R, to a schema graph with nodes t1, t2, edges programmer | statistician, employee, projects, _, and STRING leaves]
Computing bisimulation in ptime
• Start with R = N × N’ (for N, N’ the sets of nodes)
• While there exists (x, x’) in R that violates the definition of simulation in either direction, remove (x, x’) from R
• This computes the maximal bisimulation in ptime
(Note: this maximal bisimulation exists because ∅ is a bisimulation, and if R1, R2 are bisimulations, R1 ∪ R2 is also one)
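
The refinement loop above can be sketched directly. This is a naive version meant to mirror the slide, not an optimized algorithm; the encoding of edges as (source, target, label) triples and the two small example graphs are assumptions made for the illustration.

```python
def maximal_bisimulation(nodes1, edges1, nodes2, edges2):
    """Edges are sets of (source, target, label) triples."""
    def simulated(s, s2, edges_a, edges_b, related):
        # Every move s -l-> t in edges_a is matched by a move s2 -l-> t2.
        return all(
            any(src2 == s2 and l2 == l and related(t, t2)
                for src2, t2, l2 in edges_b)
            for src, t, l in edges_a if src == s)

    relation = {(x, y) for x in nodes1 for y in nodes2}
    changed = True
    while changed:
        changed = False
        for x, y in list(relation):
            forth = simulated(x, y, edges1, edges2,
                              lambda t, t2: (t, t2) in relation)
            back = simulated(y, x, edges2, edges1,
                             lambda t2, t: (t, t2) in relation)
            if not (forth and back):
                relation.discard((x, y))
                changed = True
    return relation

# Hypothetical graphs: G1 and G2 are bisimilar, G3 and G4 are not.
G1 = ({"r", "x"}, {("r", "x", "a")})
G2 = ({"r2", "y", "z"}, {("r2", "y", "a"), ("r2", "z", "a")})
G3 = ({"r", "u", "v", "w"}, {("r", "u", "a"), ("u", "v", "d"), ("r", "w", "a")})
G4 = ({"r2", "u2", "v2"}, {("r2", "u2", "a"), ("u2", "v2", "d")})
```

The two graphs are bisimilar exactly when the pair of roots survives in the result. Each pass removes at least one pair or stops, so the loop runs at most |N| × |N’| passes.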
What does this have to do with typing?
• Take a very complex graph E
• How do you describe it?
• By a “smaller” graph T that is bisimilar to E
• There may be several bisimulations, with more and more details
Rough bisimulation
[Figure: summary graph with classes Root {&r}, Bosses {&p1, &p4, &p6}, Regulars {&p2, &p3, &p5, &p7, &p8}, Company {&c}, and edges labeled employee, company, manages, managedby, worksfor]
More precise one
[Figure: summary graph with classes Root {&r}, Employees {&p1, &p3, P4, &p5, &p6, &p7, &p8}, Bosses {&p1, &p4, &p6}, Regulars {&p2, &p3, &p5, &p7, &p8}, Company {&c}, and edges labeled company, employee, manages, managedby, worksfor]
Other “typing”: data guide
• See the graph as an automaton with the root as the start state and only accepting states
• This automaton accepts all the paths from the root
• Obtain an equivalent, minimal, deterministic automaton
– This is the data guide for the graph
– It can be used for describing the data
– It can be used to support graphical query interfaces
Data guide
• Gives all the paths from the root
• Automata minimization
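
Building a data guide amounts to determinizing the graph seen as an automaton. The sketch below performs only the subset construction, not the minimization step; the edge encoding and the small example graph are assumptions made for the illustration.

```python
def data_guide(edges, root):
    """Determinize a rooted labeled graph; edges are (source, target, label)."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        current = todo.pop()
        if current in guide:
            continue
        # Group the targets of all outgoing edges by label.
        by_label = {}
        for source, target, label in edges:
            if source in current:
                by_label.setdefault(label, set()).add(target)
        guide[current] = {label: frozenset(targets)
                          for label, targets in by_label.items()}
        todo.extend(guide[current].values())
    return start, guide

# Hypothetical graph: two employees under the root, sharing project p1.
EDGES = {("root", "e1", "employee"), ("root", "e2", "employee"),
         ("e1", "p1", "workson"), ("e2", "p2", "workson"),
         ("e2", "p1", "leads")}
START, GUIDE = data_guide(EDGES, "root")
```

Each state of the guide is the set of nodes reachable by one label path from the root, so every path of the data appears exactly once in the guide.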
[Figure: the data graph (root, c1, c2, e1 to e4, p1 to p9) and its data guide, whose states are sets of nodes such as {root}, {c1}, {c2}, {e1, e2}, {e2, e3}, {e1, e2, e3, e4}, {p1, p3, p5, p7, p9}, {p2, p4, p6, p8}, connected by edges labeled programmer, statistician, employee, project, workson, leads, consults]
What you should remember
• Tree automata = theoretical foundation for XML
• Bottom-up tree automata are nice
• Top-down and determinism together: limitations
• XML documents do not have to be typed
• Typing may be very useful for XML
– In particular for software managing XML data
• DTD: simple but limited
• XML Schema: more expressive but still limited
• Graph data: bisimulation is the answer
Thank you
Bibliography
• TATA: Tree Automata Techniques and Applications, tata.gforge.inria.fr/
– The book on the topic, and it is free
• XML Schema: see http://w3.org and http://www.w3schools.com/schema/