XML Data Management Deterministic DTDs and Schemas Werner Nutt

How Expressive can a Schema Be?
This schema is a frequent example in teaching material on XML Schema:

  <xsd:element name="A" type="oneB"/>

  <xsd:complexType name="onlyAs">
    <xsd:choice>
      <xsd:sequence>
        <xsd:element name="A" type="onlyAs"/>
      </xsd:sequence>
      <xsd:element name="A" type="xsd:string"/>
    </xsd:choice>
  </xsd:complexType>

  <xsd:complexType name="oneB">
    <xsd:choice>
      <xsd:element name="B" type="xsd:string"/>
      <xsd:sequence>
        <xsd:element name="A" type="onlyAs"/>
        <xsd:element name="A" type="oneB"/>
      </xsd:sequence>
      <xsd:sequence>
        <xsd:element name="A" type="oneB"/>
        <xsd:element name="A" type="onlyAs"/>
      </xsd:sequence>
    </xsd:choice>
  </xsd:complexType>

• What would documents look like that satisfy this schema? Arbitrarily deep binary trees of A elements with a single B element.
• How would one check validity? What would be the cost?
• What are the pros and cons of allowing such schemas?

Let’s see what SAXON says …

Here is the Full Error Message from Eclipse
• cos-element-consistent: Error for type 'oneB'. Multiple elements with name 'A', with different types, appear in the model group.
• cos-element-consistent: Error for type 'onlyAs'. Multiple elements with name 'A', with different types, appear in the model group.
  I.e., in a given context, elements with the same name must have the same content. Easy to check!
• cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.
• cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.
  That’s more subtle ...

The Country Example in XML Schema

  <?xml version="1.0" encoding="UTF-8"?>
  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
              targetNamespace="http://www.example.org/country"
              xmlns="http://www.example.org/country"
              elementFormDefault="qualified">
    <xsd:element name="country">
      <xsd:complexType>
        <xsd:choice>
          <xsd:element name="king" type="xsd:string"></xsd:element>
          <xsd:element name="queen" type="xsd:string"></xsd:element>
          <xsd:sequence>
            <xsd:element name="king" type="xsd:string"></xsd:element>
            <xsd:element name="queen" type="xsd:string"></xsd:element>
          </xsd:sequence>
        </xsd:choice>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>

As DTD:

  <!ELEMENT country (king | queen | (king, queen))>

Also this is not validated … • cos-nonambig: king and king (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles. Let’s check what this means!

What the W3C Standard Explains …
Schema Component Constraint: Unique Particle Attribution
A content model must be formed such that during ·validation· of an element information item sequence, the particle contained directly, indirectly or ·implicitly· therein with which to attempt to ·validate· each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

Questions and Ideas Questions: • How can one make the standard formal? • How can a validator implement the standard? Ideas: • Content models are specified by regular expressions • A regular expression E can be translated into a finite state automaton A (Glushkov automaton) that checks which strings satisfy E Construct A from E and check whether A is deterministic

Formalization
• Alphabet Σ (i.e., set of symbols): the element names occurring in the content model. In the following, we denote concatenation by a dot (or simply by juxtaposition), no longer by a comma.
• Regular expressions over Σ are generated by the rule
    e, f → a | (e f) | (e | f) | (e)+ | (e)*
  where e, f are expressions and a ∈ Σ
• Language L(e) of an expression e (inductively defined)
• Exercise: Which of the following are in the language defined by a* (b | c) a+ ?
  – aba
  – abca
  – aab
  – aaacaaa
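The exercise can be checked mechanically. The following is a minimal sketch, assuming every element name is a single letter so that a word is just a Python string; the helper name `in_language` is hypothetical, and Python's `re` engine merely stands in for the language L(e).

```python
import re

# The content model a* (b | c) a+ written in Python's regex syntax.
# Assumption: each element name is one letter, so a word is a plain string.
CONTENT_MODEL = re.compile(r"a*(b|c)a+")


def in_language(word: str) -> bool:
    """Return True iff the complete word matches the content model."""
    return CONTENT_MODEL.fullmatch(word) is not None


for w in ["aba", "abca", "aab", "aaacaaa"]:
    print(w, in_language(w))
# aba True
# abca False
# aab False
# aaacaaa True
```

`fullmatch` is used rather than `match` or `search` because membership in L(e) concerns the whole word, not a prefix or a substring.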

Regular Expressions and DTDs
These are formalizations of DTDs and validation:
A DTD is a pair (d, s) where
• s is the start symbol
• d maps every Σ-symbol to a regular expression over Σ
A document tree t satisfies d (t is valid wrt d) iff
• the root of t is labeled s
• for every node n in t, with symbol a, the string formed by the names of the children of n satisfies d(a)
Validation is checking whether a string satisfies a regexp
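The definition of validity translates almost literally into code. The sketch below is one possible reading, not a real DTD validator: the `Node` class, the function names, and the encoding of child names (each name followed by a space) are assumptions made for illustration, and Python's backtracking regex engine is used in place of a deterministic automaton.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def word_of(node: Node) -> str:
    """Encode the child names as one string, each name followed by a space."""
    return "".join(child.label + " " for child in node.children)


def satisfies(tree: Node, d: dict, s: str) -> bool:
    """Check that the root is labeled s and every node's children match d."""
    if tree.label != s:
        return False
    stack = [tree]
    while stack:
        node = stack.pop()
        model = d.get(node.label)
        if model is None or re.fullmatch(model, word_of(node)) is None:
            return False
        stack.extend(node.children)
    return True


# Hypothetical DTD (d, s) for the country example; king and queen are leaves.
d = {
    "country": r"king |queen |king queen ",
    "king": r"",
    "queen": r"",
}

good = Node("country", [Node("king"), Node("queen")])
bad = Node("country", [Node("queen"), Node("king")])
print(satisfies(good, d, "country"))  # True
print(satisfies(bad, d, "country"))   # False
```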

Markings
Distinguish between the different occurrences of a symbol in a regexp by using numbers: markings of regexps
Examples:
• a₁* (b₂ | c₃) a₄+ is a marking of a* (b | c) a+
• king₁ | queen₂ | king₃ queen₄ is a marking of king | queen | king queen
Definition: A marking e′ of a regular expression e is an assignment of numbers to the symbol occurrences in e such that different occurrences of the same symbol receive different numbers.

Unmarked Version
Consider a regular expression e and a marking e′ of e.
Definition: For w ∈ L(e′), we denote by w# the corresponding unmarked string in L(e).
Example: If w = b₂ a₁ a₃, then w# = baa
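Marking and unmarking are simple enough to sketch in a few lines. The representation below is an assumption made for illustration: an expression is a list of tokens, a marked symbol is a pair (name, number), and the function names `mark` and `unmark` are hypothetical.

```python
OPERATORS = {"(", ")", "|", "*", "+"}


def mark(tokens):
    """Number the symbol occurrences from left to right; keep operators."""
    marked, counter = [], 0
    for token in tokens:
        if token in OPERATORS:
            marked.append(token)
        else:
            counter += 1
            marked.append((token, counter))
    return marked


def unmark(word):
    """Map a marked word w to its unmarked version w#."""
    return [name for name, _ in word]


# a* (b | c) a+  becomes  a1* (b2 | c3) a4+
print(mark(["a", "*", "(", "b", "|", "c", ")", "a", "+"]))

# w = b2 a1 a3  gives  w# = b a a
print(unmark([("b", 2), ("a", 1), ("a", 3)]))  # ['b', 'a', 'a']
```

Numbering left to right is only one valid marking; any assignment that separates occurrences of the same symbol would do.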

“Unique Particle Attribution”: Formalization (Brüggemann-Klein/Wood [1998])
Definition: A regular expression r with marking r′ is deterministic iff there are no strings uxv, uyw ∈ L(r′) with
• |x| = |y| = 1
• x ≠ y (x and y are different marked symbols)
• x# = y# (their unmarking is the same).
Example: (a | b)* a is not deterministic because there are
• the marking (a₁ | b₂)* a₃
• the strings b₂ a₁ a₃ (with u = b₂, x = a₁, v = a₃) and b₂ a₃ (with u = b₂, y = a₃, w the empty word)
How can we check whether an expression is deterministic?

Finite State Automata
• Regular languages can also be defined using automata
• A finite state automaton (FSA) consists of:
  – a set of states Q
  – an alphabet Σ (i.e., a set of symbols)
  – a transition function δ, which maps every pair (q, a) to a set of states q′
  – an initial state q₀
  – a set of accepting states F
• A word a₁…aₙ is in the language defined by an automaton if there is a path from q₀ to a state in F with edges labeled a₁, …, aₙ
The automaton is deterministic if every pair (q, a) is mapped to only a single state.
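The acceptance condition can be sketched by tracking the set of states reachable after each symbol. This is a minimal illustration under stated assumptions: the transition function is a dictionary from (state, symbol) pairs to sets of states, missing pairs mean "no transition", and the two-state example automaton is hypothetical (it is not the automaton from the slides' figure).

```python
def accepts(delta, initial, accepting, word):
    """Return True iff some path labeled with word leads from initial to F."""
    current = {initial}
    for symbol in word:
        # Follow every transition labeled symbol from every reachable state.
        current = {q2 for q in current for q2 in delta.get((q, symbol), set())}
    return bool(current & accepting)


def is_deterministic(delta):
    """No pair (q, a) may lead to more than one state."""
    return all(len(targets) <= 1 for targets in delta.values())


# Hypothetical automaton for "words over {a, b} that end in a".
delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
}

print(accepts(delta, "q0", {"q1"}, "ba"))  # True
print(accepts(delta, "q0", {"q1"}, "ab"))  # False
print(is_deterministic(delta))             # False
```

Tracking a set of states rather than a single state is what lets the same function handle automata in which a pair (q, a) has several successors.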

Which Language Does this FSA Define?
[Figure: state diagram of an FSA with states q₀, q₁, q₂, q₃ and edges labeled a, b, c]

Non-Deterministic Automata
• An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a
  What words are in the language of a non-deterministic automaton?
• We now create a Glushkov automaton from a regular expression

Creating a Glushkov Automaton from a Regular Expression
Step 1: Create a marking of the expression
  a* (b|c) a+   becomes   a₁* (b₁|c₁) a₂+

Creating a Glushkov Automaton from a Regular Expression
Step 2: Create a state q₀ and create a state for each subscripted letter
  a₁* (b₁|c₁) a₂+
Step 3: Choose as accepting states all subscripted letters with which it is possible to end a word
How do we find these states?
[Figure: the states q₀, a₁, b₁, c₁, a₂]

Creating a Glushkov Automaton from a Regular Expression
Step 4: Create a transition from a state lᵢ to a state kⱼ if there is a word in which kⱼ follows lᵢ. Label the transition with k.
  a₁* (b₁|c₁) a₂+
How do we find these transitions?
[Figure: the states q₀, a₁, b₁, c₁, a₂]

Exercises What are the Glushkov automata of • a* b (a b)* • (a | b)* a (a | b) • (a | b)* a ?

Recognizing Deterministic Regular Expressions Theorem (Book et al 1971, Brüggemann-Klein, Wood, 1998) A regular expression is deterministic (one-unambiguous) iff its Glushkov automaton is deterministic.

Construction of the Glushkov Automaton
For an arbitrary alphabet Σ and a language L ⊆ Σ* we define two sets
  first(L) = { a ∈ Σ | ∃ u ∈ Σ*. a u ∈ L }
  last(L) = { a ∈ Σ | ∃ u ∈ Σ*. u a ∈ L }
and the function
  follow(L, a) = { b ∈ Σ | ∃ u, v ∈ Σ*. u a b v ∈ L }.
Consider an expression e and its marking e′.
We can construct the Glushkov automaton for e if we know the sets first(L(e′)), last(L(e′)), the function follow(L(e′), ·), and if we know whether ε ∈ L(e′) (ε is the empty word).
Why?

Construction of the Glushkov Automaton
Where do we get this info?
If e′ = a₁, then
• first(L(e′)) = {a₁}
• last(L(e′)) = {a₁}
• follow(L(e′), lᵢ) is empty for every lᵢ
Also, ε ∉ L(e′)
If e′ = (f | g), then
• first(L(e′)) = first(L(f)) ∪ first(L(g))
• last(L(e′)) = last(L(f)) ∪ last(L(g))
• follow(L(e′), lᵢ) is follow(L(f), lᵢ) if lᵢ occurs in f, and follow(L(g), lᵢ) if lᵢ occurs in g
Also, ε ∈ L(e′) iff ε ∈ L(f) or ε ∈ L(g)
For e′ = f*, f+, f g, exercise!
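The inductive computation can be sketched as follows, including one possible solution for the cases f*, f+ and f g, filled in here with the usual textbook rules. Everything about the representation is an assumption made for illustration: marked expressions are nested tuples, a marked symbol is a pair (name, number), and the names `analyse` and `is_deterministic` are hypothetical. The determinism test relies on the theorem that an expression is deterministic iff its Glushkov automaton is; the sketch is a naive version and makes no attempt to reach the quadratic bound.

```python
def analyse(e):
    """Return (nullable, first, last, follow) for a marked expression e.

    e is ("sym", name, number), ("alt", f, g), ("cat", f, g),
    ("star", f) or ("plus", f); follow maps each marked symbol to a set.
    """
    kind = e[0]
    if kind == "sym":
        p = (e[1], e[2])
        return False, {p}, {p}, {p: set()}
    if kind in ("alt", "cat"):
        n1, f1, l1, fo1 = analyse(e[1])
        n2, f2, l2, fo2 = analyse(e[2])
        follow = {**fo1, **fo2}  # marked symbols are distinct: no key clashes
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2, follow
        for p in l1:  # whatever can start g may follow whatever can end f
            follow[p] = follow[p] | f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last, follow
    if kind in ("star", "plus"):
        n, f, l, fo = analyse(e[1])
        follow = dict(fo)
        for p in l:  # iteration: a first symbol may follow a last symbol
            follow[p] = follow[p] | f
        return (kind == "star") or n, f, l, follow
    raise ValueError(f"unknown operator: {kind}")


def is_deterministic(e):
    """Glushkov automaton check: no state has two successors with one name."""
    _, first, _, follow = analyse(e)

    def unique(positions):
        names = [name for name, _ in positions]
        return len(names) == len(set(names))

    # first holds the successors of q0, follow[p] the successors of state p.
    return unique(first) and all(unique(s) for s in follow.values())


def sym(name, number):
    return ("sym", name, number)


# (a1 | b2)* a3
ends_in_a = ("cat", ("star", ("alt", sym("a", 1), sym("b", 2))), sym("a", 3))
# a1* (b2 | c3) a4+
exercise = ("cat", ("cat", ("star", sym("a", 1)),
                    ("alt", sym("b", 2), sym("c", 3))), ("plus", sym("a", 4)))
# king1 | queen2 | king3 queen4
country = ("alt", ("alt", sym("king", 1), sym("queen", 2)),
           ("cat", sym("king", 3), sym("queen", 4)))

print(is_deterministic(ends_in_a))  # False
print(is_deterministic(exercise))   # True
print(is_deterministic(country))    # False
```

For the country example the conflict is already visible in the first set, which contains both king₁ and king₃: after reading a king element, the validator cannot tell which particle it belongs to.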

Recognizing Deterministic Regular Expressions
Observation:
• For each operator, first, last, and follow can be computed in quadratic time. This yields an O(n³) algorithm.
Theorem (Brüggemann-Klein, Wood, 1998)
• There is an O(n²) algorithm to check whether a regexp is deterministic.

More Results
Theorems (Brüggemann-Klein, Wood, 1998)
• Not every regular language can be denoted by a deterministic regular expression. E.g., (a | b)* a (a | b)
• Deterministic regular languages are not closed under union, concatenation, or Kleene-star. I.e., there is no easy syntactic characterization
• If it exists, an equivalent deterministic regular expression can be constructed in exponential time. It is possible to help users, but that is costly

Theory for XML Schema XML schema allows schemas where • the same element appears with different types However, • it is illegal to have two elements of the same name, but different types in one content model. Also, content models must be deterministic. Consequence: Documents can be validated in a deterministic top-down pass

References This material draws upon slides by • Sara Cohen • Frank Neven, notes by • Leonid Libkin and the papers by A. Brüggemann-Klein and D. Wood