Welldesigned XML Data Marcelo Arenas and Leonid Libkin

Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto

Outline l Part 1 Database Normalization from the 1970 s and 1980 s l Part 2 Classical theory revisited: normalizing XML documents l Part 3 Classical theory re done: new justifications for normalization

Part 1: Classical Normalization l Design: decide how to represent the information in a particular data model. • Even for simple application domains there is a large number of ways of representing the data of interest. l We have to design the schema of the database. • Set of relations. • Set of attributes for each relation. • Set of data dependencies.

Designing a Database: An Example l Attributes: number, title, section, room. l Data dependency: every course number is associated with only one title. l Relational Schema: BAD alternative: R(number, title, section, room), number title GOOD alternative: S(number, title), T(number, section, room), number title

Problems with BAD: Update Anomaly number title section room CSC 258 Computer Organization 1 LP 266 CSC 258 Computer Organization 2 GB 258 CSC 258 Computer Organization 3 GB 248 CSC 434 Database Systems 1 GB 248 Title of CSC 258 is changed to Computer Organization I.

Problems with BAD: Update Anomaly number title section room CSC 258 Computer Organization I 1 LP 266 CSC 258 Computer Organization I 2 GB 258 CSC 258 Computer Organization I 3 GB 248 CSC 434 Database Systems 1 GB 248 Title of CSC 258 is changed to Computer Organization I. The instance stores redundant information.

Deletion Anomaly number title section room CSC 258 Computer Organization I 1 LP 266 CSC 258 Computer Organization I 2 GB 258 CSC 258 Computer Organization I 3 GB 248 CSC 434 Database Systems 1 GB 248 CSC 434 is not given in this term.

Deletion Anomaly number title section room CSC 258 Computer Organization I 1 LP 266 CSC 258 Computer Organization I 2 GB 258 CSC 258 Computer Organization I 3 GB 248 CSC 434 is not given in this term. Additional effect: all the information about CSC 434 was deleted.

Insertion Anomaly number title section room CSC 258 Computer Organization I 1 LP 266 CSC 258 Computer Organization I 2 GB 258 CSC 258 Computer Organization I 3 GB 248 A new course is created: (CSC 336, Numerical Methods)

Insertion Anomaly number title section room CSC 258 Computer Organization I 1 LP 266 CSC 258 Computer Organization I 2 GB 258 CSC 258 Computer Organization I 3 GB 248 CSC 336 Numerical Methods ? ? A new course is created: (CSC 336, Numerical Methods) The instance stores attributes that are not directly related.

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization CSC 258 1 LP 266 CSC 434 Database Systems CSC 258 2 GB 258 CSC 258 3 GB 248 CSC 434 1 GB 248 Title of CSC 258 is changed to Computer Organization I.

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization I CSC 258 1 LP 266 CSC 434 CSC 258 2 GB 258 CSC 258 3 GB 248 CSC 434 1 GB 248 Database Systems Title of CSC 258 CSC 434 is not given is changed in this term. to Computer Organization I. The instance does not store redundant information.

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization I CSC 258 1 LP 266 CSC 434 CSC 258 2 GB 258 CSC 258 3 GB 248 CSC 434 1 GB 248 Database Systems CSC 434 is not given in this term.

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization I CSC 258 1 LP 266 CSC 434 CSC 258 2 GB 258 CSC 258 3 GB 248 Database Systems CSC 434 A new course is not isgiven created: in this (CSC 336, term. Numerical Methods) The title of CSC 434 is not removed from the instance.

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization I CSC 258 1 LP 266 CSC 434 CSC 258 2 GB 258 CSC 258 3 GB 248 Database Systems A new course is created: (CSC 336, Numerical Methods)

Avoiding Update Anomalies number title number section room CSC 258 Computer Organization I CSC 258 1 LP 266 CSC 434 Database Systems CSC 258 2 GB 258 CSC 336 Numerical Methods CSC 258 3 GB 248 A new course is created: (CSC 336, Numerical Methods) No information about sections has to be provided. Each relation stores attributes that are directly related.

Normalization Theory l Main idea: a normal form defines a condition that a well designed database should satisfy. l Normal form: syntactic condition on the database schema. • Defined for a class of data dependencies. l Main problems: • • How to test whether a database schema is in a particular normal form. How to transform a database schema into an equivalent one satisfying a particular normal form.

BCNF: a Normal Form for FDs l Functional dependency (FD) over R(A 1, …, An) : Y, X, Y {A 1, …, An}. X l X Y : two rows with the same X values must have the same Y values. • number title : two rows with the same course number must have the same title. l Key dependency : X A 1 An • X is a key: two distinct rows must have distinct X values.

BCNF: a Normal Form for FDs l is a set of FD over R(A 1, …, An). l Relation schema R(A 1, …, An), is in BCNF if for every X Y in , X is a key. Not in BCNF: R(number, title, section, room), number title l A relational schema is in BCNF if every relation schema is in BCNF. In BCNF: S(number, title), T(number, section, room), number title

Normalization Theory Today l Normalization theory for relational databases was developed in the 70 s and 80 s. l Why do we need normalization theory today? • New data models have emerged: XML. • XML documents can contain redundant information. l Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.

XML Documents course @cno taken_by “CSC 258” @cno taken_by “CSC 434” student. . . student @sno @name @grade “st 1” “Fox” “B+” “A+” “st 1” “Fox”

XML Databases XML Schema: (D, ) D: courses course* course @cno course taken_by student* student @sno, @name, @grade student ε : Two students with the same @sno value must have the same name.

Redundancy in XML courses course info @cno taken_by “CSC 258” @cno taken_by “CSC 434” @sno @name “st 1” “Fox” student. . . student @sno @name @grade “st 1” “Fox” “B+” “A+” “st 1” “Fox”

XML Database Normalization DTD: Data dependency: courses course* course @cno course taken_by student* student @sno, @name, @grade student ε Two students with the same @sno value must have the same name.

XML Database Normalization DTD: Data dependency: courses course*, info* course @cno course taken_by student* student @sno, @grade student ε info @sno, @name Two students @sno is the identifier with theofsame info elements. @sno value must have the same name.

A “Non-relational” Example DBLP conf @title “ICDT” issue article @title “. . . ” @year “ 1999” @title “. . . ” conf. . . issue @year “ 1999” article @year “ 2001” @title “. . . ” @year “ 2001”

XNF: XML Normal Form l Proposed in [AL 02]. l It eliminates two types of anomalies. l It was defined for XML functional dependencies: DBLP. conf. @title DBLP. conf. issue. article. @year

Part 3: What was Missing? Justification! l What is a good database design? • Well known solutions: BCNF, 4 NF, … l But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation. l Previous work was specific for the relational model. • Classical problems have to be revisited in the XML context.

Justification of Normal Forms l Problematic to evaluate XML normal forms. • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status • as relational algebra. We do not even know if implication of XML FDs is decidable! l We need a different approach. • It must be based on some intrinsic characteristics of the • • data. It must be applicable to new data models. It must be independent of query/update/constraint issues. l Our approach is based on information theory.

Information Theory l Entropy measures the amount of information provided by a certain event. l Assume that an event can have n different outcomes with probabilities p 1, …, pn. Amount of information gained by knowing that event i occurred : Average amount of information gained (entropy) : Entropy is maximal if each pi = 1/n :

Entropy and Redundancies l Database schema: R(A, B, C), A B l Instance I: A B C 1 1 2 2 3 4 {1, …, l Pick a domain properly containing adom(I) : {1, …, 6} 6} • • Probability distribution: P(4) Probability distribution: P(2) = = 01 and P(a) = = 1/5, 0, a a≠ ≠ 2 4 • • Entropy: log 5 ≈ 2. 322 Entropy: log 1 = 0

Entropy and Normal Forms l Let be a set of FDs over a schema S. Theorem (S, ) is in BCNF if and only if for every instance of (S, ) and for every domain properly containing adom(I), each position carries non zero amount of information (entropy > 0). l This is a clean characterization of BCNF , but the measure is not accurate enough. . .

Problems with the Measure l The measure cannot distinguish between different types of data dependencies. l It cannot distinguish between different instances of the same schema: R(A, B, C), A B A B C 1 2 3 4 1 2 4 1 entropy = 0 1 5 entropy = 0

A General Measure Instance I of schema R(A, B, C), A B : A B C 1 2 3 1 2 4

A General Measure Instance I of schema R(A, B, C), A B : A B C 1 2 3 1 2 4 Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.

A General Measure Instance I of schema R(A, B, C), A B : A B 1 1 C 3 2 4 Computation: Initial setting: for pickevery a position X Pos(I) p Pos(I) – {p}, compute and pick k such that adom(I) probability distribution {1, P(a …, k}. | X), For a example, {1, …, k}. k = 7.

A General Measure Instance I of schema R(A, B, C), A B : A B 1 1 C 3 2 4 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.

A General Measure Instance I of schema R(A, B, C), A B : A B C 3 1 2 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.

A General Measure Instance I of schema R(A, B, C), A B : A B C 3 1 2 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) =

A General Measure Instance I of schema R(A, B, C), A B : A 1 B C 2 3 2 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) =

A General Measure Instance I of schema R(A, B, C), A B : A B C 1 2 3 1 2 1 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) =

A General Measure Instance I of schema R(A, B, C), A B : A B C 4 2 3 1 2 7 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) =

A General Measure Instance I of schema R(A, B, C), A B : A B C 1 2 3 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/

A General Measure Instance I of schema R(A, B, C), A B : A B C 3 1 2 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) =

A General Measure Instance I of schema R(A, B, C), A B : A 1 B C a 3 2 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) =

A General Measure Instance I of schema R(A, B, C), A B : A B C 2 a 3 1 2 7 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ For a ≠ 2, P(a | X) =

A General Measure Instance I of schema R(A, B, C), A B : A B C 1 a 3 1 2 6 Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}. P(2 | X) = 48/ (48 + 6 42) = 0. 16 For a ≠ 2, P(a | X) = 42/ (48 + 6 42) = 0. 14 Entropy ≈ 2. 8057 (log 7 ≈ 2. 8073)

A General Measure Instance I of schema R(A, B, C), A B : A B 1 1 C 3 2 4 Value : we consider the average over all sets X Pos(I) – {p}. • Average: 2. 4558 < log 7 (maximal entropy) • It corresponds to conditional entropy. • It depends on the value of k. . .

A General Measure l Previous value: l For each k, we consider the ratio: • How close the given position p is to having the maximum possible information content. l General measure:

Basic Properties l The measure is well defined: For every set of first order constraints defined over a schema S, every I inst(S, ), and every p Pos(I): exists. l Bounds:

Basic Properties l The measure does not depend on a particular representation of constraints. If 1 and 2 are equivalent: l It overcomes the limitations of the simple measure: R(A, B, C), A B A B C 1 2 3 1 2 4 4 1 1 0. 875 5 0. 781

Well-Designed Databases Definition A database specification (S, ) is welldesigned if for every I inst(S, ) and every p Pos(I), = 1. In other words, every position in every instance carries the maximum possible amount of information. We would like to test this definition in the relational world. . .

Relational Databases is a set of data dependencies over a schema S: l = : (S, ) is well designed. l is a set of FDs: (S, ) is well designed if and only if (S, ) is in BCNF. l is a set of FDs and MVDs: (S, ) is well designed if and only if (S, ) is in 4 NF. l is a set of FDs and JDs: • If (S, ) is in PJ/NF or in 5 NFR, then (S, ) is well designed. • The converse is not true. A syntactic characterization of being well designed is given in [AL 03].

Relational Databases l If (S, ) is in DK/NF, then (S, ) is well designed. The converse is not true. l The problem of verifying whether a relational schema is well designed is undecidable. l If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem is co NEXPTIME complete. • If each relation in S has at most m attributes, then the problem is complete. Now we would like to apply our definition in the XML world. . .

XML Databases l XML schema: (D, ). • • D is a DTD. is a set of data dependencies over D. l We would like to evaluate XML normal forms. l The notion of being well designed extends from relations to XML. • The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).

Positions in an XML Tree DBLP conf @title “ICDT” issue article @title “. . . ” @year “ 1999” @title “. . . ” @year “ 2001”

Well-Designed XML Data l We consider k such that adom(T) {1, …, k}. l For each k : l We consider the ratio: l General measure:

XNF: XML Normal Form l For arbitrary XML data dependencies: Definition An XML specification (D, ) is well designed if for every T inst(D, ) and every p Pos(T), = 1. l For functional dependencies: Theorem An XML specification (D, ) is in XNF if and only if (D, ) is well designed.

Normalization Algorithms: BCNF l Relation schema: R(X, Y, Z), • Not in BCNF: X Y and X A, for every A Z. l Basic decomposition: replace R(X, Y, Z) by S(X, Y) and T(X, Z). l Example: R(number, title, section, room), number title S(number, title), T(number, section, room), number title

Normalization Algorithms: BCNF number title section room CSC 258 Computer Organization 1 LP 266 CSC 258 Computer Organization 2 GB 258 CSC 434 Database Systems 1 GB 248 number, title (R) number, section, room (R) number title number section room CSC 258 Computer Organization CSC 258 1 LP 266 CSC 434 Database Systems CSC 258 2 GB 258 CSC 434 1 GB 248

Normalization Algorithms: BCNF number title section room CSC 258 Computer Organization 1 LP 266 CSC 258 Computer Organization 2 GB 258 CSC 434 Database Systems 1 GB 248 S T number title number section room CSC 258 Computer Organization CSC 258 1 LP 266 CSC 434 Database Systems CSC 258 2 GB 258 CSC 434 1 GB 248

Normalization Algorithms: XNF The algorithm applies two transformations until the schema is in XNF. l If there is an anomalous FD of the form: DBLP. conf. issue. article. @year then apply the “DBLP example rule”. l Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Normalization Algorithms l The information theoretic measure can also be used for reasoning about normalization algorithms. l For BCNF and XNF decomposition algorithms: Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease.

Future Work l We would like to consider more complex XML constraints and characterize good designs they give rise to. l We would like to characterize 3 NF by using the measure developed in this paper. • In general, we would like to characterize “non perfect” normal forms. l We would like to develop better characterizations of normalization algorithms using our measure. • Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?

Backup Slides

XNF: XML Normal Form l Given a DTD D and a set of functional dependencies { }: (D, ) if for any XML tree T conforming to D and satisfying , it is the case that T l (D, )+ = { | (D, ) } l Functional dependency is trivial if it is implied by the DTD alone: (D, )

XNF: XML Normal Form l XML specification: a DTD D and a set of functional dependencies . l A Relational DB is in BCNF if for every non trivial functional dependency X Y in the specification, X is a key. l (D, ) is in XNF if: For each non trivial FD X p. @l in (D, )+, X p is in (D, )+.

A Normal Form for FDs and JDs Let be a set of FDs and JDs over a schema S: Theorem (S, ) is well designed if and only if for every R S and every nontrivial JD: implied by , there exists M {1, . . . , m} such that: 1. 2. For every i, j M, implies

A Normal Form for FDs and JDs (cont’d) Schema: S = { R(A, B, C) } and = { [AB, AC, BC], AB C, AC B }. l (S, ) is not in PJ/NF: {AB ABC, AC ABC} does not imply [AB, AC, BC]. l (S, ) is not in 5 NFR: [AB, AC, BC] is strong reduced and BC is not a superkey. l (S, ) is well designed.

Tree Tuples l Paths(D): all paths in a DTD D courses. course. @cno courses. course. student. @name l We distinguish two kinds of elements: attributes (@) and element types. l FDs are defined by means of a relational representation of XML documents.

XML Trees course v 0 course v 1 @cno “cs 100” @sno “ 123” . . . student v 2 @name “Fox” v 3 @grade “B+” @sno @name “ 456” “Smith” @grade “A-”

Tree Tuples Relational representation: tree tuples mappings t : Paths(D) Vertices Strings { } A tree tuple represents an XML tree: t(courses) = v 0 t(courses. course) = v 1 t(courses. course. @cno) = “cs 100” t(courses. course. student) = v 2 t(p) = , for the remaining paths courses v 0 course v 1 @cno “cs 100” student v 2

XML Tree: set of Tree Tuples course v 0 course v 1 @cno “cs 100” @sno “ 123” . . . student v 2 @name “Fox” v 3 @grade “B+” @sno @name “ 456” “Smith” @grade “A-”

Functional Dependencies for XML l Expressions of the form: X Y defined over a DTD D, where X, Y are finite non empty subsets of Paths(D). l XML tree T can be tested for satisfaction of X Y if: X Y Paths(T) Paths(D) l T X Y if for every pair u, v of tree tuples in T: u. X = v. X and u. X ≠ implies u. Y = v. Y

FD: Examples l University DTD: courses course* course @cno, student* student @sno, name, grade l Two students with the same @sno value must have the same name: courses. course. student. @sno courses. course. student. @name l Every student can have at most one grade in every course: { courses. course, courses. course. student. @sno } courses. course. student. @grade