Schema Refinement and Normal Forms Why schema refinement

  • Slides: 32
Download presentation
Schema Refinement and Normal Forms

Schema Refinement and Normal Forms

Why schema refinement? q Consider relation obtained from Hourly_Emps: I Hourly_Emps (sin, rating, hourly_wages,

Why schema refinement? q Consider relation obtained from Hourly_Emps: I Hourly_Emps (sin, rating, hourly_wages, hourly_worked) q Problems: I Update Anomaly: Can we change the wages only in the 1 st tuple? I Insertion Anomaly: What if we want to insert an employee and don’t know the hourly wage for her/his rating? I Insertion Anomaly II: We can only introduce a new rating scheme when we enter an employee with this rating into the relation I Deletion Anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! sin 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 421 B: Database Systems - Functional Dependencies rat wages hours 8 8 5 8 10 10 15 10 40 30 37 42 2

Functional Dependencies (FDs) q A functional dependency X Y (X and Y are sets

Functional Dependencies (FDs) q A functional dependency X Y (X and Y are sets of attributes; X, Y schema(R)) holds over relation R if for every allowable instance r of R: I Given two tuples t 1 and t 2 in instance r of R, if the X values of t 1 and I q t 2 agree, then the Y values must also agree t 1 r, t 2 r, X(t 1) = X (t 2) implies Y (t 1) = Y (t 2) FDs are determined by the semantics of the application I Let schema(R) = (a 1, a 2, …an); a set of attributes X schema(R) is a candidate key for a relation R if X R (= X a 1, a 2, …an) i. e. , X implies all attributes of R (typically, the key K of an entity-set has been chosen exactly because K R) I Relationship sets (many-one, one-one) deduce FDs I Other example: no two courses can meet in the same room at the same time l {room, time} {course} s Or short: room, time course 421 B: Database Systems - Functional Dependencies 3

FDs (contd. ) q An FD is a constraint on allowable instances of a

FDs (contd. ) q An FD is a constraint on allowable instances of a relation R (on data that might appear in R) I Given some instance r of R, we can check if it violates some FD f, but we CANNOT deduce that some FD f’ holds over R only because it holds over r q Goal (very informal): a relation should only contain key induced FDs I We will see that other FDs lead to some form of redundancy I We achieve this goal by decomposing relations into smaller relations 421 B: Database Systems - Functional Dependencies 4

Example FDs q Consider relation obtained from Hourly_Emps: I Hourly_Emps (sin, rating, hourly_wages, hourly_worked)

Example FDs q Consider relation obtained from Hourly_Emps: I Hourly_Emps (sin, rating, hourly_wages, hourly_worked) Notation: we will denote this relation schema by listing the attributes SRWH q Some FDs on Hourly_Emps: q I sin is the key: S SRWH I rating determines hourly_wages: R W S R W H 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 8 8 5 8 10 10 15 10 40 30 37 42 421 B: Database Systems - Functional Dependencies 5

Example Redundancy Anomalies q Update/Insertion/Deletion Anomalies due to R W q How to solve

Example Redundancy Anomalies q Update/Insertion/Deletion Anomalies due to R W q How to solve the problems? I Wages is an entity set by itself with two attributes. S 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 R 8 8 5 8 W 10 10 15 10 H S R H 40 30 37 42 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 8 8 5 8 40 30 37 42 421 B: Database Systems - Functional Dependencies + R W 8 5 10 15 6

Reasoning about FDs When improving relational design, we often need to ask “does this

Reasoning about FDs When improving relational design, we often need to ask “does this FD hold in this relation” q Given some FDs, we can usually infer additional FDs: q I I q Example: sin --> did, did --> location implies sin --> location Basic set F of FDs (determined a priori by designer) An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold F+ = closure of F: set of all FDs that are implied by F Armstrong’s Axioms (X, Y, Z are sets of attributes, i. e, X, Y, Z schema(R)): I Reflexivity: If Y X, then X Y I Augmentation: If X Y, then XZ YZ for any Z I Transitivity: If X Y and Y Z, then X Z These are sound and complete inference rules for FDs (only deduce FDs that hold and deduce all FDs) q Some additional rules (that follow from AA): I Union: If X Y and X Z, then X YZ I Decomposition: If X YZ, then X Y and X Z 421 B: Database Systems - Functional Dependencies 7

Reasoning about FDs (Example) q Example: Contracts(C, S, J, D, P, V) I contract-id,

Reasoning about FDs (Example) q Example: Contracts(C, S, J, D, P, V) I contract-id, supplier-id, pro. Ject-id, department-id, part-id, value I C is the key: C CSJDPV l (or more precise: {C} {C, S, J, D, P, V}) I Application Semantics: Project (J) purchases each part usingle contract: I I JP C Application Semantics: Department purchases at most one part from a supplier: SD P JP C, C CSJPDPV imply JP CSJDPV (transitivity; JP is also key candidate) SD P implies SDJ JP (augmentation) SDJ JP, JP CSJPDPV imply SDJ CSJDPV (transitivity; SDJ is also key candidate) 421 B: Database Systems - Functional Dependencies 8

Attribute Closure of X q Computing F+ given F is expensive I Size of

Attribute Closure of X q Computing F+ given F is expensive I Size of closure is exponential in #attributes q Attribute closure of X = X+ wrt F: I set of all attributes A such that X A is in F+ l Basis: X+ = X l Induction: If Z X+ , and Z A is in F, then add A to X+ l End when X + cannot be changed I Linear time q Example F = {A B, BC D} : I A+ = AB, B+ = B, C+ = C, (AB) + = AB l More precise: {A}+ = {A, B}, {B}+ = {B}, {C}+ = {C}, {AB} + = {A, B} I (BC)+ = BCD, I (AC)+ = ABCD I… 421 B: Database Systems - Functional Dependencies 9

Using Attribute Closure q Problem: Is FD X Y F+ ? I Is X

Using Attribute Closure q Problem: Is FD X Y F+ ? I Is X Y implied by F? q Solution: I Check whether Y in X+ wrt. F q Does F = {A B, B C, C D E } imply A E? i. e, is A E in the closure F+ ? I Equivalently, is E in A+ ? I 421 B: Database Systems - Functional Dependencies 10

Finding all key candidates q Sometimes we want to find all the key candidates

Finding all key candidates q Sometimes we want to find all the key candidates for a given relation I a set of attributes X is a key candidate if X R and there is no subset Y X such that Y R q Algorithm to determine the set K of all key candidates I For each set of attributes X l Set iskey = true l If X+ R then iskey = false l For each A X while iskey s If (X-A)+ = R then iskey = false l q If iskey then K = K {X} Exponential in the #attributes 421 B: Database Systems - Functional Dependencies 11

Example Bottom Up q R(S, C, R, P) and CS P, CS R, CR

Example Bottom Up q R(S, C, R, P) and CS P, CS R, CR S S=student-id, C=course-id, R=ranking, P=examining prof. CS P: a student has in a given course only one examining prof. CS R: in a given course a student receives exactly one ranking CR S: in a given course a specific ranking value can only be assigned to one student (relative evaluation of students) q Key candidates q Informal Reasoning + + I S = {S}, C = {C}, R = {R}, P = {P} I No attribute alone is primary key + I (SC) = {S, C, R, P} I I l because of CS P and CS R I C is not on the right site of any I (SR)+ = {S, R}, (SP)+ = {S, P} I (CR)+ = {C, R, S, P} I l because of CR S and then CS P I I (CP)+ = {C, P}, (RP)+ = {R, P} I (SRP)+ = {S, R, P} I I Not interesting: I l (SCR)+ / (SCP)+ / (CRP)+ / (SCRP)+ = superkey FD, therefore it must be part of any candidate key CS is key candidate as shown on the left CR is key candidate as shown on the left CP is not candidate No 3 -attribute or 4 -attribute set is key candidate because l l 421 B: Database Systems - Functional Dependencies It either does not contain C Or it is a superset of CS or CR 12

Finding the closure of F q q Exponential in #attributes Algorithm q Obvious dependencies

Finding the closure of F q q Exponential in #attributes Algorithm q Obvious dependencies I For each set of attributes X compute X+ l Add X A for each A in X+ - X l Ignore or drop some “obvious” dependencies that follow from others I Trivial FDs: right side is a subset of left side l Consequence: no point in computing {} + or closure of full set of attributes I Drop XY A if X A holds l Consequence: If X+ is all attributes then there is no need to compute closure of superset of X I Keep only fully decomposed FDs: keep X Y and X Z, ignore X YZ 421 B: Database Systems - Functional Dependencies 13

Example q q q q q F = AB C, C D, D A

Example q q q q q F = AB C, C D, D A A+ = A; B+ = B (nothing) C+ = ACD (add C A) D+ = DA (nothing new) (AB)+ = ABCD (add AB D; skip all supersets of AB) (BC)+ = ABCD (nothing new; skip all supersets of BC) (BD)+= ABCD (add BD C; skip all supersets of BD) (AC)+ = ACD; (AD)+ = AD; CD+ = ACD (nothing new) All other sets contain AB, BC, or BD, so skip Thus, the only interesting FDs that follow from F are C A, AB D, BD C 421 B: Database Systems - Functional Dependencies 14

Normalization q Remind the relation between redundancy anomalies and FDs I Only key-induced FDs

Normalization q Remind the relation between redundancy anomalies and FDs I Only key-induced FDs hold: there is no redundancy I Given A B and A is not key candidate: several tuples could have the same A value, and if so, they’ll have the same B value! q If a relation is in a certain normal form (BCNF, 3 NF) it is known that only certain kind of FDs hold. Hence, some forms of anomalies are avoided or minimized. 421 B: Database Systems - Functional Dependencies 15

BCNF and 3 NF q A relation R with FDs F is in Boyce-Codd

BCNF and 3 NF q A relation R with FDs F is in Boyce-Codd Normal Form (BCNF) if for all X A in F+ I A X (trivial FDs) or I X contains a key for R (X is superkey) q In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints. I If there are two tuples that agree upon the X value we cannot infer the A value in one tuple from the A value in the other tuple q A relation R with FDs F is in 3 NF (3 rd Normal Form), if for all X A in F+ I A X (trivial FDs) or I X contains a key for R (X is superkey) or I A is part of some key for R. q X A x x a ? If R is in BCNF, then it is in 3 NF 421 B: Database Systems - Functional Dependencies 16

Examples q An additional FD I Hourly_Emps (sin, rating, hourly_wages, hourly_worked) I S SRWH

Examples q An additional FD I Hourly_Emps (sin, rating, hourly_wages, hourly_worked) I S SRWH (sin key) and R W (rating determines wages) I R W violates both BCNF and 3 NF conditions q Dependency only on subkey I Works In(sin, did, daddress) I Did daddress violates both BCNF and 3 NF q Hourly_Emps 2 and Wages are in BCNF I Hourly_Emps 2 (sin, rating, hourly_worked) I Wages (rating, hourly_wages) q In 3 NF but not in BCNF I I q Course Time 421 305 Room T 10 -12 1100 TH 10 -12 1100 M 2 -4 Aaud Room time course, course room Key candidates are {room, time} and {time, course} The FD course room violates the BCNF condition. The FD course room does not violate 3 NF since room part of a key candidate 3 NF means “third normal form”. 1 NF and 2 NF do exist, but only of historical interest 421 B: Database Systems - Functional Dependencies 17

Decomposition q If a relation is not in the desired normal form we decompose

Decomposition q If a relation is not in the desired normal form we decompose it: given relation R with attributes A 1 … An. A decomposition of R consists of replacing R by two or more relations such that I Each new relation scheme contains a subset of the attributes of R I I I q (and no attributes that do not appear in R), and Every attribute of R appears as an attribute in one of the new relations The tuples of R are split such that the corresponding subtuples appear in the new relations. For instance, decompose Hourly_Emps SRWH into SRH and RW Tuples of SRH: SRH(SRWH) Tuples of RW: RW(SRWH) Disadvantage of Decomposition: I some queries become more expensive since several relations must be joined (salary = wages * hours) 421 B: Database Systems - Functional Dependencies 18

Lossless Join Decompositions q A decomposition of schema R into schemas R 1 with

Lossless Join Decompositions q A decomposition of schema R into schemas R 1 with attribute set X and R 2 with attribute set Y is lossless-join w. r. t. to a set of FDs F, if for any instance r of R satisfying F, I X(r) y(r) = r I I. e. , we can reconstruct r by joining the corresponding decomposed relation instances r 1 and r 2. q Lossless decomposition is crucial! A 1 4 7 B 2 5 2 C 3 6 8 AB(r) = r 1 BC(r) = r 2 Decomp. A 1 4 7 B 2 5 2 421 B: Database Systems - Functional Dependencies B 2 5 2 C 3 6 8 Reconstruction r 1 r 2 A B C 1 4 7 1 7 2 5 2 2 2 3 6 8 8 3 19

Lossless Join Decompositions q The decomposition of R into R 1 and R 2

Lossless Join Decompositions q The decomposition of R into R 1 and R 2 is lossless-join wrt F if and only if the closure of F contains I Let {a 1, … an} be the set of attributes that are both in R 1 and R 2: then {a 1, … an} R 1 or {a 1, . . an} R 2 I In particular, if U V (non trivial) holds over R then the decomposition of R into UV and R-V is lossless-join (rejoin does not create new tuples because U is key in UV and foreign key in R-V). We denote this as a decomposition along FD U V S R W H S 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 8 8 5 8 10 10 15 10 40 30 37 42 123 -22 -3661 231 -31 -5368 434 -26 -3151 612 -37 -2814 421 B: Database Systems - Functional Dependencies R 8 8 5 8 H 40 30 37 42 + R W 8 5 10 15 20

Decomposition into BCNF q Given relation R with FDs F and X Y FD

Decomposition into BCNF q Given relation R with FDs F and X Y FD violating BCNF I I Compute X+ (cannot be all attributes!) Decompose R into X+ and (R - X+) X. Repeat until all relations in BCNF. If X+ = {X, Y} we denote this as a decomposition along X Y R X X+ Decomposition is lossless join q Example CSJDPV, key C, JP C, SD P, J S q I Determine key candidates: C, JP I SD P violates BCNF l (SD)+ = {S, D, P} l decompose into SDP, CSJDV along FD SD P l Only FD in SDP is SD P; SD key candidate for this relation, so in BCNF l FDs in CSJDV are C CSJDV and J S; C key candidate, I J S violates BCNF for CSJDV l decompose CSJDV into JS and CJDV along J S For JS, F+ = {J S}; J is key candidate for this relation, so in BCNF l For CJDV, F+ = {C CSJDV}; C key candidate for this relation, so in BCNF l In general, several dependencies may cause violation of BCNF. The order in which we deal with them could lead to very different sets of relations q Note that we only need to look at F, not at F +. q 421 B: Database Systems - Functional Dependencies 21

BCNF and Dependency Preservation q Dependency preserving decomposition (intuitive) I Give relation R with

BCNF and Dependency Preservation q Dependency preserving decomposition (intuitive) I Give relation R with FDs F. A dependency preserving decomposition allows us to check each dependency in F by looking at one of the decomposed relations. q Not all BCNF decompositions are dependency preserving I E. g. , given CSJDPV, key C, JP C, SD P, J S: l Decompose into SDP, JS and CJDQV (covering SD P and J S) l But now checking JP C requires a join! I Example 2: time room course, course room l Keys are {time, room} and {time, course}, but course room has a left side which is not a superkey l Decompose along course room: R 1(course, room) and R 2(course, time) l Problem: you can only check the FD time room course when joining R 1 and R 2 q Possible workaround in some situations I E. g. , in CSJDPV example above, add JPC to the collection of relations; note that JPC tuples only stored for checking FD on a single relation (Redundancy!) 421 B: Database Systems - Functional Dependencies 22

Formal Definition q Projection: I Let R be a relation schema that is decomposed

Formal Definition q Projection: I Let R be a relation schema that is decomposed into two schemas R 1 and R 2. The projection of F on R 1 (denoted FR 1) is the set of FDs U V in the closure F+ (NOT ONLY F) that involve only attributes in R 1. Note that a dependency U V in F+ is in FR 1 only if all the attributes in U and V are in R 1. q Example: given CSJDPV, key C, JP C, SD P, J S: l l q Decompose into SDP, JS and CJDQV FSDP = {SD P, other trivial}, FJS = {J S, trivial}, FCJDQV = {C JDQV, trivial} Dependency preserving decomposition (formal) I Give relation R with FDs F decomposed into schemas R 1, … Rn. The decomposition is dependency-preserving if (FR 1 FR 2 … FRn)+ = F+. I Example above: lost JP C 421 B: Database Systems - Functional Dependencies 23

3 NF is Dependency Preserving q We can decompose any relation R with an

3 NF is Dependency Preserving q We can decompose any relation R with an arbitrary set FD, such that all decomposed relations are in 3 NF and the decomposition is dependency preserving. However, it might happen that not all dependencies are key induced dependencies I Room time course, course room I Key candidates are {room, time} and {time, course} I The FD course room does not violate 3 NF since room part of a key. However, it cannot be guaranteed by the key condition but must be guaranteed by other forms of constraints. q Obviously, the algorithm for decomposition into BCNF can be used to obtain a decomposition into 3 NF (typically can stop earlier) but it does not preserve dependencies. 421 B: Database Systems - Functional Dependencies 24

Summary If a relation is in BCNF, it is free of redundancies that can

Summary If a relation is in BCNF, it is free of redundancies that can be detected using FDs. Thus, trying to ensure that all relations are in BCNF is a good heuristic. q If a relation is not in BCNF, we can try to decompose it into a collection of BCNF relations. q I Must consider whether all FDs are preserved. If a lossless- join, dependency preserving decomposition into BCNF is not possible (or unsuitable, given typical queries), should consider decomposition into 3 NF q Decompositions should be carried out and/or reexamined while keeping performance requirements in mind. 421 B: Database Systems - Functional Dependencies 25

Overall Design Process q We have assumed schema R is given I R could

Overall Design Process q We have assumed schema R is given I R could have been generated when converting E-R diagram to a set of tables. I R could have been a single relation containing all attributes that are of interest (called universal relation). I Normalization breaks R into smaller relations. I R could have been the result of some ad hoc design of relations, which we then test/convert to normal form. q Might keep “bad” design with redundancy for performance reasons! I Application must guarantee consistency 421 B: Database Systems - Functional Dependencies 26

ER Model and Normalization When an E-R diagram is carefully designed, identifying all entities

ER Model and Normalization When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated from the ER diagram should not need further normalization. q However, in a real (imperfect) design there can be FDs from non-key attributes of an entity to other attributes of the entity q E. g. employee entity with attributes departmentnumber and department-address, and an FD department-number department-address q I Good design would have made department an entity q FDs from non-key attributes of a relationship set possible, but rare --- most relationships are binary 421 B: Database Systems - Functional Dependencies 27

Other Design Issues Some aspects of database design are not caught by normalization q

Other Design Issues Some aspects of database design are not caught by normalization q Examples of bad database design, to be avoided: Instead of earnings(company-id, year, amount), use q I earnings-2000, earnings-2001, earnings-2002, etc. , all on the schema (company-id, earnings). l Above are in BCNF, but make querying across years difficult and needs new table each year I company-year(company-id, earnings-2000, earnings-2001, earnings-2002) l l l Also in BCNF, but also makes querying across years difficult and requires new attribute each year. Is an example of a crosstab, where values for one attribute become column names Used in spreadsheets, and in data analysis tools 421 B: Database Systems - Functional Dependencies 28

Minimal Cover for a Set of FDs q Usage of a minimal cover G

Minimal Cover for a Set of FDs q Usage of a minimal cover G for a given set of FDs F I Closure of F = closure of G I Right hand side of each FD in G is a single attribute I If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. q Intuitively, every FD in G is needed, and “as small as possible” in order to get the same closure as F I E. g. , A B, ABCD E, EF GH, ACDF EG has the following minimal cover: A B, ACD E, EF G and EF H q Rules to create the cover I Put the FDs in standard form: only a single attribute on the right side I Minimize left side of each FD: check whether attribute of left side can be deleted while preserving equivalence I Delete redundant FDs 421 B: Database Systems - Functional Dependencies 29

Dependency Preserving Decomposition into 3 NF q Let R be a relation with a

Dependency Preserving Decomposition into 3 NF q Let R be a relation with a set F of FDs that is a minimal cover I decompose R into R 1…Rn according to the BCNF algorithm such that each R 1…Rn is 3 NF. Let Fi be the projection of F onto the attributes in Ri. I Identify the set N of dependencies in F+ that are not preserved, that is, not included in the closure of the union of Fi’s. I For each FD X A in N, create a relation XA and add it to the decomposition of R. I If N contains several FDs with the same left side X A 1, …, X An we can create one relation Ri(X, A 1, . An) 421 B: Database Systems - Functional Dependencies 30

Example q Concert(orchestra, conductor, city, date-of-concert, yearof-birth, opus, year-of-creation, date-of-world-premiere, composer) I e. g.

Example q Concert(orchestra, conductor, city, date-of-concert, yearof-birth, opus, year-of-creation, date-of-world-premiere, composer) I e. g. , (’Montreal Philharmonic Orchestra’, ‘L. Siegel’, ’Toronto’, ’ 16 -01 - 2006’, 1937, ’Piano Concert No. 1 b-moll Op. 23’, 1874, ’ 25 -10 -1875’, ’P. Tschaikowsky’) q shortcuts: I Or (orchestra), Con (conductor), Ci (city), Doc (date-of-concerts), Yob (year-of-birth), Op (opus), Yoc (year-of-creation), Dow (date-ofworld-premiere), Com (composer) q q q The following functional dependencies hold: Con --> Yob Or --> Con Op --> Yoc, Dow, Com Yoc, Com, Dow --> Op 421 B: Database Systems - Functional Dependencies 31

Example Author No Title Publisher ISBN Elmari 1 Fundamentals of DB systems Addison-Wesley 0

Example Author No Title Publisher ISBN Elmari 1 Fundamentals of DB systems Addison-Wesley 0 -8053 -1755 -4 Navathe 2 Fundamentals of DB systems Addison-Wesley Ramakrishnan 1 Database Management Systems Mc. Graw Hill 0 -07 -232206 -3 Gehrke Database Management Systems Mc. Graw Hill 0 -07 -232206 -3 2 0 -8053 -1755 -4 ISBN → publisher, title q ISBN, no → author q ISBN, author → no q q publisher, title, no → author. 421 B: Database Systems - Functional Dependencies 32