Lecture 10 Database Design and Relational Algebra Monday
Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003 1
Outline • Design of a Relational schema (3. 6) • Relational Algebra (5. 2) • Operations on bags (5. 3, 5. 4) – Reading assignment 5. 3 and 5. 4 (won’t have time to cover in class) 2
Relational Schema Design (or Logical Design) Main idea: • Start with some relational schema • Find out its FD’s • Use them to design a better relational schema 3
Data Anomalies When a database is poorly designed we get anomalies: Redundancy: data is repeated Updated anomalies: need to change in several places Delete anomalies: may lose data when we don’t want 4
Relational Schema Design Recall set attributes (persons with several phones): Name SSN Phone. Number City Fred 123 -45 -6789 206 -555 -1234 Seattle Fred 123 -45 -6789 206 -555 -6543 Seattle Joe 987 -65 -4321 908 -555 -2121 Westfield SSN Name, City but not SSN Phone. Number Anomalies: • Redundancy = repeat data • Update anomalies = Fred moves to “Bellevue” • Deletion anomalies = Joe deletes his phone number: what is his city ? 5
Relation Decomposition Break the relation into two: Name SSN Phone. Number City Fred 123 -45 -6789 206 -555 -1234 Seattle Fred 123 -45 -6789 206 -555 -6543 Seattle Joe 987 -65 -4321 908 -555 -2121 Westfield Name SSN City SSN Phone. Number Fred 123 -45 -6789 Seattle 123 -45 -6789 206 -555 -1234 Joe 987 -65 -4321 Westfield 123 -45 -6789 206 -555 -6543 987 -65 -4321 908 -555 -2121 Anomalies have gone: • No more repeated data • Easy to move Fred to “Bellevue” (how ? ) • Easy to delete all Joe’s phone number (how ? ) 6
Relational Schema Design Conceptual Model: name Product price Person buys name ssn Relational Model: plus FD’s Normalization: Eliminates anomalies 7
Decompositions in General R(A 1, . . . , An, B 1, . . . , Bm, C 1, . . . , Cp) R 1(A 1, . . . , An, B 1, . . . , Bm) R 2(A 1, . . . , An, C 1, . . . , Cp) R 1 = projection of R on A 1, . . . , An, B 1, . . . , Bm R 2 = projection of R on A 1, . . . , An, C 1, . . . , Cp 8
Decomposition • Sometimes it is correct: Name Price Category Gizmo 19. 99 Gadget One. Click 24. 99 Camera Gizmo 19. 99 Camera Name Price Name Category Gizmo 19. 99 Gizmo Gadget One. Click 24. 99 One. Click Camera Gizmo 19. 99 Gizmo Camera Lossless decomposition 9
Incorrect Decomposition • Sometimes it is not: Name Price Category Gizmo 19. 99 Gadget One. Click 24. 99 Camera Gizmo 19. 99 Camera What’s incorrect ? ? Name Category Price Category Gizmo Gadget 19. 99 Gadget One. Click Camera 24. 99 Camera Gizmo Camera 19. 99 Camera Lossy decomposition 10
Decompositions in General R(A 1, . . . , An, B 1, . . . , Bm, C 1, . . . , Cp) R 1(A 1, . . . , An, B 1, . . . , Bm) R 2(A 1, . . . , An, C 1, . . . , Cp) If A 1, . . . , An B 1, . . . , Bm Then the decomposition is lossless Note: don’t need necessarily A 1, . . . , An C 1, . . . , Cp Example: name price, hence the first decomposition is lossless 11
Normal Forms First Normal Form = all attributes are atomic Second Normal Form (2 NF) = old and obsolete Third Normal Form (3 NF) = this lecture Boyce Codd Normal Form (BCNF) = this lecture Others. . . 12
Boyce-Codd Normal Form A simple condition for removing anomalies from relations: A relation R is in BCNF if: If A 1, . . . , An B is a non-trivial dependency in R , then {A 1, . . . , An} is a key for R In English (though a bit vague): Whenever a set of attributes of R is determining another attribute, should determine all the attributes of R. 13
BCNF Decomposition Algorithm Repeat choose A 1, …, Am B 1, …, Bn that violates the BNCF condition split R into R 1(A 1, …, Am, B 1, …, Bn) and R 2(A 1, …, Am, [others]) continue with both R 1 and R 2 Until no more violations B’s R 1 A’s Others R 2 Is there a 2 -attribute relation that is not in BCNF ? 14
Example Name SSN Phone. Number City Fred 123 -45 -6789 206 -555 -1234 Seattle Fred 123 -45 -6789 206 -555 -6543 Seattle Joe 987 -65 -4321 908 -555 -2121 Westfield Joe 987 -65 -4321 908 -555 -1234 Westfield What are the dependencies? SSN Name, City What are the keys? {SSN, Phone. Number} Is it in BCNF? 15
Decompose it into BCNF Name SSN City Fred 123 -45 -6789 Seattle Joe 987 -65 -4321 Westfield SSN Phone. Number 123 -45 -6789 206 -555 -1234 123 -45 -6789 206 -555 -6543 987 -65 -4321 908 -555 -2121 987 -65 -4321 908 -555 -1234 SSN Name, City Let’s check anomalies: • Redundancy ? • Update ? • Delete ? 16
Summary of BCNF Decomposition Find a dependency that violates the BCNF condition: A 1, A 2, …, An B 1, B 2, …, Bm Heuristics: choose B 1 , B 2, … Bm“as large as possible” Decompose: Is there a 2 -attribute relation that is not in BCNF ? Others R 1 A’s B’s R 2 Continue until there are no BCNF violations left. 17
Example Decomposition Person(name, SSN, age, hair. Color, phone. Number) SSN name, age hair. Color Decompose in BCNF (in class): Step 1: find all keys (How ? Compute S+, for various sets S) Step 2: now decompose 18
Other Example • R(A, B, C, D) A B, B C • • Key: AD Violations of BCNF: A B, A C, A BC Pick A BC: split into R 1(A, BC) R 2(A, D) What happens if we pick A B first ? 19
Lossless Decompositions A decomposition is lossless if we can recover: R(A, B, C) Decompose R 1(A, B) R 2(A, C) Recover R’(A, B, C) should be the same as R(A, B, C) R’ is in general larger than R. Must ensure R’ = R 20
Lossless Decompositions • Given R(A, B, C) s. t. A B, the decomposition into R 1(A, B), R 2(A, C) is lossless 21
3 NF: A Problem with BCNF Unit Company Product FD’s: Unit Company; Company, Product Unit So, there is a BCNF violation, and we decompose. Unit Company Unit Product Unit Company No FDs Notice: we loose the FD: Company, Product Unit 22
So What’s the Problem? Unit Company Galaga 99 Bingo UW UW Unit Galaga 99 Bingo Product databases No problem so far. All local FD’s are satisfied. Let’s put all the data back into a single table again: Unit Galaga 99 Bingo Company UW UW Product databases Violates the dependency: company, product -> unit! 23
Solution: 3 rd Normal Form (3 NF) A simple condition for removing anomalies from relations: A relation R is in 3 rd normal form if : Whenever there is a nontrivial dependency A 1, A 2, . . . , An B for R , then {A 1, A 2, . . . , An } a super-key for R, or B is part of a key. Tradeoff: BCNF = no anomalies, but may lose some FDs 3 NF = keeps all FDs, but may have some anomalies 24
Relational Algebra • Formalism for creating new relations from existing ones • Its place in the big picture: Declartive query language SQL, relational calculus Algebra Relational algebra Relational bag algebra Implementation 25
Relational Algebra • Five operators: – – – Union: Difference: Selection: s Projection: P Cartesian Product: • Derived or auxiliary operators: – Intersection, complement – Joins (natural, equi-join, theta join, semi-join) – Renaming: r 26
1. Union and 2. Difference • R 1 R 2 • Example: – Active. Employees Retired. Employees • R 1 – R 2 • Example: – All. Employees -- Retired. Employees 27
What about Intersection ? • • It is a derived operator R 1 R 2 = R 1 – (R 1 – R 2) Also expressed as a join (will see later) Example – Unionized. Employees Retired. Employees 28
3. Selection • Returns all tuples which satisfy a condition • Notation: sc(R) • Examples – s. Salary > 40000 (Employee) – sname = “Smithh” (Employee) • The condition c can be =, <, , >, , <> 29
Find all employees with salary more than $40, 000. s Salary > 40000 (Employee) 30
4. Projection • Eliminates columns, then removes duplicates • Notation: P A 1, …, An (R) • Example: project social-security number and names: – P SSN, Name (Employee) – Output schema: Answer(SSN, Name) 31
P SSN, Name (Employee) 32
5. Cartesian Product • Each tuple in R 1 with each tuple in R 2 • Notation: R 1 R 2 • Example: – Employee Dependents • Very rare in practice; mainly used to express joins 33
34
Relational Algebra • Five operators: – – – Union: Difference: Selection: s Projection: P Cartesian Product: • Derived or auxiliary operators: – Intersection, complement – Joins (natural, equi-join, theta join, semi-join) – Renaming: r 35
Renaming • Changes the schema, not the instance • Notation: r B 1, …, Bn (R) • Example: – r. Last. Name, Soc. No (Employee) – Output schema: Answer(Last. Name, Soc. No) 36
Renaming Example Employee Name John Tony r. Last. Name, Soc. No Last. Name John Tony SSN 99999 77777 (Employee) Soc. No 99999 77777 37
Natural Join • Notation: R 1 ⋈ R 2 • Meaning: R 1 ⋈ R 2 = PA(s. C(R 1 R 2)) • Where: – The selection s. C checks equality of all common attributes – The projection eliminates the duplicate common attributes 38
Natural Join Example Employee Name John Tony SSN 99999 77777 Dependents SSN 99999 77777 Dname Emily Joe Employee Dependents = PName, SSN, Dname(s SSN=SSN 2(Employee x r. SSN 2, Dname(Dependents)) Name John Tony SSN 99999 77777 Dname Emily Joe 39
Natural Join • R= • R ⋈ S= A B X S= B C Y Z U X Z V W Y Z Z V A B C X Z U X Z V Y Z U Y Z V W 40
Natural Join • Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ? • Given R(A, B, C), S(D, E), what is R ⋈ S ? • Given R(A, B), S(A, B), what is R ⋈ S ? 41
Theta Join • A join that involves a predicate • R 1 ⋈ q R 2 = s q (R 1 R 2) • Here q can be any condition 42
Eq-join • A theta join where q is an equality • R 1 ⋈A=B R 2 = s A=B (R 1 R 2) • Example: – Employee ⋈SSN=SSN Dependents • Most useful join in practice 43
Semijoin • R ⋉ S = P A 1, …, An (R ⋈ S) • Where A 1, …, An are the attributes in R • Example: – Employee ⋉ Dependents 44
Semijoins in Distributed Databases • Semijoins are used in distributed databases Dependents Employee SSN Name . . . SSN network . . . Dname Age. . . Employee ⋈ssn=ssn (s age>71 (Dependents)) R = Employee ⋉ T T = P SSN s age>71 (Dependents) 45 Answer = R ⋈ Dependents
Complex RA Expressions P name buyer-ssn=ssn pid=pid seller-ssn=ssn Person Purchase P pid sname=fred sname=gizmo Person Product 46
Operations on Bags A bag = a set with repeated elements All operations need to be defined carefully on bags • {a, b, b, c} {a, b, b, b, e, f, f}={a, a, b, b, b, c, e, f, f} • {a, b, b, b, c, c} – {b, c, c, c, d} = {a, b, b, d} • s. C(R): preserve the number of occurrences • PA(R): no duplicate elimination • Cartesian product, join: no duplicate elimination Important ! Relational Engines work on bags, not sets ! Reading assignment: 5. 3 – 5. 4 47
Finally: RA has Limitations ! • Cannot compute “transitive closure” Name 1 Name 2 Relationship Fred Mary Father Mary Joe Cousin Mary Bill Spouse Nancy Lou Sister • Find all direct and indirect relatives of Fred • Cannot express in RA !!! Need to write C program 48
- Slides: 48