Normalization CSC 240 Blum 1 Normalization Normalization is
Normalization CSC 240 (Blum) 1
Normalization • Normalization is a procedure within relational database design to ensure that 1. Each unit of data (value that a property takes on) is accessible to querying 2. Relationships are properly modeled 3. Data redundancy (unnecessary repetition) is minimized. CSC 240 (Blum) 2
Complementary Approaches • ER modeling is more of a top-down approach to database design. – One starts with an overview of the most generic categories of data. • Normalization is more of a bottom-up approach. – One starts with a collection of attributes and sees how they break into various tables. CSC 240 (Blum) 3
Unique Data Specification • A piece of data should be identifiable by specifying first its Table, then its column and its row within that table No further specification should be required. CSC 240 (Blum) 4
The data “Bart” is found in the Character Table in the First. Name Column in the 78 th row (better yet in the row with Character. ID SC 0003). CSC 240 (Blum) 5
Rule 1: No multi-valued fields • A “cell” in the table should hold only one unit of data. – If a cell contained more than one unit of data, there would be no way to specify one of these units within the relational (table) approach. • (Multi-valued fields can also cause storage problems, which are hidden from us when we work at the 4 GL level. ) CSC 240 (Blum) 6
person. ID f. Name l. Name characters SP 0004 Castellenata Homer Simpson Krusty the Clown Grampa Simpson Barney Gumble Itchy (the mouse) Mayor Quimby Groundskeeper Willy Scott Christian Hans Moleman Arnie Pie Sideshow Mel CSC 240 (Blum) Dan No multi-valued fields! 7
Flattening the file (table) • The easy way to eliminate multi-valued fields is to introduce multiple records, one for each value of the multi-valued field – All of the other fields would be the same. – This is called flattening. CSC 240 (Blum) 8
person. ID SP 0004 SP 0004 CSC 240 (Blum) f. Name Dan Dan l. Name Castellenata Castellenata character Homer Simpson Krusty the Clown Grampa Simpson Barney Gumble Itchy (the mouse) Mayor Quimby Groundskeeper Willy Scott Christian 9
First Normal Form • If all of the tables in a database are set up such that each entry (cell) of each table contains at most one piece of data (i. e. is single-valued), then the database is said to be in the First Normal Form (1 NF). – There is a series of criteria to apply, each more stringent than the next. There will be a second normal form, a third normal form and so on. CSC 240 (Blum) 10
What has been gained? • The advantage of flattening is that each piece of data is now accessible without further parsing. • “Barney Gumble” is no longer part of an entry but is its own entry. CSC 240 (Blum) 11
What has been lost? • The obvious price of flattening a file is redundancy (repetition of data). • In addition to the storage issues that repeated data causes (which are hidden from us), there are the so-called data anomalies. • These make maintaining the data’s integrity (correctness) more difficult. CSC 240 (Blum) 12
What’s the key? • Recall the key is used to identify (uniquely) a record and is important to querying. • Note that person. ID was the key to the table that had multi-valued fields. • But in this new table person. ID alone cannot be a key. CSC 240 (Blum) 13
Key consideration • The character field can be the key if we do not include the producers (who do not voice characters). • Or we could have a composite key. – One could introduce a no. Character character and a no. Actor person to solve the NULL problem. – Note that no part of a composite key should be NULL. CSC 240 (Blum) 14
Insertion Anomaly • Ignoring the issue of Maggie (who no one voices) and the producers (who do not voice characters) there’s another problem. • Since the character and real person information are both needed: – We cannot add a character without knowing the real person who provides the voice. – We cannot add a person without knowing at least one character he/she voices. • This type of problem is known as an insertion anomaly. CSC 240 (Blum) 15
Deletion Anomaly • The way the data is arranged now, if Dan Castellenata is fired and eliminated from the database, so too are all of the characters he voices eliminated. – This feature is sometimes desired and can be built in. • When this feature is not desired, but occurs because of the database design, it is known as a deletion anomaly. CSC 240 (Blum) 16
Update Anomaly D’oh • Isn’t his name Castellaneta, not Castellenata? • Because of all of the repetition caused by the flattening, the name must be changed in many places. – One either does a search to find all occurrences. – Or one runs the risk of having inconsistent data and consequently queries that don’t turn up complete and accurate information. • Inaccurate data caused by the inconsistent updating of redundant information is known as an update anomaly. CSC 240 (Blum) 17
person. ID SP 0004 SP 0004 CSC 240 (Blum) f. Name Dan Dan l. Name Castellaneta Castellanata Castelleneta Castellaneta character Homer Simpson Krusty the Clown Grampa Simpson Barney Gumble Itchy (the mouse) Mayor Quimby Groundskeeper Willy Scott Christian 18
More Tables, Less Redundancy • One can eliminate the data redundancy by introducing more tables. – This will simplify updating but can complicate querying. • The repeated data will be moved to its own table and in its place will be inserted a foreign key to identify the appropriate row of the new table. CSC 240 (Blum) 19
Splitting • The table – People. Voicing. Characters(person. ID, character. ID, person. FName, person. LName, regular, character. FName, character. LName, description) becomes two tables – Real. Person(person. ID, f. Name, l. Name, regular) – Character(character. ID, f. Name, l. Name, person. ID, description) CSC 240 (Blum) 20
Repetition cannot be completely eliminated • When People. Voicing. Characters became Real. Person and Character, the person. ID field appeared in both tables. This repetition is required to maintain the relationship between a character and a real person. – The decomposition minimized (but did not eliminate) the repetition. • Another feature of the split is that we can add unvoiced characters to the Character table and people who are not actors to the Real. Person table. CSC 240 (Blum) 21
Important Properties • Two important properties of decomposition: – One should be able to put the original table back together by using a join, thus no information is lost by the split. This is called the Lossless-join property. – Any constraint on the data imposed on the original table should be imposed on the resulting tables. This is called the Dependency preservation property. CSC 240 (Blum) 22
How to do the splits? • We would like to have a systematic approach for splitting tables in a way that minimizing redundancy. • This is where the notion of functional dependence comes in. – How the fields depend on each other determines where they end up when the table splits. CSC 240 (Blum) 23
Functional dependence • If for all of the records in a table, one can determine what value field B takes on simply by knowing what value field A has, then field B is said to be functionally dependent on field A. – Either of the fields above may be composite, that is, comprised of other fields (e. g. name is a composite field comprised of f. Name and l. Name fields). CSC 240 (Blum) 24
Derived attributes are functionally dependent • We called an attribute “derived” if it was determined by another attribute. – E. g. age or age. Category can be derived from date. Of. Birth • Thus age is functionally dependent on date. Of. Birth. • Functional dependence is not necessarily a twoway street – You can determine age. Category from date. Of. Birth but you cannot determine date. Of. Birth from age. Category. CSC 240 (Blum) 25
Other functional dependencies • A date. Of. Birth is functionally dependent on a soc. Sec. Num. – There’s no mathematical formula relating the two. – But the soc. Sec. Num identifies a person, and that person as a unique date. Of. Birth. – Knowing the soc. Sec. Num, one can determine the date. Of. Birth. CSC 240 (Blum) 26
All fields depend on the primary key. • Recall that a table’s primary key is a field or fields that can be used to identify uniquely each record belonging to that table. • If a field can be uniquely identified by the primary key then it is functionally dependent on the primary key. CSC 240 (Blum) 27
Composite primary keys • When the primary key is composite, the question arises: – Is a field uniquely determined by (functionally dependent on) part of the primary key or the whole primary key? • Data that is dependent on part of a primary key can be split off into another table. – Known as partial dependencies. CSC 240 (Blum) 28
Splitting off fields that depend on part of a key. • Imagine a Movie. Director table Movie. Director(movie. ID, director. ID, movie. Title, movie. Type, movie. Description, director. First. Name, director. Last. Name, director. Date. Of. Birth) – It would have a composite primary key comprised of movie. ID and director. ID. CSC 240 (Blum) 29
Depending on parts • The movie. Title, movie. Type, movie. Description fields depend only on the movie. ID portion of the primary key. • The director. First. Name, director. Last. Name and director. Date. Of. Birth fields depend only on the director. ID portion of the primary key. CSC 240 (Blum) 30
Movie. Director: Three-Way Split • We split off any fields that depend only on movie. ID. – Partial dependence (dependent on a part) • We split off any fields that depend only on director. ID. – Partial dependence • We are left with any fields that depend on both. – Full dependence CSC 240 (Blum) 31
Movie. Director: Three-Way Split • One gets: – Movie(movie. ID, movie. Title, movie. Type, movie. Description) – Director(director. ID, director. First. Name, director. Last. Name, director. Date. Of. Birth) – Movie. Director 2(movie. ID, director. ID) • (If you assume that a movie has one director, then the director. ID would be functionally dependent on the movie. ID, and you would only need two tables above. ) CSC 240 (Blum) 32
CSC 240 (Blum) 33
Two or Three-Way Split • In the Simpsons example, the People. Voicing. Characters became two tables, while in film example, the Movie. Director table became three tables. • The difference is that person. ID was functionally dependent on the character. ID, whereas there was no dependency between movie. Id and director. ID. – This agrees with the notion that many-to-many relationships between entities require their own tables. CSC 240 (Blum) 34
Second Normal Form • If a table has no multi-valued fields AND does not have any non-key fields which are functionally dependent on only part of the primary key, then such a table is said to be in the second normal form (2 NF). • The higher normal forms will be presented next time. CSC 240 (Blum) 35
References • Database Systems, Rob and Coronel • Database Systems, Connolly and Begg • Fundamentals of Relational Databases, Mata -Toledo and Cushman • Concepts of Database Management, Pratt and Adamski CSC 240 (Blum) 36
- Slides: 36