Data Normalization Text book Chapter 3 Jerry Post
Data Normalization Text book Chapter 3: Jerry Post Copyright © 2003 1
Introduction l. A Database is a powerful tool. It provides many advantages over traditional programming. However you get these advantages only if you design the database correctly. 2
What is data normalization l It is to split your data into several tables that will be connected to each other based on the data within them l Before data can be normalized you must Understand the business rules l Your tables must match the business rules l 3
Primary and composite keys l Primary Key A column which can uniquely identify a row in a table. E. g. Iqama Number, Saudi Id etc. l Composite l Key If a table is using more than one column as the part of the primary key, is called composite key 4
Identifying Key Columns Order. ID 8367 8368 Date 5 -5 -04 5 -6 -04 Customer 6794 9263 Each order has only one customer. So Customer is not part of the key. Order. Items Order. ID 8367 8368 Item 229 253 876 555 229 Quantity 2 4 1 Each order has many items. Each item can appear on many orders. So Order. ID and Item are both part of the key. 5
Identifying Key Columns If you are uncertain about which columns to key. Write them down and evaluate the business rules. Order. ID Customer. ID For a given order, can there ever be more than one customer? If yes, then key Customer. ID. In most businesses, only one customer per order, so do not key it. For a given customer, can there ever be more than one order? If yes, then key Order. ID, otherwise, do not key it. All businesses hope to get more than one order from a customer, so Order. ID must be key. 6
Surrogate Keys Real world keys sometimes cause problems in a database. l Example: Customer l l l Often best to let the DBMS generate unique values l l Avoid phone numbers: people may notify you when numbers change. Access: Auto. Number SQL Server: Identity Oracle: Sequences (but require additional rogramming) Drawback: Numbers are not related to any business data, so the application needs to hide them and provide other look up mechanisms. 7
Problems with Repeating Sections Rental. Form(Trans. ID, Rent. Date, Customer. ID, Phone, Name, Address, City, State, Zip. Code, (Video. ID, Copy#, Title, Rent ) ) Storing data in this raw form would not work very well. For example, repeating sections will cause problems. Repeating Section Note the duplication of data. Causes duplication Also, what if a customer has not yet checked out a movie--where do we store that customer’s data? Trans. ID 1 1 2 2 2 3 3 3 4 4 Rent. Date 4/18/04 4/30/04 4/18/04 4/18/04 Customer. ID 3 3 7 7 7 8 8 8 3 3 Last. Name Washington Lasater Jones Washington Phone 502 -777 -7575 615 -888 -4474 615 -452 -1162 502 -777 -7575 Address 95 Easy Street 67 S. Ray Drive 867 Lakeside Drive 95 Easy Street Video. ID 1 6 8 2 6 9 15 4 3 8 13 17 Copy# 2 3 1 1 1 1 1 Title 2001: A Space Odyssey Clockwork Orange Hopscotch Apocalypse Now Clockwork Orange Luggage Of The Gods Fabulous Baker Boys Boy And His Dog Blues Brothers Hopscotch Surf Nazis Must Die Witches of Eastwick Rent $1. 50 $2. 00 $1. 50 $2. 00 $2. 50 $2. 00 $1. 50 $2. 00 8
First Normal Form Problems (Data) Trans. ID 1 2 3 4 Rent. Date 4/18/04 4/30/04 4/18/04 Cust. ID 3 7 8 3 Phone 502 -777 -7575 615 -888 -4474 615 -452 -1162 502 -777 -7575 Last. Name Washington Lasater Jones Washington 1 NF splits repeating groups l Still have problems First. Name Elroy Les Charlie Elroy Address 95 Easy Street 67 S. Ray Drive 867 Lakeside Drive 95 Easy Street City Smith's Grove Portland Castalian Springs Smith's Grove State KY TN TN KY Zip. Code 42171 37148 37031 42171 l l Replication Hidden dependency: If a video has not been rented yet, then what is its title? Trans. ID 1 1 2 2 2 3 3 3 4 4 Video. ID 1 6 8 2 6 9 15 4 3 8 13 17 Copy# 2 3 1 1 1 1 1 Title 2001: A Space Odyssey Clockwork Orange Hopscotch Apocalypse Now Clockwork Orange Luggage Of The Gods Fabulous Baker Boys Boy And His Dog Blues Brothers Hopscotch Surf Nazis Must Die Witches of Eastwick Rent $1. 50 $2. 00 $1. 50 $2. 00 $2. 50 $2. 00 $1. 50 $2. 00 9
Second Normal Form A relation is in second normal form (2 NF) if and only if it is in 1 NF and every non key attribute is fully dependent on the primary key 10
Second Normal Form Example (Data) Videos. Rented(Trans. ID, Video. ID, Copy#) Trans. ID 1 1 2 2 2 3 3 3 4 4 Video. ID 1 6 2 6 8 4 9 15 3 8 13 17 Copy# 2 3 1 1 1 1 1 Videos(Video. ID, Title, Rent) Video. ID 1 2 3 4 5 6 7 8 Title 2001: A Space Odyssey Apocalypse Now Blues Brothers Boy And His Dog Brother From Another Planet Clockwork Orange Gods Must Be Crazy Hopscotch Rent $1. 50 $2. 00 $2. 50 $2. 00 $1. 50 (Unchanged) Rental. Form 2(Trans. ID, Rent. Date, Customer. ID, Phone, Name, Address, City, State, Zip. Code) 11
Second Normal Form Example Rental. Line(Trans. ID, Video. ID, Copy#, Title, Rent) Videos. Rented(Trans. ID, Video. ID, Copy#) Videos(Video. ID, Title, Rent) Title depends only on Video. ID l Each Video. ID can have only one title l Rent depends on Video. ID l This statement is actually a business rule. l It might be different at different stores. l Some stores might charge a different for each video depending on the day (or time). l Each non-key column depends on the key. l 12
Second Normal Form Problems (Data) Rental. Form 2(Trans. ID, Rent. Date, Customer. ID, Phone, Name, Address, City, State, Zip. Code) Trans. ID 1 2 3 4 Rent. Date 4/18/04 4/30/04 4/18/042 Cust. ID 3 7 8 3 Phone 502 -777 -7575 615 -888 -4474 615 -452 -1162 502 -777 -7575 Last. Name Washington Lasater Jones Washington First. Name Elroy Les Charlie Elroy Address 95 Easy Street 67 S. Ray Drive 867 Lakeside Drive 95 Easy Street City Smith's Grove Portland Castalian Springs Smith's Grove State KY TN TN KY Zip. Code 42171 37148 37031 42171 Even in 2 NF, problems remain l Replication l Hidden dependency l If a customer has not rented a video yet, where do we store their personal data? l Solution: split table. l 13
Third Normal Form Definition Depend on Trans. ID Rental. Form 2(Trans. ID, Rent. Date, Customer. ID, Phone, Name, Address, City, State, Zip. Code) Depend only on Customer. ID l Each non-key column must depend on nothing but the key. l l l Some columns depend on columns that are not part of the key. Split those into a new table. Example: Customers name does not change for every transaction. l Dependence (definition) l If given a value for the key you always know the value of the property in question, then that property is said to depend on the key. 14
Third Normal Form Example Data Rentals(Trans. ID, Rent. Date, Customer. ID ) Trans. ID 1 2 3 4 Rent. Date 4/18/04 4/30/04 4/18/04 Customer. ID 3 7 8 3 Customers(Customer. ID, Phone, Name, Address, City, State, Zip. Code ) Customer. ID 1 2 3 4 5 6 7 8 9 10 Phone 502 -666 -7777 502 -888 -6464 502 -777 -7575 502 -333 -9494 502 -4746 615 -373 -4746 615 -888 -4474 615 -452 -1162 502 -222 -4351 502 -444 -2512 Last. Name Johnson Smith Washington Adams Rabitz Steinmetz Lasater Jones Chavez Rojo First. Name Martha Jack Elroy Samuel Victor Susan Les Charlie Juan Maria Address City 125 Main Street Alvaton 873 Elm Street Bowling Green 95 Easy Street Smith's Grove 746 Brown Drive Alvaton 645 White Avenue Bowling Green 15 Speedway Drive. Portland 67 S. Ray Drive Portland 867 Lakeside Drive Castalian Springs 673 Industry Blvd. Caneyville 88 Main Street Cave City State KY KY KY TN TN TN KY KY Zip. Code 42122 42101 42171 42122 42102 37148 37031 42721 42127 15
Third Normal Form Tables (3 NF) Rentals Customers * 1 Trans. ID Rent. Date Customer. ID Phone Last. Name First. Name Address City State Zip. Code 1 * Videos. Rented Trans. ID Video. ID Copy# * Videos 1 Video. ID Title Rentals(Trans. ID, Rent. Date, Customer. ID ) Customers(Customer. ID, Phone, Name, Address, City, State, Zip. Code ) Videos. Rented(Trans. ID, Video. ID, Copy#) Videos(Video. ID, Title, Rent) 16
3 NF Rules/Procedure l Split out repeating sections l l Verify that the keys are correct l l Be sure to include a key from the parent section in the new piece so the two parts can be recombined. Is each row uniquely identified by the primary key? Are one-to-many and many-to-many relationships correct? Check “many” for keyed columns and “one” for non-key columns. Make sure that each non-key column depends on the whole key and nothing but the key. l No hidden dependencies. 17
Fourth Normal Form (Keys) Employee. Tasks(EID, Specialty, Tool. ID) Problem arise when there are two binary relationships l In some cases, there are hidden relationships between key properties. l Example: l l Employee. Tasks(EID, Specialty, Tool. ID) In 3 NF now. Employee. Tools(EID, Tool. ID)) Employee. Specialty(EID, Specialty)) Business Rules Each employee has many specialties. l Each employee has many tools. Tools and are unrelated l 18
Domain-Key Normal Form (DKNF) l This describes the ultimate goal in designing a database l If a table is in DKNF it must also be in 4 NF, 3 NF, and all of the other normal forms l The catch is that there is no defined method to get a table into DKNF. l In fact, it is possible that some tables can never be converted to DKNF 19
DKNF (Continues) The goal of DKNF is to have each table represent one topic l All business rules are explicitly described by a table rules. For example prices cannot be negative etc. l All other business rules must be expressed in terms of relationships with keys l In particular, there can be no hidden relationships l 20
No Hidden Dependencies The simple normalization rules: l Remove repeating sections l Each non-key column must depend on the whole key and nothing but the key. l There must be no hidden dependencies. l Solution: Split the table. l Make sure you can rejoin the two pieces to recreate the original data relationships. l For some hidden dependencies within keys, double-check the business assumption to be sure that it is realistic. Sometimes you are better off with a more flexible assumption. 21 l
Create Tables with SQL CREATE TABLE Customer ( Customer. ID NUMBER(38), Last. Name NVARCHAR 2(25), First. Name NVARCHAR 2(25), Phone NVARCHAR 2(25), Email NVARCHAR 2(120), Address NVARCHAR 2(50), City NVARCHAR 2(50), State NVARCHAR 2(25), ZIP NVARCHAR 2(15), Gender NVARCHAR 2(15), Date. Of. Birth DATE, CONSTRAINT pk_Customer PRIMARY KEY (Customer. ID), CONSTRAINT ck_Cust. Gender CHECK (Upper(Gender) IN ('FEMALE', 'UNIDENTIFIED')) ); 22
Data Rules and Integrity l Simple business rules l Limits on data ranges l l Price > 0 Salary < 100, 000 Date. Hired > 1/12/1995 Choosing from a set l l l Order Gender = M, F, Unknown Jurisdiction=City, County, State, Federal O# 1173 1174 1185 1190 1192 Referential Integrity l l l Foreign key values in one table must exist in the master table. Order(O#, Odate, C#, …) C# must exist in the customer table. Odate 1 -4 -97 1 -5 -97 1 -8 -97 1 -9 -97 C# 321 938 337 321 776 … Customer C# 321 337 938 Name Jones Sanchez Carson Phone 998377388738 - … 23
SQL Foreign Key (Oracle, SQL Server) CREATE TABLE Order ( OID NUMBER(9) NOT NULL, Odate DATE, CID NUMBER(9), CONSTRAINT pk_Order PRIMARY KEY (OID), CONSTRAINT fk_Order. Customer FOREIGN KEY (CID) REFERENCES Customer (CID) ON DELETE CASCADE ); 24
Relationships: Department and Employee Department Employee 1… 1 Department Description Reference Table 1…* Employee. ID Taxpayer. ID Last. Name First. Name Address Phone City State ZIP Department Foreign Key 25
Estimating Database Size Customer. ID Last. Name First. Name Phone Email Address State ZIP Gender Date. Of. Birth Long Text(50) Text(150) Text(50) Text(15) Date 4 30 20 24 50 50 2 14 10 8 Average bytes per customer 212 Customers per week (winter) *200 Weeks (winter) *25 Bytes added per year 1, 060, 000 26
Data Assumptions 200 customers per week for 25 weeks 2 skills per customer 2 rentals per customer per year 3 items per rental 20 percent of customers buy items 4 items per sale 100 manufacturers 20 models per manufacturer 5 items (sizes) per model 27
Database Table Sizes 28
- Slides: 28