DATABASE NORMALIZATION What is Normalization Normalization allows us

DATABASE NORMALIZATION

What is Normalization? • Normalization allows us to organize data so that it: – Allows faster access (dependencies make sense) – Reduced space (less redundancy) § Database normalization relates to the level of redundancy in a relational database’s structure. § The key idea is to reduce the chance of having multiple different version of the same data. § Any increase in normalization generally involves splitting existing tables into multiple ones, which must be re-joined each time a query is issued.

Normal Form • Normalization is done through changing or transforming data • • into various Normal Forms. Edgar F. Codd originally established three normal forms: 1 NF, 2 NF and 3 NF is widely considered to be sufficient are also in BCNF (Boyce-Codd Normal Form). Full normalization is considered a good exercise to help discover all potential internal database consistency problems. For a database to be in a normal form, it must meet all requirements of the previous forms: – Eg. For a database to be in 2 NF, it must already be in 1 NF. For a database to be in 3 NF, it must already be in 1 NF and 2 NF.

Sample Data n This data has some problems: • The Employees column is not atomic. n A column must be atomic, meaning that it can only hold a single item of data. This column holds more than one employee name.

n Data that is not atomic means: • We can’t easily sort the data • We can’t easily search or index the data • We can’t easily change the data • We can’t easily reference the data in other tables

n Breaking the Employee column into more than 1 column doesn’t solve our problems: • The data may look atomic, but only because we have many identical columns storing a single piece of data instead of a single column storing many pieces of data.

n By the way, what would be a good choice of a Primary Key for this table?

First Normal Form • 1 NF means that we must: – Eliminate duplicate columns from the same table, and – Create separate tables for each group of related data into separate tables, each with a unique row identifier (primary key) • Let’s get started by making our columns atomic…

Atomic Data Ø By breaking each tuple of our table into an entry for each employee, we have made our data atomic. Ø What would be the primary key?

Primary Key Ø The best primary key would be the Employee column. Ø Every employee only has one manager, therefore an employee is unique.

First Normal Form Ø The fact that all our data and columns is atomic and we have a primary key means that we are in 1 NF!

First Normal Form Revised Ø Of course there may come a day when we hire a second employee or manager with the same name. To avoid this, let’s use an employee ID instead of their name.

1 NF: Before and After

Second normal Form ( 2 NF ) • A database in 2 NF must also be in 1 NF: – Data must be atomic – Every row (or tuple) must have a unique primary key. • Plus: – Subsets of data that apply to multiple rows (repeating data) are moved to separate tables

This data is in 1 NF: all fields are atomic and the Cust. ID serves as the primary key

Ø But let’s pay attention to the City, State, and Zip fields: l l There are 2 rows of repeating data: one for Chicago, and one for St. Paul. Both have the same city, state and zip code

• The Cust. ID determines all the data in the row, but U. S. Zip codes determines the City and State. (eg. A given Zip code can only belong to one city and state so storing Zip codes with a City and State is redundant). • This means that City and State are Functionally Dependent on the value in Zip code and not only the primary key.

• To be in 2 NF, this repeating data must be in its own table. • So: – Let’s create a Zip code table that maps Zip codes to their City and State. – Note that Canadian Postal Codes are different: the same city and state can have many different postal codes.

Zip Code Table Customer Table Our Data in 2 NF • We see that we can actually save 2 rows in the Zip Code table by removing these redundancies: 9 customer records only need 7 Zip code records. • Zip code becomes a foreign key in the customer table linked to the primary key in the Zip code table

Advantages of 2 NF • Saves space in the database by reducing redundancies. • If a customer calls, you can just ask them for their Zip code and you’ll know their city and state! (No more spelling mistakes). • If a City name changes, we only need to make one change to the database.

Summary So Far… • 1 NF: – All data is atomic – All rows have a unique primary key • 2 NF: – Data is in 1 NF – Subsets of data in multiple columns are moved to a new table – These new tables are related using foreign keys

Third Normal Form • To be in 3 NF, a database must be: – In 2 NF – All columns must be fully functionally dependent on the primary key (There are no transitive dependencies)

• In this table: – Customer. ID and Prod. ID depend on the Order. ID and no other column (good) – Stated another way, “If you know the Order. ID, you know the Cust. ID and the Prod. ID” • So: Order. ID Cust. ID, Prod. ID

• But there are some fields that are not dependent on Order. ID: – Total is the simple product of Price*Quantity. As such, has a transitive dependency to Price and Quantity. – Because it is a calculated value, doesn’t need to be included at all.

• We say that Price has a transitive dependency on Prod. ID and Quantity. – This means that Price isn’t just determined by the Order. ID. It is also determined by the size (or quantity) of the order (and of course what is ordered).

• We can see that all fields are dependent on Order. ID, the Primary Key (white lines).

• But Total is also determined by Price and Quantity (yellow lines) – This is a derived field (Price x Quantity = Total) – We can save a lot of space by getting rid of it altogether and just calculating total when we need it

• Price is also determined by both Prod. ID and Quantity rather than the primary key (red lines). This is called a transitive dependency. We must get rid of transitive dependencies to have 3 NF.

• We do this by moving the transitive dependency into a second table…

Ø By splitting out the table, we can quickly adjust our price table to meet our competitor, or if the prices changes from our suppliers.

• We’re now in 3 NF! • We can also quickly figure out what price to offer our customers for any quantity they want.

Boyce-Codd normal form (BCNF) • A table is said to be in the BCNF if and only if it is in the 3 NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. • In more informal terms, a table is in BCNF if it is in 3 NF and the only determinants are the candidate keys.

Fourth normal form (4 NF) • A table is said to be in 4 NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies.

Fifth normal form 5 NF • A table is said to be in the 5 NF if and only if it is in 4 NF and every join dependency in it is implied by the candidate keys.

Sixth normal form(6 NF) • This normal form was, as of 2005, only recently conjectured: the sixth normal form (6 NF) was only defined when extending the relational model to take into account the temporal dimension (ie. time). • Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal extensions to SQL are not relational.

THANK YOU!!!!!!!