Data Warehouse IS 403 Chapter 6 Dimensional Data

























- Slides: 25
Data Warehouse IS 403 Chapter 6 Dimensional Data Modeling Dr Taleb Obaid 1
Objectives • What is a Schema? – Star Schema – Snow Flake Schema – Constellation Schema • • What is a dimensions? Understanding Degenerate Dimension Normalization and Denormalization Why do we need a surrogate key? Dr Taleb Obaid 2
What is a Schema? A schema is a collection of database objects that include views, indexes, tables and synonyms. A database view is a searchable object in a database that is defined by a query. Though a view doesn’t store data, some refer to a views as “virtual tables, ” you can query a view like you can a table. A view can combine data from two or more table, using joins, and also just contain a subset of information. This makes them convenient to abstract, or hide, complicated queries Dr Taleb Obaid 3
What is a Schema? The schema is designed based on the requisites of the project. • The star schema: data is organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or login. A dimension contains reference information about the fact, such as date, product, or customer. A star schema is diagramed by surrounding each fact with its associated dimensions • snow flake schema, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. • galaxy schema, (fact constellation) (collection) A fact constellation has multiple fact tables. It is also known as galaxy schema Dr Taleb Obaid 4
Dimensions A dimension table consists of two columns, primary keys referencing the fact table and textual or definitive data. The dimensions are listed below: 1. Slowly Changing Dimension: The attributes of a dimension change according to time. The change is captured based on the requirements of the business. 2. Junk (rubbish) Dimension: A dimension consisting of various unrelated attributes is a junk dimension. This is created to prevent a large number of foreign keys in a fact table. 3. Inferred Dimension: A dimension record might not be available or ready when loading a fact. This can be solved by generating a surrogate key with Null for all the other attributes. 4. Conformed Dimension: A dimension used across several fact tables in a database is known as a conformed dimension. Dr Taleb Obaid 5
Dimensions Degenerate Dimension: where the dimension attributes are stored as part of a fact table and not in a separate dimension table is known as a degenerate dimension. Facts consist of measures that are numeric values and additive or semi-additive in nature. Additive Facts that can be aggregated through all the dimensions in a fact table are known as an additive fact. Semi-Additive Facts that can be summed up for some of the dimensions in the fact table but not the others are semi-additive facts. Non-Additive Facts A fact that cannot be summed up for any of the dimensions present in the fact table is a non-additive fact. Factless Fact A fact that contains no measures or facts is known as a factless fact table. Dr Taleb Obaid 6
Normalization and Denormalization Normalization • Normalization is a process implemented to minimize the redundancies or the repetition of data present in a database • The process involves dividing the larger tables into smaller tables with lesser redundancies • The smaller tables are interrelated to each other through well-defined relationships Denormalization • Denormalization is the converse process of normalization. • The process involves adding redundant data or grouping data to enhance the performance • Adding redundancies might seem to be a contradictory act, but joining several tables when executing a query (caused by normalization) might slow-down the database performance Dr Taleb Obaid 7
Normalization and Denormalization • Suppose a manufacturing company stores the employee details in a table emp_id emp_name emp_address emp_dept 101 Rick Delhi D 001 101 Rick Delhi D 002 123 Maggie Agra D 890 166 Glenn Chennai D 900 166 Glenn Chennai D 004 • The above table is not normalized Dr Taleb Obaid 8
Normalization and Denormalization First normal form (1 NF) • An attribute (column) of a table cannot hold multiple values. It should hold only atomic values. Example: Suppose a company wants to store the names and contact details of its employees. emp_id emp_name emp_address emp_mobile 101 Herschel New Delhi 8912312390 102 Jon Kanpur 881212 9900012222 103 Ron Chennai 7778881212 104 Lester Bangalore 9990000123 8123450987 • Two employees (Jon & Lester) are having two mobile numbers so the company stored them in the same field as you can see in the table above. • This table is not in 1 NF as the rule says “each attribute of a table must have atomic (single) values”, the emp_mobile values for employees Jon & Lester violates that rule. Dr Taleb Obaid 9
Normalization and Denormalization To make the table complies with 1 NF we should have the data like this: emp_id emp_name emp_address emp_mobile 101 Herschel New Delhi 8912312390 102 Jon Kanpur 881212 102 Jon Kanpur 9900012222 103 Ron Chennai 7778881212 104 Lester Bangalore 9990000123 104 Lester Bangalore 8123450987 Dr Taleb Obaid 10
Normalization and Denormalization Second normal form (2 NF) • A table is said to be in 2 NF if both the following conditions hold: • Table is in 1 NF (First normal form) • No non-prime attribute is dependent on the proper subset of any candidate key of table. • An attribute that is not part of any candidate key is known as nonprime attribute. Example: Suppose a school wants to store the data of teachers and the subjects they teach. They create a table that looks like this: • Since a teacher can teach more than one subjects, the table can have multiple rows for a same teacher. Dr Taleb Obaid 11
Normalization and Denormalization teacher_id subject teacher_age 111 Maths 38 111 Physics 38 222 Biology 38 333 Physics 40 111 38 Chemistry 40 The table is in 1 NF because each attribute has atomic values. However, it is not in 2 NF because non prime attribute teacher_age is dependent on teacher_id alone which is a proper subset of candidate key. This violates the rule for 2 NF as the rule says “no non-prime attribute is dependent on the proper subset of any candidate key of the table” Dr Taleb Obaid 12
Normalization and Denormalization To make the table complies with 2 NF we can break it in two tables like this: teacher_details table: teacher_id teacher_age 111 38 222 38 333 40 teacher_subject table: teacher_id subject 111 Maths 111 Physics 222 Biology 333 Physics Dr Taleb Obaid 333 Chemistry 13
Star Schema Star schema is one of the simplest schemas in a data warehouse. • A star schema has each of the dimensions represented in a single table. There are no hierarchies present between the dimensions • Consists of a single fact table surrounded by dimension tables. • There is only one join between the fact table and any one of the dimension tables • Called a star: a star schema because the diagram represents Dr Taleb Obaid 14
Star Schema Dr Taleb Obaid 15
Star Schema • Advantages of a Star Schema It has a high point of traceability and comprehensibility The data can be navigated easily. A single dimension table can be browsed to select the attributes to construct a query • Disadvantages of a Star Schema Redundancy in a dimension because of multiple storage of similar data A slower response in the query time due to the large dimension tables Dr Taleb Obaid 16
Snow Flake Schema • A snow flake schema is complex data warehouse model than a star schema. Dr Taleb Obaid 17
Snow Flake Schema A snow flake schema consists of at least one hierarchy among the dimension tables It consists of a fact table surrounded by dimension tables The dimension when normalized in a schema is known as a snow flake The snowflake schemas normalize dimensions to eliminate redundancy Advantages of a Snow Flake Schema The normalized structures are easier to maintain and update Small savings in storage space Disadvantages of a Snow Flake Schema Complexity of the schema does not seem user-friendly for the end users The navigation through the tables is difficult Dr Taleb Obaid 18
Constellation Schema The fact constellation schema is an extension of a star schema. The more refined applications may need multiple fact tables to share dimension tables. Different fact tables are explicitly assigned to the dimensions. This is advantageous for facts associated with a given dimension table and other facts with a deeper dimension level This schema resembles a collection of stars, and hence is called a galaxy schema or a fact constellation Dr Taleb Obaid 19
Why do we need a surrogate key? A surrogate key is a unique identifier used in databases for a modeled entity or an object. It is a unique key whose only significance is to act as the primary identifier of an object or entity. A primary key in a dimension is generally alphanumeric like ABF 635 which occupies lot of index space. This slows down index traversing Business keys are reused in a data warehouse. Maintaining versions becomes difficult when reusing the business keys Dr Taleb Obaid 20
Benefits of a surrogate key? Surrogate (replacement) keys are smaller integer numbers, hence they occupy a smaller index size Using surrogate keys can make dimension table handle the changes in it effectively. In cases where the business keys are reused surrogate keys act as a unique identifier Slowly Changing Dimensions(SCD) • Slowly changing dimensions determine how the historical changes in a dimension table are handled. Enabling a slowly changing dimension allows users to know which category of an item belongs to which date. • For example, people change their name due to several reasons. In order to know their names during a specific period of time we need to incorporate a slowly changing dimension Dr Taleb Obaid 21
Why do we need a surrogate key? Type 1 SCD is used when there is no need to maintain history in the database The new, changed data overwrites the old entries. This method is used for data that changes frequently due to misspells, trimming spaces and language specific characters The data is easier to maintain • For example, in the year 2013 the price of the product changes to $2000, then the values of the product price and year needs to be updated to the new values. • There is now way to find the rate of the product in the year 2012 since the values are over written as illustrated below Dr Taleb Obaid 22
Type 2 Slowly Changing Dimensions Type 2 SCD is used when there is a need to maintain the complete history in the database An additional dimension record is created and the segmentation between the old record values and the new values are easier to extract • For example, in the table below the product ID and the effective date time act as composite primary keys. Hence, there is no violation of the primary key here. Moreover, the product’s effective date time and expiry date time adds more clarity and improves the scope of the table Dr Taleb Obaid 23
Type 3 Slowly Changing Dimensions Type 3 SCD is used when there is a need to maintain the partial history in the database An ‘old’ or ‘new’ column is created that stores the immediate previous attribute. This is the least commonly used technique • For example, the latest update to the changed values can be seen. A new column is added to track the changes. We can see the current and the previous price of the product Dr Taleb Obaid 24
Role of SCDs in a Data warehouse The techniques discussed above can impact the data warehouse profoundly (deeply), both physically as well as logically The data warehouse architect has to analyze each particular scenario and decide the most efficient way to implement them The type 2 technique is the most frequently used in a data warehouse. Managers and business analysts would always be interested in the whereabouts (location) of the information. The types 2 provides this data that is critical for making important decisions Dr Taleb Obaid 25