Data Analysis OLTP and OLAP Data Warehouse SQL

  • Slides: 20
Download presentation
Data Analysis OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining Bogdan

Data Analysis OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining Bogdan Shishedjiev Data Analysis 1

Data Processing • Data processing types – OLTP (On Line Transaction Processing) – OLAP

Data Processing • Data processing types – OLTP (On Line Transaction Processing) – OLAP (On-line Analytical Processing ) • Database types – Transactional • Numerous users • Dynamic (flickering) • Always maintaining current state • Critical (very loaded) – Warehouses • Aa few users (analyzers) • Relatively stable • Maintaining the data history (all states in the time) • Not loaded Bogdan Shishedjiev Data Analysis 2

Architecture Bogdan Shishedjiev Data Analysis 3

Architecture Bogdan Shishedjiev Data Analysis 3

Architecture • Source compnents – Filter – separate and assure the coherency of data

Architecture • Source compnents – Filter – separate and assure the coherency of data to be exported – Export – do the transfer of data portions in precise moments of time. • Warehouse components – – – Loader – Initial loading and preparing the warehouse. Refresh – loads the portions достъп data mining Export – to other warehouses. This creates an hierarchy of warehouses Bogdan Shishedjiev Data Analysis 4

Relational Schemes • Star Bogdan Shishedjiev Data Analysis 5

Relational Schemes • Star Bogdan Shishedjiev Data Analysis 5

Relational Schemes • Snowflake Bogdan Shishedjiev Data Analysis 6

Relational Schemes • Snowflake Bogdan Shishedjiev Data Analysis 6

Data Warehouse Design • Stages – Choose the activity processes to model – Choose

Data Warehouse Design • Stages – Choose the activity processes to model – Choose the granularity of the activity procesus – Choose the dimensions that can be applied to every record of the fact table. – Choose the facts that must be recorded in the fact table Bogdan Shishedjiev Data Analysis 7

Data Warehouse Design • Fact types – the most valuable are numerical continuous values

Data Warehouse Design • Fact types – the most valuable are numerical continuous values – Additive – they can be added along all dimensions (Money amounts) – Semi-additive – they can be added along some of dimensions (Precipitations, Product quantities) – Non-additive – they cannot be added along any dimension (Wind speed, wind direction) Bogdan Shishedjiev Data Analysis 8

Data Warehouse Design • Recommendations – – Use continuous additive numerical values The fact

Data Warehouse Design • Recommendations – – Use continuous additive numerical values The fact table is highly normalized Don’t normalize the dimensions. The gain is < 1% Design thoroughly the dimension attributes. Most often they are textual and discrete. They are used as headings and constraint sources in the answers to users Bogdan Shishedjiev Data Analysis 9

Dimension time_key day_of_ week day_no_in_month day_no_overall week_no_ln_year week_no_overall month_no_overall quarter fiscal_period holiday_flag weekday_flag last_day_in_month_flag

Dimension time_key day_of_ week day_no_in_month day_no_overall week_no_ln_year week_no_overall month_no_overall quarter fiscal_period holiday_flag weekday_flag last_day_in_month_flag season event Dimension shop stoie_key store_ name store_number store_street_address store_city store_county store_state store_zip store__manager store_phone store_FАX floor_plan_ type photo_processing_type finance_services_type first_opened_date last_remodel_date store_surface grocery_surface frozen-surface. . and others Example – Hypermarket Chain Facts - Sales Time_key product_key Store_key promotion key dollar_sales units_sales dollar_cost customer-count Bogdan Shishedjiev Data Analysis Dimension product_key SKU(stock keeping units )_description SKU_number package_size brand subcategory departement package_ type diet_type weight_unit_of_mesure units_per_retail_case units_per_shipping_cases_per_pallet shelf_width shelf_height shelf_depth. . and others Promotion promotion_key promotion_name price__reduction_ type ad_type dlsplay_ type coupon_type ad_media_name display_provider promo_cost promo_begin_date promo_end-date. . and others 10

Hypermarket Chain • Fact table – Granularity – each sell of a product (SKU

Hypermarket Chain • Fact table – Granularity – each sell of a product (SKU – Stock Keeping Unit) – Values – Total cost (additive), SKU quantity (semiadditive), price (non-additive), customer count (non additive) • Dimensions – – Time Product Store Promotion Bogdan Shishedjiev Data Analysis 11

Hypermarket Chain • Calculation of disk space needed – Dimension time : 2 years

Hypermarket Chain • Calculation of disk space needed – Dimension time : 2 years x 365 days = 730 days – Dimension shop : 300 shops, everyday records – Dimension product : 30. 000 products in each shop; 3000 are sold every day in each shop. – Dimension promotion : An article can participate in only one promotion in a shop during one day. – Elementary fact records 300 x 730 x 3000 x 1 = 657. 106 records – Key field number 4; Value field number 4 ; Total number osf fields =8 – Fact table size - 657. 106 x 8 fields x 4 B = 21 GB Bogdan Shishedjiev Data Analysis 12

Data Operations for Data Analysis • General form of a SQL statement select D

Data Operations for Data Analysis • General form of a SQL statement select D 1. C 1, . . . Dn. Cn, Aggr 1(F, Cl), …, Aggrn(F, Cn) from Fact as F, Dimension 1 as D 1, . . . Dimension. N as Dn where join-condition (F, D 1) and. . . and join-condition (F, Dn) and selection-condition group by D 1. C 1, . . . Dn. Cn order by D 1. C 1, . . . Dn. C Bogdan Shishedjiev Data Analysis 13

Data Operations for Data Analysis • Example select Time. Month, Product. Name, sum(Qty) from

Data Operations for Data Analysis • Example select Time. Month, Product. Name, sum(Qty) from Sale, Time, Product, Promotion where Sale. Time. Code = Time. Code and Sale. Product. Code = Product. Code and Sale. Promo. Code = Promotion. Promo. Code and (Product. Name = ' Pasta' or Product. Name = 'Oil') and Time. Month between 'Feb' and 'Apr' and Promotion. Name = 'Super. Saver' group by Time. Month, Product. Name order by Time. Month, Product. Name pivot Time. Month Feb Mar Apr Oil 5 K 5 K 7 K Pasta 45 K 50 K 51 K Bogdan Shishedjiev Data Analysis 14

Data Cube The cube is used to represent data along some measure of interest.

Data Cube The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2 -dimensional, 3 -dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest. Bogdan Shishedjiev Data Analysis 15

Data Cube • Data cube representation Co u n t Combination Count {P 3,

Data Cube • Data cube representation Co u n t Combination Count {P 3, Vancouver, Richard} 9 {P 1, Calgary, Vance} 2 {P 2, Calgary, Vance} 4 {P 4, Vancouver, Richard} 2 {P 3, Calgary, Vance} 1 {P 5, Vancouver, Richard} 9 {P 1, Toronto, Vance} 5 {P 1, Calgary, Richard} 2 {P 3, Toronto, Vance} 8 {P 2, Calgary, Richard} 1 {P 5, Toronto, Vance} 2 {P 5, Montreal, Vance} 5 {P 3, Calgary, Richard} 4 {P 1, Vancouver, Bob} 3 {P 2, Calgary, Allison} 2 {P 3, Vancouver, Bob} 5 {P 3, Calgary, Allison} 1 {P 5, Vancouver, Bob} 1 {P 1, Toronto, Allison} 2 {P 1, Montreal, Bob} 3 {P 2, Toronto, Allison} 3 {P 3, Montreal, Bob} 8 {P 3, Toronto, Allison} 6 {P 4, Montreal, Bob} 7 {P 4, Toronto, Allison} 2 {P 5, Montreal, Bob} 3 {P 2, Vancouver, Richard} 11 Bogdan Shishedjiev Data Analysis 16

Data Cube • Totals - the value ANY or ALL or NULL Bogdan Shishedjiev

Data Cube • Totals - the value ANY or ALL or NULL Bogdan Shishedjiev Data Analysis 17

Data Cube • Drill down – adding a dimension for more detailed results Time.

Data Cube • Drill down – adding a dimension for more detailed results Time. Month Product. Name sum(Qty) Feb Pasta Mar Apr TIme. Monih Product. Name Zone sum(Qty) 48 K Feb Pasta North 18 K Pasta 50 K Feb Pasta Centre 18 K Pasta 51 K Feb Pasta South 12 K Mar Pasta North 18 K Mar Pasta Centre 18 K Mar Pasta South 14 K Apr Pasta North 18 K Apr Pasta Centre 17 K Apr Pasta South 16 K Bogdan Shishedjiev Data Analysis 18

Data Cube • Roll-up - removing dimension TIme. Monih Product. Name Zone sum(Qty) Feb

Data Cube • Roll-up - removing dimension TIme. Monih Product. Name Zone sum(Qty) Feb Pasta North 18 K Feb Pasta Centre 18 K Feb Pasta South 12 K Mar Pasta North 18 K Mar Pasta Centre 18 K Mar Pasta South 14 K Apr Pasta North 18 K Apr Pasta Centre 17 K Apr Pasta South 16 K Product. Name Zone sum(Qty) Pasta North 54 K Pasta Centre 53 K Pasta South 42 K Bogdan Shishedjiev Data Analysis 19

Data Cube • The whole data cube TIme. Monih Feb Feb Mar Mar Apr

Data Cube • The whole data cube TIme. Monih Feb Feb Mar Mar Apr Apr ALL ALL Feb Mar Apr ALL Product. Name Pasta Pasta Pasta Pasta ALL Zone North Centre South ALL ALL ALL Bogdan Shishedjiev Data Analysis sum(Qty) 18 K 12 K 18 K 14 K 18 K 17 K 16 K 54 K 53 K 42 K 48 K 50 K 51 K 149 K 20