Data Warehouses and Data Mining INFSY 445 Fall

  • Slides: 29
Download presentation
Data Warehouses and Data Mining INFSY 445 Fall 2005

Data Warehouses and Data Mining INFSY 445 Fall 2005

Data Warehouses n Relational database systems designed to store large amounts of data n

Data Warehouses n Relational database systems designed to store large amounts of data n Structure cases problems when complex queries are run n Performance problems n Add indexes to speed retrieval, slows transaction processing

Data Warehouses n Conflict resulted in: 4 Online transaction processing (OLTP) 4 Online analytical

Data Warehouses n Conflict resulted in: 4 Online transaction processing (OLTP) 4 Online analytical processing (OLAP) i. e. , data warehouse

Data Warehouses n Heavily indexed for data retrieval & analysis n Correlations and statistical

Data Warehouses n Heavily indexed for data retrieval & analysis n Correlations and statistical analysis

Sequential Files n Sequential files 4 Time consuming to search 4 Processing intense to

Sequential Files n Sequential files 4 Time consuming to search 4 Processing intense to add new row 4 Deleting row done by marking record and removing it during reorganization ID Last. Name First. Name Date. Hired 1 Reeves Keith 1/29/98 2 Gibson Bill 3/31/98 3 Reasoner Katy 2/17/98 4 Hopkins Alan 2/8/98 5 James Leisha 1/6/98 6 Eaton Anissa 8/23/98 7 Farris Dustin 3/28/98 8 Carpenter Carlos 12/29/98 9 O'Connor Jessica 7/23/98 Howard 7/13/98 10 Shields

Binary Search n Like looking through a dictionary or phone book 4 Start at

Binary Search n Like looking through a dictionary or phone book 4 Start at middle 4 Decide which half to continue with 4 Binary search continues to divide data in half until desired record is found

Pointers n Data not stored in tables actually n Data is stored on drive

Pointers n Data not stored in tables actually n Data is stored on drive (or RAM) 4 Operating System allocates space 4 Provides location/address • Physical address • Virtual address (VSAM) – Imaginary drive values mapped to physical locations. • Relative address – Distance from start of file

Pointers ID 1 2 3 4 5 6 7 8 9 10 Pointer A

Pointers ID 1 2 3 4 5 6 7 8 9 10 Pointer A 11 A 22 A 32 A 47 A 58 A 63 A 67 A 78 A 83 Last. Name Pointer Carpenter A 67 Eaton A 58 Farris A 63 Gibson A 22 Hopkins A 42 James A 47 O'Connor A 78 Reasoner A 32 Reeves A 11 Shields A 83 Address A 11 A 22 A 32 A 47 A 58 A 63 A 67 A 78 A 83 ID 1 2 3 4 5 6 7 8 9 10 Last. Name Reeves Gibson Reasoner Hopkins James Eaton Farris Carpenter O'Connor Shields First. Name Date. Hired Keith 1/29/98 Bill 3/31/98 Katy 2/17/98 Alan 2/8/98 Leisha 1/6/98 Anissa 8/23/98 Dustin 3/28/98 Carlos 12/29/98 Jessica 7/23/98 Howard 7/13/98 Indexed for ID and Last. Name

Pointers n Indexes allow search on 4 Primary key 4 Build index on any

Pointers n Indexes allow search on 4 Primary key 4 Build index on any column plan to search • CREATE INDEX command 4 Improve performance n Problems 4 Maintain indexes as rows are added and deleted 4 Process slows performance

Indexes n Index columns used to join tables n Test application with large amounts

Indexes n Index columns used to join tables n Test application with large amounts of data and heavy usage n May need a performance analyzer tool for DBMS 4 Monitor usage 4 Identify bottlenecks 4 Suggest columns to index

Data Warehouses n Create new database for online analytical processing n Data Warehouse 4

Data Warehouses n Create new database for online analytical processing n Data Warehouse 4 Holds copy of transaction data 4 Dedicated to managerial queries 4 Data has many sources 4 Multiple indexes built 4 Data may be de-normalized to improve retrieval performance

Data Warehouse Predefined reports Interactive data analysis Operations data Daily data transfer OLTP Database

Data Warehouse Predefined reports Interactive data analysis Operations data Daily data transfer OLTP Database 3 NF tables Flat files Data warehouse Star configuration

Data Warehouse n Tools available to search warehouse 4 Basic SQL queries 4 Extract

Data Warehouse n Tools available to search warehouse 4 Basic SQL queries 4 Extract data to spreadsheets n Data Mining 4 Automated tools which search for hidden patterns in data 4 Statistical methods like regression analysis 4 Pattern recognition like neural networks

Data Warehouse n Goals 4 Regularly collect , clean, and store data 4 Improve

Data Warehouse n Goals 4 Regularly collect , clean, and store data 4 Improve performance of OLAP queries • Denormalizing data n Challenges 4 Set up transfer system to collect and clean data 4 Design storage structure to maximize performance 4 Create data analysis tools to statistically analyze data (Data Mining)

Extraction, Transformation, and Transportation (ETT) Customers Convert Client to Customer Apply standard product numbers

Extraction, Transformation, and Transportation (ETT) Customers Convert Client to Customer Apply standard product numbers Convert currencies Fix region codes Transaction data from diverse systems. Data warehouse: All data must be consistent.

Data Warehouse n Vendors have tools to help automate collection of data n Problem

Data Warehouse n Vendors have tools to help automate collection of data n Problem is extracting data without interfering with OLTP operations

OLTP v. OLAP

OLTP v. OLAP

OLAP n Generally depict data as multidimensional cube n OLAP tools provide interactive capabilities

OLAP n Generally depict data as multidimensional cube n OLAP tools provide interactive capabilities to allow managers to look at data from any perspective 4 Drill Down – looking at details 4 Roll Up – summarized data like averages and totals

Multidimensional Cube y r go e t a C Customer Location Time Sale Date

Multidimensional Cube y r go e t a C Customer Location Time Sale Date

OLAP Database Design n Measure – numeric measurement of an attribute such as sales

OLAP Database Design n Measure – numeric measurement of an attribute such as sales value or quantity n Fact Table 4 Measures come from Fact Table 4 Generally a detail table in a database n Sides of Cube 4 Dimension – each attribute chosen from other tables 4 Dimension tables must be related to Fact Table

OLAP Database Design n Two OLAP designs 4 All of the dimension tables are

OLAP Database Design n Two OLAP designs 4 All of the dimension tables are joined directly to the fact table – Star Design 4 At least one dimension table is joined through a second table before being joined to the fact table – Snowflake Design

Star Design Dimension Tables Products Sales Date Fact Table Sales Quantity Amount=Sale. Price*Quantity Customer

Star Design Dimension Tables Products Sales Date Fact Table Sales Quantity Amount=Sale. Price*Quantity Customer Location

Snowflake Design Merchandise Sale Item. ID Description Quantity. On. Hand List. Price Category Sale.

Snowflake Design Merchandise Sale Item. ID Description Quantity. On. Hand List. Price Category Sale. ID Sale. Date Employee. ID Customer. ID Sales. Tax OLAPItems Sale. ID Item. ID Quantity Sale. Price Amount City. ID Zip. Code City State Customer. ID Phone First. Name Last. Name Address Zip. Code City. ID Dimension tables can join to other dimension tables.

OLAP Data Browsing

OLAP Data Browsing

OLAP in SQL 99 standard added features to compute basic OLAP-type results n Analytic

OLAP in SQL 99 standard added features to compute basic OLAP-type results n Analytic Functions 4 Standard deviation 4 Variance 4 Covariance 4 Correlation 4 Linear regression

Data Mining Goal: To discover unknown relationships in the data that can be used

Data Mining Goal: To discover unknown relationships in the data that can be used to make better decisions. Transactions and operations Reports Specific ad hoc questions Queries Aggregate, compare, drill down OLAP Databases Unknown relationships Data Mining

Common Techniques n Classification/Prediction/Regression n Association Rules/Market Basket Analysis n Clustering 4 Data points

Common Techniques n Classification/Prediction/Regression n Association Rules/Market Basket Analysis n Clustering 4 Data points 4 Hierarchies n Neural Networks

Common Techniques n Deviation Detection n Sequential Analysis 4 Time series events 4 Websites

Common Techniques n Deviation Detection n Sequential Analysis 4 Time series events 4 Websites n Textual Analysis n Spatial/Geographic Analysis

Homework n Find at least two commercial OLAP tools and compare the features.

Homework n Find at least two commercial OLAP tools and compare the features.