Unit 1 Data Warehousing Introduction to Data Warehousing

Introduction to Data Warehousing v. Defined in many different ways, but not rigorously. Ø

Data Warehouse—Subject-Oriented v. Organized around major subjects, such as customer, product, sales. v. Focusing

Data Warehouse—Integrated v. Constructed by integrating multiple, heterogeneous data sources Ørelational databases, flat files,

Data Warehouse—Time Variant v. The time horizon for the data warehouse is significantly longer

Data Warehouse—Non-Volatile v. A physically separate store of data transformed from the operational environment.

Data Warehouse vs. Heterogeneous DBMS v. Traditional heterogeneous DB integration: ØBuild wrappers/mediators on top

Data Warehouse vs. Operational DBMS v. OLTP (on-line transaction processing) ØMajor task of traditional

Extraction, Transformation, and Loading (ETL) Processes v. The “plumbing” work of data warehousing v.

Data Extraction v. Often performed by COBOL routines (not recommended because of high program

Reasons for “Dirty” Data v. Dummy Values v. Absence of Data v. Multipurpose Fields

Data Cleansing v. Source systems contain “dirty data” that must be cleansed v. ETL

Parsing v. Parsing locates and identifies individual data elements in the source files and

Correcting v. Corrects parsed individual data components using sophisticated data algorithms and secondary data

Standardizing v. Standardizing applies conversion routines to transform data into its preferred (and consistent)

Matching v. Searching and matching records within and across the parsed, corrected and standardized

Consolidating v. Analyzing and identifying relationships between matched records and consolidating/merging them into ONE

Data Transformation v. Transforms the data in accordance with the business rules and standards

Warehouses Design Issues Data Warehouse Design Process v. Choose the grain (atomic level of

Two Data Warehousing Strategies v. Enterprise-wide warehouse, top down, the Inmon methodology v. Data

Enterprise-wide Strategies v. A comprehensive warehouse is built initially v. An initial dependent data

Guidelines for Data Warehouse Implementation Three Data Warehouse Models v. Enterprise warehouse Øcollects all

Data Warehouse Development: A Recommended Approach

Data Warehouse Meta Data v. Data about data v. Needed by both information technology

Slides: 27

Download presentation

Unit 1 Data Warehousing

Introduction to Data Warehousing v. Defined in many different ways, but not rigorously. Ø A decision support database that is maintained separately from the organization’s operational database Ø Support information processing by providing a solid platform of consolidated, historical data for analysis. v “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process. ”—W. H. Inmon v. Data warehousing: ØThe process of constructing and using data warehouses

Data Warehouse—Subject-Oriented v. Organized around major subjects, such as customer, product, sales. v. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. v. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse—Integrated v. Constructed by integrating multiple, heterogeneous data sources Ørelational databases, flat files, on-line transaction records v. Data cleaning and data integration techniques are applied. ØEnsure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E. g. , Hotel price: currency, tax, breakfast covered, etc. ØWhen data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant v. The time horizon for the data warehouse is significantly longer than that of operational systems. ØOperational database: current value data. ØData warehouse data: provide information from a historical perspective (e. g. , past 5 -10 years) v. Every key structure in the data warehouse ØContains an element of time, explicitly or implicitly ØBut the key of operational data may or may not contain “time element”.

Data Warehouse—Non-Volatile v. A physically separate store of data transformed from the operational environment. v. Operational update of data does not occur in the data warehouse environment. ØDoes not require transaction processing, recovery, and concurrency control mechanisms ØRequires only two operations in data accessing: qinitial loading of data and access of data.

Operational Data Stores (ODS)

Data Warehouse vs. Heterogeneous DBMS v. Traditional heterogeneous DB integration: ØBuild wrappers/mediators on top of heterogeneous databases ØQuery driven approach q. When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set q. Complex information filtering, compete for resources v. Data warehouse: update-driven, high performance ØInformation from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS v. OLTP (on-line transaction processing) ØMajor task of traditional relational DBMS ØDay-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. v. OLAP (on-line analytical processing) ØMajor task of data warehouse system ØData analysis and decision making v. Distinct features (OLTP vs. OLAP): ØUser and system orientation: customer vs. market ØData contents: current, detailed vs. historical, consolidated ØDatabase design: ER + application vs. star + subject ØView: current, local vs. evolutionary, integrated ØAccess patterns: update vs. read-only but complex queries

Extraction Transformation Loading (ETL)

Extraction, Transformation, and Loading (ETL) Processes v. The “plumbing” work of data warehousing v. Data are moved from source to target data bases v. A very costly, time consuming part of data warehousing Recent Development: More Frequent Updates v. Updates can be done in bulk and trickle modes v. Business requirements, such as trading partner access to a Web site, requires current data v. For international firms, there is no good time to load the warehouse

Data Extraction v. Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data) v. Sometimes source data is copied to the target database using the replication capabilities of standard RDBMS routine (not recommended because of “dirty data” in the source systems) v. Increasing performed by specialized ETL software Sample ETL Tools v. Teradata Warehouse Builder from Teradata v. Data. Stage from Ascential Software v. SAS System from SAS Institute v. Power Mart/Power Center from Informatica v. Sagent Solution from Sagent Software v. Hummingbird Genio Suite from Hummingbird Communications

Reasons for “Dirty” Data v. Dummy Values v. Absence of Data v. Multipurpose Fields v. Cryptic Data v. Contradicting Data v. Inappropriate Use of Address Lines v. Violation of Business Rules v. Reused Primary Keys, v. Non-Unique Identifiers v. Data Integration Problems

Data Cleansing v. Source systems contain “dirty data” that must be cleansed v. ETL software contains rudimentary data cleansing capabilities v. Specialized data cleansing software is often used. Important for performing name and address correction and house holding functions v. Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i. d. Centric) Steps in Data Cleansing v. Parsing v. Correcting v. Standardizing v. Matching v. Consolidating

Parsing v. Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. v. Examples include parsing the first, middle, and last name; street number and street name; and city and state.

Correcting v. Corrects parsed individual data components using sophisticated data algorithms and secondary data sources. v. Example include replacing a vanity address and adding a zip code.

Standardizing v. Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. v. Examples include adding a pre name, replacing a nickname, and using a preferred street name.

Matching v. Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. v. Examples include identifying similar names and addresses.

Consolidating v. Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Data Transformation v. Transforms the data in accordance with the business rules and standards that have been established v. Example include: format changes, de duplication, splitting up fields, replacement of codes, derived values, and aggregates Data Loading v. Data are physically moved to the data warehouse v. The loading takes place within a “load window” v. The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications

Warehouses Design Issues Data Warehouse Design Process v. Choose the grain (atomic level of data) of the business process v. Choose a business process to model, e. g. , orders, invoices, etc. v. Choose the dimensions that will apply to each fact table record v. Choose the measure that will populate each fact table record

Two Data Warehousing Strategies v. Enterprise-wide warehouse, top down, the Inmon methodology v. Data mart, bottom up, the Kimball methodology v. When properly executed, both result in an enterprise-wide data warehouse Data Mart Strategies v. The most common approach v. Begins with a single mart and architected marts are added over time for more subject areas v. Relatively inexpensive and easy to implement v. Can be used as a proof of concept for data warehousing v. Can perpetuate the “silos of information” problem v. Can postpone difficult decisions and activities v. Requires an overall integration plan

Enterprise-wide Strategies v. A comprehensive warehouse is built initially v. An initial dependent data mart is built using a subset of the data in the warehouse v. Additional data marts are built using subsets of the data in the warehouse v. Like all complex projects, it is expensive, time consuming, and prone to failure v. When successful, it results in an integrated, scalable warehouse Data Sources and Types v. Primarily from legacy, operational systems v. Almost exclusively numerical data at the present time v. External data may be included, often purchased from third-party sources v. Technology exists for storing unstructured data and expect this to become more important over time

Multi-Tiered Architecture

Guidelines for Data Warehouse Implementation Three Data Warehouse Models v. Enterprise warehouse Øcollects all of the information about subjects spanning the entire organization v. Data Mart Øa subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart q. Independent vs. dependent (directly from warehouse) data mart v. Virtual warehouse ØA set of views over operational databases ØOnly some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

Data Warehouse Meta Data v. Data about data v. Needed by both information technology personnel and users v. IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. v. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.