Unit 1 Data Warehousing and Online Analytical Processing

  • Slides: 80
Download presentation
Unit 1 : Data Warehousing and On-line Analytical Processing 1

Unit 1 : Data Warehousing and On-line Analytical Processing 1

What is a Data Warehouse? n Defined of DW in many different ways n

What is a Data Warehouse? n Defined of DW in many different ways n DW can be defined as a repository of purposely selected operational data which can successfully answer any complex queries. n “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decisionmaking process. ”—W. H. Inmon n Goal of DW: n Increase effectiveness of decision making process n Data in DW has to be preprocessed n DW is a storage system n In DW large volumes of data is stored n In DW we retrieve desirable information very easy and reliable way 2

Introduction n n DW , OLAP, DM are growing trends in information technology. In

Introduction n n DW , OLAP, DM are growing trends in information technology. In DW data has to be systematically organized, understand used to make strategic decisions. DW is a marketing weapon to retain customers by learning about their needs DW refers to database that is maintained separately from organizations operational database DW provide solid platform to consolidated historical data. 17 September 2020 Data Mining: Concepts and Techniques 3

Introduction n DW is constructed by integrating data from multiple heterogeneous sources to support

Introduction n DW is constructed by integrating data from multiple heterogeneous sources to support complex queries, analytical reporting, decision making DW construction requires Data cleaning Data integration Data consolidation It allows knowledge workers to use DW to quickly obtain overview of data and make decisions 17 September 2020 Data Mining: Concepts and Techniques 4

Why organizations use information in DW n n n Increase customer focus Analyze operations

Why organizations use information in DW n n n Increase customer focus Analyze operations Increase profits Managing customer relationships They collect information from heterogeneous sources to provide efficient and easy access 17 September 2020 Data Mining: Concepts and Techniques 5

Characteristics of data warehouse It has four characteristics n Subject oriented n Integrated n

Characteristics of data warehouse It has four characteristics n Subject oriented n Integrated n Time variant n Non volatile 17 September 2020 Data Mining: Concepts and Techniques 6

Data Warehouse—Subject-Oriented n DW mainly focuses or organized around important/major subjects n It doesn’t

Data Warehouse—Subject-Oriented n DW mainly focuses or organized around important/major subjects n It doesn’t concentrate on day to day activities and transaction processing of organization n It Focuses on the modeling and analysis of data for decision making n It focus on subjects that are critical to organization n It excludes data that is not useful in the decision support process 7

Data Warehouse—Integrated n n n DW is Constructed by integrating multiple, heterogeneous data sources

Data Warehouse—Integrated n n n DW is Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Because of heterogeneity data is stored in inconsistent manner To maintain consistency and reliability various techniques are applied Data cleaning and data integration techniques are applied. 8

Data Warehouse—Time Variant n n DW not only stores current information but also stores

Data Warehouse—Time Variant n n DW not only stores current information but also stores historical information. The time horizon for the data warehouse is significantly longer than that of operational systems n n Operational database: current value data Data warehouse data: provide information from a historical perspective (e. g. , past 5 -10 years) Every key structure in the data warehouse. Contains an element of time, explicitly or implicitly Advantages of time variant: n Future prediction n Enables analyzing historical data n It associates information of past to present data 9

Data Warehouse—Nonvolatile n Data is transferred into DW at regular intervals of time n

Data Warehouse—Nonvolatile n Data is transferred into DW at regular intervals of time n Once data enters DW it remains static until particular event is triggered. n In order to access data in DW it has 2 operations n Data loading n Data accessing 10

Query driven & update driven approach n n n n Query is posted in

Query driven & update driven approach n n n n Query is posted in client side Meta data dictionary translates query for individual heterogeneous databases This query is mapped and sent to local query processor The result returned from different sites is integrated This is called “query driven approach” Update driven approach: DW employs “update driven approach” in which information from heterogeneous sources are integrated in advance and stored in DW for direct querying and analysis. 17 September 2020 Data Mining: Concepts and Techniques 11

OLTP vs. OLAP OLTP: n The major task is to perform online transaction and

OLTP vs. OLAP OLTP: n The major task is to perform online transaction and query processing n They cover day to day operations n They serve knowledge workers in decision making OLAP n They organize data in various formats inorder to accommodate diverse needs of different users n They manage large amounts of historical data n Summarization n Aggregation n Stores data at different granularity 17 September 2020 Data Mining: Concepts and Techniques 12

OLTP vs. OLAP n n n The major distinguishing features between OLTP vs. OLAP

OLTP vs. OLAP n n n The major distinguishing features between OLTP vs. OLAP are as follows Users and system orientation Data content Data base design View Access patterns 17 September 2020 Data Mining: Concepts and Techniques 13

OLTP vs. OLAP OLTP OLAP USER’S AND SYSTEM ORIENTATION It is customer oriented It

OLTP vs. OLAP OLTP OLAP USER’S AND SYSTEM ORIENTATION It is customer oriented It is market oriented It is used for query processing by clerk and clients It is used for data analysis by knowledge workers (managers, analyst) Data content OLTP OLAP It manages current data It manages large amounts of historical data They are used for decision making It provides summarization, aggregation 17 September 2020 Data Mining: Concepts and Techniques 14

OLTP vs. OLAP OLTP OLAP Database design It adopts ER Model and It adopts

OLTP vs. OLAP OLTP OLAP Database design It adopts ER Model and It adopts star/snowflake model It adopts application oriented database design It adopts subject oriented database design View OLTP OLAP It focus on current data It uses data from different organizations so it has huge volume of data. Access patterns They are short atomic transactions. it requires concurrency control and recovery mechanisms 17 September 2020 They are mostly read only operations Data Mining: Concepts and Techniques 15

OLTP vs. OLAP 16

OLTP vs. OLAP 16

Why a Separate Data Warehouse? n High performance for both systems n n n

Why a Separate Data Warehouse? n High performance for both systems n n n OLTP DBMS is tuned for: access methods, indexing, concurrency control, recovery OLAP Warehouse is tuned for : complex OLAP queries, multidimensional view, consolidation missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 17

Multi dimensional data model 17 September 2020 Data Mining: Concepts and Techniques 18

Multi dimensional data model 17 September 2020 Data Mining: Concepts and Techniques 18

From Tables and Spreadsheets to Data Cubes n A data warehouse is based on

From Tables and Spreadsheets to Data Cubes n A data warehouse is based on a multidimensional data model which views data in the form of a data cube n Data cube: it allows data to be modeled and viewed using multiple dimensions n Data cube is defined by dimensions n Organization stores records in accordance with dimensions n Eg: automobile_sale_database n It maintains information about sales that tool place in a particular division wrt dimensions (time, item, branch, location) n A data cube allows data to be modeled and viewed in multiple dimensions 19

Dimension & fact table Data cube allows us to keep track of things Each

Dimension & fact table Data cube allows us to keep track of things Each dimension may have table associated with it called dimension table This dimension table is is specified by user/expert/automatically genrated Eg: Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table: multi dimensional data model is typically organized around central theme. This theme is represented by a fact table n n n Fact table contains measures (such as dollars_sold) keys to each of the related dimension tables 17 September 2020 Data Mining: Concepts and Techniques 20

Base & apex cuboid n In data warehousing literature, an n-D base cube is

Base & apex cuboid n In data warehousing literature, an n-D base cube is called a base cuboid. n The top most 0 -D cuboid, which holds the highest-level of summarization, is called the apex cuboid. n The lattice of cuboids forms a data cube. n n-D cuboid is series of (n-1) D cuboid n 4 -D cube is being series of 3 -D cubes n Given a set of dimensions we can generate a cuboid for each of possible set of given dimensions n The result would form lattice of cuboids which shows data at different levels of summarization 17 September 2020 Data Mining: Concepts and Techniques 21

Cube: A Lattice of Cuboids all time 0 -D (apex) cuboid item time, location

Cube: A Lattice of Cuboids all time 0 -D (apex) cuboid item time, location time, item location item, location time, supplier location, supplier item, supplier time, location, supplier time, item, location time, item, supplier 1 -D cuboids 2 -D cuboids 3 -D cuboids item, location, supplier 4 -D (base) cuboid time, item, location, supplier 22

Conceptual Modeling of Data Warehouses n The ER Model is used for design of

Conceptual Modeling of Data Warehouses n The ER Model is used for design of relational database n The most popular data model for DW is multi dimensional data model n Such a MD model can exist in form of star/snow flake/fact constellation schema n Modeling data warehouses: dimensions & measures n Star schema: A fact table in the middle connected to a set of dimension tables n Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake n Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 23

STAR SCHEMA n n This schema contains central fact table connected to set of

STAR SCHEMA n n This schema contains central fact table connected to set of dimensions tables Each dimension is represent by only one table Each table contains set of attributes Fact table has keys of all dimension table and measures 17 September 2020 Data Mining: Concepts and Techniques 24

Example of Star Schema time item time_key day_of_the_week month quarter year Sales Fact Table

Example of Star Schema time item time_key day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch_name branch_type location_key units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location_key street city state_or_province country Measures 25

Snowflake SCHEMA n n n n In this schema is variant of star schema

Snowflake SCHEMA n n n n In this schema is variant of star schema In this some dimension tables are normalized there by further splitting the data into additional tables The resulting graph forms a shape similar to snow flake Redundancy is reduced here In this more joins are required to execute a query Each dimension table contains set of attributes Fact table has keys of all dimension table and measures 17 September 2020 Data Mining: Concepts and Techniques 26

Defining star schema n n n n n Data mining query language can be

Defining star schema n n n n n Data mining query language can be used to specify data mining tasks In data warehouse and data mart we have to define cube, define dimension Cube definition has the following syntax: define cube(cube_name)[dimension_list)]: (measure list) Dimension definition: define dimension (dimension _name) as ((attribute or dimension _list)) Eg: define cube sales_star[time, item, branch, location]: dollars_sold=sum(sales_in_dollars) , units_sold=count(*) Define dimension time as(time_key, day_of_week, month, quarter, year) 17 September 2020 Data Mining: Concepts and Techniques 27

Example of Snowflake Schema time_key day_of_the_week month quarter year item Sales Fact Table time_key

Example of Snowflake Schema time_key day_of_the_week month quarter year item Sales Fact Table time_key item_key branch location_key branch_name branch_type units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key supplier_type location_key street city_key city state_or_province country 28

Defining snow flake schema n n n n Data mining query language can be

Defining snow flake schema n n n n Data mining query language can be used to specify snowflake schema Cube definition has the following syntax: Eg: define cube sales_snow flake[time, item, branch, location]: dollars_sold=sum(sales_in_dollars), unit s_sold=count(*) Define dimension time as(time_key, day_of_week, month, quarter, year) Define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) The definition here is similar to star schema except that the dimensions are normalized 17 September 2020 Data Mining: Concepts and Techniques 29

Fact constellation n n This schema specifies two fact tables It allows dimension tables

Fact constellation n n This schema specifies two fact tables It allows dimension tables to be shared between fact tables 17 September 2020 Data Mining: Concepts and Techniques 30

Example of Fact Constellation time_key day_of_the_week month quarter year item Sales Fact Table time_key

Example of Fact Constellation time_key day_of_the_week month quarter year item Sales Fact Table time_key item_key item_name brand type supplier_type location_key branch_name branch_type units_sold dollars_sold avg_sales Measures time_key item_key shipper_key from_location branch_key branch Shipping Fact Table location to_location_key street city province_or_state country dollars_cost units_shipped shipper_key shipper_name location_key shipper_type 31

Defining fact constellation schema n n n n n Cube definition has the following

Defining fact constellation schema n n n n n Cube definition has the following syntax: define cube sales_star[time, item, branch, location]: dollars_sold=sum(sales_in_dollars) , units_sold=count(*) Define dimension time as(time_key, day_of_week, month, quarter, year) Define dimension item as(item_key, item_name, brand, type, supplier_type) Define dimension branch as(branch_key, branch_name, branch_type) Define dimension location as(location_key, street, city, provinance_state, country) Define cube shipping(time, item, shipper, from_location, to_location) Define dimension time as time is cube sales Define dimension from_location as location in cube slaes 17 September 2020 Data Mining: Concepts and Techniques 32

Data Cube Measures: Three Categories n n n Data cube measure is a numerical

Data Cube Measures: Three Categories n n n Data cube measure is a numerical function that can be evaluated at each point in the data cube space Measures can be organized into 3 categories Distributive Algebraic Holistic Distributive: An aggregate fuction is distributive if it can be computed in distributive manner as follows Data are partitioned into n sets We apply the function to each partition. Results in n aggregate values if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning E. g. , count(), sum(), min(), max() 17 September 2020 Data Mining: Concepts and Techniques 33

Data Cube Measures: Three Categories n Algebraic: if it can be computed by an

Data Cube Measures: Three Categories n Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function n E. g. , avg()= sum()/count() Where sum() and count() are distributive aggregate functions, min_N(), standard_deviation() n Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. n E. g. , median(), mode(), rank() 34

Concept hierarchy n n n It defines sequence of mappings from set of low

Concept hierarchy n n n It defines sequence of mappings from set of low level concepts to higher level Many of concept hierarchies are implicit within the database schema These attributes are related by total order forming concept hierarchy like “street<city<state<country” Alternatively the attributes may be organized in a partial order forming a lattice “day<{month<quarter; week}<year” 17 September 2020 Data Mining: Concepts and Techniques 35

Total order concept heirarchy country State City street 17 September 2020 Data Mining: Concepts

Total order concept heirarchy country State City street 17 September 2020 Data Mining: Concepts and Techniques 36

Partial order concept hierarchy year quarter month week day 17 September 2020 Data Mining:

Partial order concept hierarchy year quarter month week day 17 September 2020 Data Mining: Concepts and Techniques 37

Set grouping hierarchy of attribute price $0 -$400 $0 -$200 $0 -$100 -$200 -$400

Set grouping hierarchy of attribute price $0 -$400 $0 -$200 $0 -$100 -$200 -$400 $200 -$300 -$400

Concept heirarchy n n n Total/partial order among attributes in database schema is called

Concept heirarchy n n n Total/partial order among attributes in database schema is called schema hierarchy Set grouping hierarchy: Concept hierarchies may also be defined by grouping values for a given dimension/attribute {$x…$y} $x exclusive $y inclusive There may be more than one concept hierarchy for given attribute Concept hierarchies may be grouped manually by experts, users or automatically generated Concept hierarchies allows data to be handled at varying levels of abstraction 17 September 2020 Data Mining: Concepts and Techniques 39

A Concept Hierarchy: Dimension (location) all Europe region country city office Germany Frankfurt .

A Concept Hierarchy: Dimension (location) all Europe region country city office Germany Frankfurt . . Spain North_America Canada Vancouver. . . L. Chan . . . Mexico Toronto M. Wind 40

OLAP OPERATIONS IN MD DATA MODEL n n n n n In MD data

OLAP OPERATIONS IN MD DATA MODEL n n n n n In MD data model data is organized into multiple dimensions Each dimension contains multiple levels of abstraction defined by concept hierarchies OLAP provides user friendly environment for interactive data analysis Roll up: it is performed on centre cube by climbing up the concept hierarchy It shows the aggregate of data by ascending the location heirarchy from the level of city to level of country Dimension reduction occurs when roll up is performed Drill down: it is reverse of roll up It navigates from less detailed to more detailed data It occurs by descending the concept heirarchy It adds more details 17 September 2020 Data Mining: Concepts and Techniques 41

Typical OLAP Operations n n n Roll up (drill-up): summarize data n by climbing

Typical OLAP Operations n n n Roll up (drill-up): summarize data n by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up n from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: it performs project and select operation on one dimension of the given cube Dice: it performs project and select operation on two or more dimension of the given cube Pivot (rotate): n reorient the cube, visualization, 3 D to series of 2 D planes n It provides alternative presentation of the data Other operations n drill across: it executes queries that involve (across) more than one fact table n drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 42

Fig. 3. 10 Typical OLAP Operations 43

Fig. 3. 10 Typical OLAP Operations 43

Multidimensional Data Sales volume as a function of product, month, and region gi on

Multidimensional Data Sales volume as a function of product, month, and region gi on Dimensions: Product, Location, Time Hierarchical summarization paths Re Industry Region Year Category Country Quarter Product n Product City Office Month Week Day Month 44

Pr od TV PC VCR sum 1 Qtr 2 Qtr Date 3 Qtr 4

Pr od TV PC VCR sum 1 Qtr 2 Qtr Date 3 Qtr 4 Qtr sum Total annual sales of TVs in U. S. A Canada Mexico Country uc t A Sample Data Cube sum 45

Cuboids Corresponding to the Cube all 0 -D (apex) cuboid product, date country product,

Cuboids Corresponding to the Cube all 0 -D (apex) cuboid product, date country product, country 1 -D cuboids date, country 2 -D cuboids product, date, country 3 -D (base) cuboid 46

A Star-Net Query Model n n Querying of MD data model can be based

A Star-Net Query Model n n Querying of MD data model can be based on star net model It consists of radial lines emanating from a central point Each line represents a concept hierarchy Each abstraction level in the hierarchy is called a foot print 17 September 2020 Data Mining: Concepts and Techniques 47

A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK Time

A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK Time PRODUCT LINE ANNUALY QTRLY DAILY CITY Product PRODUCT ITEM PRODUCT GROUP SALES PERSON COUNTRY DISTRICT REGION Location Each circle is called a footprint DIVISION Promotion Organization 48

Browsing a Data Cube n n n Visualization OLAP capabilities Interactive manipulation 49

Browsing a Data Cube n n n Visualization OLAP capabilities Interactive manipulation 49

Steps for design and construction of data warehouses n n n n n Design

Steps for design and construction of data warehouses n n n n n Design of data warehouse: business analysis framework DW can enhance business productivity DW quickly and efficiently, accurately, gathers information It facilitates customer relationship management It facilitates cost reduction by tracking trends To design effective DW we need to analyze the business needs and construct business analysis framework. Building DW is a complex task. It requires business skills, technology skills, program management skills. Extractor : transform data from operational systems to DW Ware house refresh software: it keeps the DW up to date information 17 September 2020 Data Mining: Concepts and Techniques 50

Four views for design of data warehouse Top down view: it allows selection of

Four views for design of data warehouse Top down view: it allows selection of relevant information necessary for DW. n This information should match current and future trends The data source view: it exposes information being captured, stored and managed by operational systems n Information is documented at different levels of details n Data source is often modeled using ER, CASE Tools Data warehouse view: it includes fact & dimensions tables. n It also provides time of origin to historical context. Business query view: it provides perspective of data from the view point of end user 17 September 2020 Data Mining: Concepts and Techniques 51

Process of DW design n n DW can be built using top down, bottom

Process of DW design n n DW can be built using top down, bottom up approaches or both. When to choose top down approach: when business problem is clear and well understood. Bottom up approach starts with experiments and prototypes It is useful in early stages of business modeling It allows organization to move with less expense In combined approach it allows organization to support planned and strategic nature of top down and retaining rapid implementation of bottom up It is developed using waterfall and spiral model Water fall perform structured and systematic analysis at each step Spiral model involves short intervals between successive releases Spiral can be done quickly In spiral new designs & technologies can be adapted in a timely manner. 17 September 2020 Data Mining: Concepts and Techniques 52

Steps in DW design n n n n Step 1: Choose a business process

Steps in DW design n n n n Step 1: Choose a business process model(sales, shipmenta, account administration) If the process model is organizational and involves complex object collection DW model should be followed If the model is departmental and focus on analysis of one kind of business process data mart model is chosen Step 2: Choose the grain of business process The grain is atomic level of data represented in fact table. Step 3: Choose the dimensions that will apply for each fact table Step 4 : Choose measures that will populate each fact table record As DW design is complex and long term task scope has to be clearly defined various tools for DW design are DW development tools(which edit metadata repository, answer queries), planning and analysis tools (study the impact of schema changes, performance) 17 September 2020 Data Mining: Concepts and Techniques 53

Three tier DW architecture n n n n n Bottom tier data warehouse server:

Three tier DW architecture n n n n n Bottom tier data warehouse server: it is a relational database Back end tools are used to feel the data These tools perform the following Data extraction Data cleaning Data transformation To update DW the following function are used Load & refresh Data is extracted using API called gateways(ODBC, JDBC) this tier also has metadata repository which stores information about DW and its contents 17 September 2020 Data Mining: Concepts and Techniques 54

Extraction, Transformation, and Loading (ETL) n n n Data extraction n get data from

Extraction, Transformation, and Loading (ETL) n n n Data extraction n get data from multiple, heterogeneous, and external sources Data cleaning n detect errors in the data and rectify them when possible Data transformation n convert data from legacy or host format to warehouse format Load n sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh n propagate the updates from the data sources to the warehouse 55

Metadata Repository n Meta data is the data defining warehouse objects. It stores: n

Metadata Repository n Meta data is the data defining warehouse objects. It stores: n Description of the structure of the data warehouse n n schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents Operational meta-data n data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) n The algorithms used for summarization n The mapping from operational environment to the data warehouse n n Data related to system performance n warehouse schema, view and derived data definitions Business data n business terms and definitions, ownership of data, charging policies 56

Three tier DW architecture n n Middle tier OLAP server: it is typically implemented

Three tier DW architecture n n Middle tier OLAP server: it is typically implemented using Relational OLAP(ROLAP) ROLAP: it extend relational dbms that maps operation on multidimensional operations to standard relational operations Multi dimensional OLAP(MOLAP): it is a special purpose server that directly implements multidimensional data and operations Top tier front end client layer: it contains query and reporting tools, analysis tools, data mining tools(trend analysis, prediction 17 September 2020 Data Mining: Concepts and Techniques 57

Data Warehouse: A Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load Refresh

Data Warehouse: A Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load Refresh Monitor & Integrator Data Warehouse OLAP Server Output Analysis Query Reports Data mining Data Marts Data Sources Bottom tier: DW server Middle tier Top tier: : OLAP server Front-End Tools 58

Three Data Warehouse Models n n Enterprise warehouse n collects all of the information

Three Data Warehouse Models n n Enterprise warehouse n collects all of the information about subjects spanning the entire organization. n It contains detailed data & summarized data n It may take years to design n It is implemented in mainframe systems Data Mart n It contains a subset of corporate-wide data that is of value to a specific groups of users. n Its scope is confined to specific, selected groups, such as marketing data mart n They are implemented on low cost servers n Implementation involves weeks, months 59

Three Data Warehouse Models Data marts can be divided Independent vs. dependent (directly from

Three Data Warehouse Models Data marts can be divided Independent vs. dependent (directly from warehouse) data mart n Independent data mart captures data generated directly form specific geographical location n Dependent data mart are sourced directly form enterprise DW Virtual warehouse n A set of views over operational databases n Only some of the possible summary views may be materialized n It is easy to built n It requires excess capacity n n 17 September 2020 Data Mining: Concepts and Techniques 60

Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart

Data Warehouse Development: A Recommended Approach Multi-Tier Data Warehouse Distributed Data Marts Data Mart Model refinement Enterprise Data Warehouse Model refinement Define a high-level corporate data model 61

Data Warehouse Development: A Recommended Approach n n n n High level corporate model

Data Warehouse Development: A Recommended Approach n n n n High level corporate model is defined with reasonable short period(1/2 months) It provide consistent and integrated view among different subjects High level corporate model is refined to develop Enterprise DW Departmental data marts Independent data marts are implemented in parallel with enterprise DW Distributed data marts are constructed to integrate data marts via hub servers Finally multi tier DW is constructed 17 September 2020 Data Mining: Concepts and Techniques 62

Data Warehouse Usage n Three kinds of data warehouse applications n Information processing n

Data Warehouse Usage n Three kinds of data warehouse applications n Information processing n n n supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing n multidimensional analysis of data warehouse data n supports basic OLAP operations, slice-dice, drilling, pivoting Data mining n n knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools 63

DW Implementation n n n DW contains huge volumes of data It is crucial

DW Implementation n n n DW contains huge volumes of data It is crucial for DW system to support efficient cube computation techniques Multi dimensional analysis is the efficient computation of aggregation across many sets of dimensions Each group by is represented as cuboids Set of group-by form a lattice of cuboids/data cube It requires excessive storage space 17 September 2020 Data Mining: Concepts and Techniques 64

Efficient Data Cube Computation n Data cube can be viewed as a lattice of

Efficient Data Cube Computation n Data cube can be viewed as a lattice of cuboids n The bottom-most cuboid is the base cuboid n The top-most cuboid (apex) contains only one cell n How many cuboids in an n-dimensional cube with L levels? 65

The “Compute Cube” Operator n Cube definition and computation in DMQL define cube sales

The “Compute Cube” Operator n Cube definition and computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales n Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al. ’ 96) () SELECT item, city, year, SUM (amount) FROM SALES (city) (item) (year) CUBE BY item, city, year (city, item) (city, year) (item, year) (city, item, year) 66

Data Cube Technology 67

Data Cube Technology 67

Data cube computation n It is a essential task in DW Precomputation of all

Data cube computation n It is a essential task in DW Precomputation of all parts of data cube can greatly reduce the response time and enhance the performance of OLAP But these computations require more time and storage 17 September 2020 Data Mining: Concepts and Techniques 68

Data Cube: A Lattice of Cuboids all time item time, location time, item 0

Data Cube: A Lattice of Cuboids all time item time, location time, item 0 -D(apex) cuboid location item, location time, supplier location, supplier item, supplier time, location, supplier time, item, locationtime, item, supplier 1 -D cuboids 2 -D cuboids 3 -D cuboids item, location, supplier 4 -D(base) cuboid time, item, location, supplierc 69

Data Cube Computation Methods n Multi-Way Array Aggregation n BUC n Star-Cubing n High-Dimensional

Data Cube Computation Methods n Multi-Way Array Aggregation n BUC n Star-Cubing n High-Dimensional OLAP 70

Multi-Way Array Aggregation n n It is also called as multi way Computes the

Multi-Way Array Aggregation n n It is also called as multi way Computes the full data cube by using multidimensional array as its basic data structure This approach is called as MOLAP Partition the array into chunks Chunk is a sub cube that is small enoufgh to fit into the memory available for cube computation Chunking is a method of dividing n-dimensional array into small ndimensional chunks Chunks are compressed so as to remove wasted space Compute the aggregates by visiting/accessing the valules at cube cells Order in which cells are visited I minimized to reduce memory access and storage This multi way aggregation computes aggregation simultaneously on multiple dimensions 17 September 2020 Data Mining: Concepts and Techniques 71

Multi-Way Array Aggregation n Array-based “bottom-up” algorithm n Using multi-dimensional chunks n No direct

Multi-Way Array Aggregation n Array-based “bottom-up” algorithm n Using multi-dimensional chunks n No direct tuple comparisons n n Simultaneous aggregation on multiple dimensions Intermediate aggregate values are re-used for computing ancestor cuboids 72

Multi-way Array Aggregation for Cube Computation (MOLAP) n Partition arrays into chunks (a small

Multi-way Array Aggregation for Cube Computation (MOLAP) n Partition arrays into chunks (a small subcube which fits in memory). n Compressed sparse array addressing: (chunk_id, offset) n Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost. C c 3 61 62 63 64 c 2 45 46 47 48 c 1 29 30 31 32 c 0 B b 3 B 13 b 2 9 b 1 5 b 0 14 15 16 1 2 3 4 a 0 a 1 a 2 a 3 A 60 44 28 56 40 24 52 36 20 What is the best traversing order to do multi-way aggregation? 73

Multi-way Array Aggregation for Cube Computation (3 -D to 2 -D) n The best

Multi-way Array Aggregation for Cube Computation (3 -D to 2 -D) n The best order is the one that minimizes the memory requirement and reduced I/Os 74

Multi-way Array Aggregation for Cube Computation (2 -D to 1 -D) 75

Multi-way Array Aggregation for Cube Computation (2 -D to 1 -D) 75

Multi-Way Array Aggregation for Cube Computation (Method Summary) n Method: the planes should be

Multi-Way Array Aggregation for Cube Computation (Method Summary) n Method: the planes should be sorted and computed according to their size in ascending order n n Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions n If there a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored 76

Bottom-Up Computation (BUC) n n n n BUC is an algorithm for computation of

Bottom-Up Computation (BUC) n n n n BUC is an algorithm for computation of sparse and ice berg cubes BUC constructs the cube from apex cuboid towards the base cuboid This order of processing allows BUC to prune during construction using apriori property Apex cuboid is the most generalized or aggregated level 3 D base cuboid is the most detailed or specialized level This representation of lattice of cuboids with the apex at the top and base at the bottom is accepted in DW BUC views cuboid in reverse order with the apex at the bottom and base cuboid at the top. 17 September 2020 Data Mining: Concepts and Techniques 77

Algorithm : BUC Input n Input : the relation to aggregate n Dim :

Algorithm : BUC Input n Input : the relation to aggregate n Dim : the starting dimension for this iteration Globals: n Constant num. Dims: the total number of dimension n Constant cardinality(num. Dims): the cardinality of each dimension n Constant min_sup: the minimum number of tuples in a partition in order for it to be output; n output. Rec: the current output record: n data. Count(num. Dims): stores the size of each partition. Datacount(i) is a list of integers of size cardinality(i) 17 September 2020 Data Mining: Concepts and Techniques 78

Bottom-Up Computation (BUC) n n BUC (Beyer & Ramakrishnan, SIGMOD’ 99) Bottom-up cube computation

Bottom-Up Computation (BUC) n n BUC (Beyer & Ramakrishnan, SIGMOD’ 99) Bottom-up cube computation (Note: top-down in our view!) Divides dimensions into partitions and facilitates iceberg pruning n If a partition does not satisfy min_sup, its descendants can be pruned n If minsup = 1 Þ compute full CUBE! No simultaneous aggregation 79

BUC: Partitioning n n Usually, entire data set can’t fit in main memory Sort

BUC: Partitioning n n Usually, entire data set can’t fit in main memory Sort distinct values n partition into blocks that fit Continue processing Optimizations n Partitioning n External Sorting, Hashing, Counting Sort n Ordering dimensions to encourage pruning n Cardinality, Skew, Correlation n Collapsing duplicates n Can’t do holistic aggregates anymore! 80