Advanced Database Design and Implementation Dr Mohamed Ahmed
Advanced Database Design and Implementation Dr. Mohamed Ahmed Mohamed Ali University of Khartoum Faculty of Mathematical Sciences 18/4/2018 Advanced DB Design & Implementation 1
The Need for Data Analysis • Managers track daily transactions to evaluate how the business is performing • Strategies should be developed to meet organizational goals using operational databases • Data analysis provides information about short-term tactical evaluations and strategies 18/4/2018 Advanced DB Design & Implementation 2
Business Intelligence • Comprehensive, cohesive, integrated tools and processes – Capture, collect, integrate, store, and analyze data – Generate information to support business decision making • Framework that allows a business to transform: – Data into information – Information into knowledge – Knowledge into wisdom 18/4/2018 Advanced DB Design & Implementation 3
Business Intelligence Architecture • Composed of data, people, processes, technology, and management of components • Focuses on strategic and tactical use of information • Key performance indicators (KPI) – Measurements that assess company’s effectiveness or success in reaching goals • Multiple tools from different vendors can be integrated into a single BI framework 18/4/2018 Advanced DB Design & Implementation 4
18/4/2018 Advanced DB Design & Implementation 5
Business Intelligence Benefits • Main goal: improved decision making • Other benefits – Integrating architecture – Common user interface for data reporting and analysis – Common data repository fosters single version of company data – Improved organizational performance 18/4/2018 Advanced DB Design & Implementation 6
Business Intelligence Evolution 18/4/2018 Advanced DB Design & Implementation 7
18/4/2018 Advanced DB Design & Implementation 8
Business Intelligence Technology Trends • • • Data storage improvements Business intelligence appliances Business intelligence as a service Big Data analytics Personal analytics 18/4/2018 Advanced DB Design & Implementation 9
Decision Support Data • BI effectiveness depends on quality of data gathered at operational level • Operational data seldom well-suited for decision support tasks • Need reformat data in order to be useful for business intelligence 18/4/2018 Advanced DB Design & Implementation 10
Operational Data vs. Decision Support Data • Operational data – Mostly stored in relational database – Optimized to support transactions representing daily operations • Decision support data differs from operational data in three main areas: – Time span – Granularity – Dimensionality 18/4/2018 Advanced DB Design & Implementation 11
18/4/2018 Advanced DB Design & Implementation 12
Decision Support Database Requirements • Specialized DBMS tailored to provide fast answers to complex queries • Three main requirements – Database schema – Data extraction and loading – End-User Analytical Interface – Database size 18/4/2018 Advanced DB Design & Implementation 13
Decision Support Database Requirements (cont’d. ) • Database schema – Complex data representations – Aggregated and summarized data – Queries extract multidimensional time slices • Data extraction and filtering – Supports different data sources • Flat files • Hierarchical, network, and relational databases • Multiple vendors – Checking for inconsistent data Advanced DB Design & Implementation 18/4/2018 14
Decision Support Database Requirements (cont’d. ) • End-User Analytical Interface – Support advanced data-modeling and data presentation tools – Permits the user to navigate through the data to simplify and accelerate the decision-making process • Database size – In 2008, Wal-Mart had 4 petabytes of data in its data warehouses – DBMS must support very large databases (VLDBs) – Support multiple-processor technologies : SMP symmetric multiprocessor 18/4/2018 Advanced DB Design & Implementation 15
The Data Warehouse • Integrated, subject-oriented, time-variant, and nonvolatile collection of data – Provides support for decision making • Usually a read-only database optimized for data analysis and query processing • Requires time, money, and considerable managerial effort to create 18/4/2018 Advanced DB Design & Implementation 16
The Data Warehouse 18/4/2018 Advanced DB Design & Implementation 17
18/4/2018 Advanced DB Design & Implementation 18
Data Marts • • Small, single-subject data warehouse subset More manageable data set than data warehouse Provides decision support to small group of people Typically lower cost and lower implementation time than data warehouse 18/4/2018 Advanced DB Design & Implementation 19
12 Rules That Define a Data Warehouse 18/4/2018 Advanced DB Design & Implementation 20
Warehouse Models & Operators • Data Models – relations – stars & snowflakes – cubes • Operators – slice & dice – roll-up, drill down – pivoting – other 18/4/2018 Advanced DB Design & Implementation 21
Multi-Dimensional Data • Measures - numerical (and additive) data being tracked in business, can be analyzed and examined • Dimensions - business parameters that define a transaction, relatively static data such as lookup or reference tables • Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions) 18/4/2018 Advanced DB Design & Implementation 22
The Multi-Dimensional Model “Sales by product line over the past six months” “Sales by store between 1990 and 1995” Store Info Key columns joining fact table Numerical Measures to dimension tables Prod Code Time Code Store Code Sales Fact table for measures Product Info Dimension tables Qty Time Info . . . 18/4/2018 Advanced DB Design & Implementation 23
Multidimensional Modeling • Multidimensional modeling is a technique for structuring data around the business concepts • ER models describe “entities” and “relationships” • Multidimensional models describe “measures” and “dimensions” 18/4/2018 Advanced DB Design & Implementation 24
Dimensional Modeling • Dimensions are organized into hierarchies – E. g. , Time dimension: days weeks quarters – E. g. , Product dimension: product line brand • Dimensions have attributes Time Date Month Year 18/4/2018 Store. ID City State Country Region Advanced DB Design & Implementation 25
Dimension Hierarchies Store Dimension Product Dimension Total Region District Stores 18/4/2018 Advanced DB Design & Implementation Manufacturer Brand Products 26
Schema Design • Most data warehouses use a star schema to represent multi-dimensional model. • Each dimension is represented by a dimension table that describes it. • A fact table connects to all dimension tables with a multiple join. Each tuple in the fact table consists of a pointer to each of the dimension tables that provide its multi -dimensional coordinates and stores measures for those coordinates. • The links between the fact table in the center and the dimension tables in the extremities form a shape like a star. 18/4/2018 Advanced DB Design & Implementation 27
Star Schema (in RDBMS) 18/4/2018 Advanced DB Design & Implementation 28
Star Schema Example 18/4/2018 Advanced DB Design & Implementation 29
Star Schema with Sample Data 18/4/2018 Advanced DB Design & Implementation 30
The “Classic” Star Schema A relational model with a one-to-many relationship between dimension table and fact table. A single fact table, with detail and summary data Fact table primary key has only one key column per dimension Each dimension is a single table, highly denormalized • Benefits: Easy to understand, intuitive mapping between the business entities, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata • Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem Advanced DB Design & Implementation 18/4/2018 31
Need for Aggregates • Sizes of typical tables: – Time dimension: 5 years x 365 days = 1825 – Store dimension: 300 stores reporting daily sales – Production dimension: 40, 000 products in each store (about 4000 sell in each store daily) – Maximum number of base fact table records: 2 billion (lowest level of detail) • A query involving 1 brand, all store, 1 year: retrieve/summarize over 7 million fact table rows. 18/4/2018 Advanced DB Design & Implementation 32
Aggregating Fact Tables • Aggregate fact tables are summaries of the most granular data at higher levels along the dimension hierarchies. hy c r a r Hie ls leve 18/4/2018 Product key Product Category Department Time key Date Month Quarter Year Product key Time key Store key Unit sales Sale dollars Store key Store name Territory Region Multi-way aggregates: Territory – Category – Month (Data values at higher level) Advanced DB Design & Implementation 33
The “Fact Constellation” Schema District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price Advanced DB Design & Implementation 18/4/2018 Region Fact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price 34
Aggregate Fact Tables Product key Product Category Department Time key Date Month Quarter Year 18/4/2018 Base table Sales facts Product key Time key Store key Unit sales Sale dollars Store key Store name Territory Region Dimension One-way aggregate Derived from Product Category Sale facts Category key Time key Store key Unit sales Sales dollars Advanced DB Design & Implementation Category key Category Department 35
Families of Stars Dimension table Fact table Dimension table 18/4/2018 Dimension table Advanced DB Design & Implementation Dimension table 36
Snowflake Schema • Snowflake schema is a type of star schema but a more complex model. • “Snowflaking” is a method of normalizing the dimension tables in a star schema. • The normalization eliminates redundancy. • The result is more complex queries and reduced query performance. 18/4/2018 Advanced DB Design & Implementation 37
Sales: Snowflake Schema Brand key Brand name Category key Product category Product key Product name Product code Brand key Product Region key Region name Sales fact Product key Time key Customer key …. Territory key Territory name Region key Salesrep key Salesperson name Territory key Salesrep 18/4/2018 Advanced DB Design & Implementation 38
Snowflaking • The attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are linked back to the original dimension table through artificial keys. Product key Product name Product code Brand key 18/4/2018 Brand key Brand name Category key Advanced DB Design & Implementation Category key Product category 39
Snowflake Schema • Advantages: – Small saving in storage space – Normalized structures are easier to update and maintain • Disadvantages: – Schema less intuitive and end-users are put off by the complexity – Ability to browse through the contents difficult – Degrade query performance because of additional joins 18/4/2018 Advanced DB Design & Implementation 40
What is the Best Design? • Performance benchmarking can be used to determine what is the best design. • Snowflake schema: easier to maintain dimension tables when dimension tables are very large (reduce overall space). It is not generally recommended in a data warehouse environment. • Star schema: more effective for data cube browsing (less joins): can affect performance. 18/4/2018 Advanced DB Design & Implementation 41
Aggregates · Add up amounts for day 1 · In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81 18/4/2018 Advanced DB Design & Implementation 42
Aggregates · Add up amounts by day · In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date 18/4/2018 Advanced DB Design & Implementation 43
Another Example · Add up amounts by day, product · In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prod. Id rollup drill-down 18/4/2018 Advanced DB Design & Implementation 44
Aggregates • Operators: sum, count, max, min, median, ave • “Having” clause • Using dimension hierarchy – average by region (within store) – maximum by month (within date) 18/4/2018 Advanced DB Design & Implementation 45
Data Cube Fact table view: Multi-dimensional cube: dimensions = 2 18/4/2018 Advanced DB Design & Implementation 46
3 -D Cube Fact table view: Multi-dimensional cube: day 2 day 1 dimensions = 3 18/4/2018 Advanced DB Design & Implementation 47
Example roll-up to region e NY Product St or SF LA Juice Milk Coke Cream Soap Bread 10 34 56 32 12 56 M T W Th F S S Dimensions: Time, Product, Store roll-up to brand Attributes: Product (upc, price, …) Store … … Hierarchies: Product Brand … Day Week Quarter roll-up to week Store Region Country Time 56 units of bread sold in LA on M 18/4/2018 Advanced DB Design & Implementation 48
Cube Aggregation: Roll-up Example: computing sums. . . day 2 day 1 129 rollup drill-down 18/4/2018 Advanced DB Design & Implementation 49
Cube Operators for Roll-up day 2 . . . day 1 sale(s 1, *, *) 129 sale(s 2, p 2, *) 18/4/2018 sale(*, *, *) Advanced DB Design & Implementation 50
Extended Cube * day 2 sale(*, p 2, *) day 1 18/4/2018 Advanced DB Design & Implementation 51
Aggregation Using Hierarchies store day 2 day 1 region country (store s 1 in Region A; stores s 2, s 3 in Region B) 18/4/2018 Advanced DB Design & Implementation 52
Slicing day 2 day 1 TIME = day 1 18/4/2018 Advanced DB Design & Implementation 53
Slicing & Pivoting 18/4/2018 Advanced DB Design & Implementation 54
Summary of Operations • Aggregation (roll-up) – aggregate (summarize) data to the next higher dimension element – e. g. , total sales by city, year total sales by region, year • Navigation to detailed data (drill-down) • Selection (slice) defines a subcube – e. g. , sales where city =‘Gainesville’ and date = ‘ 1/15/90’ • Calculation and ranking – e. g. , top 3% of cities by average income • Visualization operations (e. g. , Pivot) • Time functions – e. g. , time average 18/4/2018 Advanced DB Design & Implementation 55
Query & Analysis Tools • • • Query Building Report Writers (comparisons, growth, graphs, …) Spreadsheet Systems Web Interfaces Data Mining 18/4/2018 Advanced DB Design & Implementation 56
Implementation of OLAP Server • ROLAP: relational OLAP – data are stored in tables in relational databases or extended-relational databases. They use an RDBMS to manage the warehouse data and aggregations using often a star schema. • They support extensions to SQL. • A cell in the multi-dimensional structure is represented by a tuple. • Advantage: scalable (no empty cells for sparse cube). • Disadvantage: no direct access to cells. 18/4/2018 Advanced DB Design & Implementation 57
Implementation of OLAP Server • MOLAP: multidimensional OLAP – implements the multidimensional view by storing data in special multidimensional data structure (MDDS). • Advantage: fast indexing to pre-computed aggregations. Only values are stored. • Disadvantage: not very scalable and sparse. 18/4/2018 Advanced DB Design & Implementation 58
Data Analytics • Subset of BI functionality • Encompasses a wide range of mathematical, statistical, and modeling techniques – Purpose of extracting knowledge from data • Tools can be grouped into two separate areas: – Explanatory analytics – Predictive analytics 18/4/2018 Advanced DB Design & Implementation 59
Data Mining • Data-mining tools do the following: – Analyze data – Uncover problems or opportunities hidden in data relationships – Form computer models based on their findings – Use models to predict business behavior • Runs in two modes – Guided – Automated 18/4/2018 Advanced DB Design & Implementation 60
18/4/2018 Advanced DB Design & Implementation 61
Predictive Analytics • Employs mathematical and statistical algorithms, neural networks, artificial intelligence, and other advanced modeling tools • Create actionable predictive models based on available data • Models are used in areas such as: – Customer relationships, customer service, customer retention, fraud detection, targeted marketing, and optimized pricing 18/4/2018 Advanced DB Design & Implementation 62
Online Analytical Processing • Three main characteristics: – Multidimensional data analysis techniques – Advanced database support – Easy-to-use end-user interfaces 18/4/2018 Advanced DB Design & Implementation 63
Multidimensional Data Analysis Techniques • Data are processed and viewed as part of a multidimensional structure • Augmented by the following functions: – Advanced data presentation functions – Advanced data aggregation, consolidation, and classification functions – Advanced computational functions – Advanced data modeling functions 18/4/2018 Advanced DB Design & Implementation 64
Advanced Database Support • Advanced data access features include: – Access to many different kinds of DBMSs, flat files, and internal and external data sources – Access to aggregated data warehouse data – Advanced data navigation – Rapid and consistent query response times – Maps end-user requests to appropriate data source and to proper data access language – Support for very large databases 18/4/2018 Advanced DB Design & Implementation 65
Easy-to-Use End-User Interface • Advanced OLAP features are more useful when access is simple • Many interface features are “borrowed” from previous generations of data analysis tools – Already familiar to end users – Makes OLAP easily accepted and readily used 18/4/2018 Advanced DB Design & Implementation 66
OLAP Architecture • Three main architectural components: – Graphical user interface (GUI) – Analytical processing logic – Data-processing logic 18/4/2018 Advanced DB Design & Implementation 67
OLAP Architecture (cont’d. ) • Designed to use both operational and data warehouse data • In most implementations, data warehouse and OLAP are interrelated and complementary • OLAP systems merge data warehouse and data mart approaches 18/4/2018 Advanced DB Design & Implementation 68
18/4/2018 Advanced DB Design & Implementation 69
Relational OLAP • Relational online analytical processing (ROLAP) provides the following extensions: – Multidimensional data schema support within the RDBMS – Data access language and query performance optimized for multidimensional data – Support for very large databases (VLDBs) 18/4/2018 Advanced DB Design & Implementation 70
Multidimensional OLAP • Multidimensional online analytical processing (MOLAP) extends OLAP functionality to multidimensional database management systems (MDBMSs) – MDBMS end users visualize stored data as a 3 D data cube – Data cubes can grow to n dimensions, becoming hypercubes – To speed access, data cubes are held in memory in a cube cache 18/4/2018 Advanced DB Design & Implementation 71
Relational vs. Multidimensional OLAP • Selection of one or the other depends on evaluator’s vantage point • Proper evaluation must include supported hardware, compatibility with DBMS, etc. • ROLAP and MOLAP vendors working toward integration within unified framework • Relational databases use star schema design to handle multidimensional data 18/4/2018 Advanced DB Design & Implementation 72
18/4/2018 Advanced DB Design & Implementation 73
SQL Extensions for OLAP • • Proliferation of OLAP tools fostered development of SQL extensions Many innovations have become part of standard SQL All SQL commands will work in data warehouse as expected Most queries include many data groupings and aggregations over multiple columns 18/4/2018 Advanced DB Design & Implementation 74
END 18/4/2018 Advanced DB Design & Implementation 75
- Slides: 75