Data Wharehousing OLAP and Data Mining 1 Acknowledgments

Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri 2

Overview z. Part 1: 2: 3: 4: Data Warehouses OLAP Data Mining Query Processing

Data, Data everywhere yet. . . z I can’t find the data I need

What is a Data Warehouse? A single, complete and consistent store of data obtained

Why Data Warehousing? Which are our lowest/highest margin customers ? Who are my customers

Decision Support z. Used to manage and control business z. Data is historical or

Evolution of Decision Support z 60’s: Batch reports yhard to find analyze information yinflexible

What are the users saying. . . z Data should be integrated across the

Data Warehousing -It is a process z Technique for assembling and managing data from

Traditional RDBMS used for OLTP z. Database Systems have been used traditionally for OLTP

OLTP vs Data Warehouse z OLTP y. Application Oriented y. Used to run business

Data Warehouse Architecture Relational Databases Legacy Data Purchased Data Optimized Loader Extraction Cleansing Data

From the Data Warehouse to Data Marts Information Less Individually Structured History Normalized Detailed

Users have different views of Data OLAP Tourists: Browse information harvested by farmers Farmers:

Wal*Mart Case Study z. Founded by Sam Walton z. One the largest Super Market

Old Retail Paradigm z Wal*Mart y. Inventory Management y. Merchandise Accounts Payable y. Purchasing

New (Just-In-Time) Retail Paradigm z No more deals z Shelf-Pass Through (POS Application) y

Information as a Strategic Weapon z Daily Summary of all Sales Information z Regional

Schema Design z. Database organization ymust look like business ymust be recognizable by business

Star Schema z A single fact table and for each dimension one dimension table

Dimension Tables z. Dimension tables y. Define business in terms already familiar to users

Fact Table z. Central table y. Typical example: individual sales records ymostly raw numeric

Snowflake schema z Represent dimensional hierarchy directly by normalizing tables. z Easy to maintain

Fact Constellation z. Fact Constellation y. Multiple fact tables that share many dimension tables

Data Granularity in Warehouse z. Summarized data stored yreduce storage costs yreduce cpu usage

Granularity in Warehouse z. Solution is to have dual level of granularity y. Store

Levels of Granularity Banking Example Operational account activity date amount teller location account bal

Data Integration Across Sources Savings Same data different name Loans Different data Same name

Data Transformation Operational/ Source Data Sequential Data Accessing Transformation Reconciling Legacy Capturing Conditioning Relational

Data Integrity Problems z Same person, different spellings y. Agarwal, Agrawal, Aggarwal etc. .

Data Transformation Terms z. Extracting z. Conditioning z. Scrubbing z. Merging z. Householding z.

Data Transformation Terms z. Householding y. Identifying all members of a household (living at

Refresh z. Propagate updates on source data to the warehouse z. Issues: ywhen to

When to Refresh? zperiodically (e. g. , every night, every week) or after significant

Refresh techniques z. Incremental techniques ydetect changes on base tables: replication servers (e. g.

How To Detect Changes z. Create a snapshot log table to record ids of

Querying Data Warehouses z. SQL Extensions z. Multidimensional modeling of data y. OLAP y.

SQL Extensions z. Extended family of aggregate functions yrank (top 10 customers) ypercentile (top

Reporting Tools z Andyne Computing -- GQL z Brio -- Brio. Query z Business

Decision support tools Direct Query Reporting tools Crystal reports Merge Clean Summarize Detailed transactional

Deploying Data Warehouses z What business information keeps you in business today? What business

Cultural Considerations z Not just a technology project z New way of using information

User Training z. Users must have a higher level of IT proficiency than for

Warehouse Products z Computer Associates -- CA-Ingres z Hewlett-Packard -- Allbase/SQL z Informix --

Nature of OLAP Analysis z. Aggregation -- (total sales, percent-tototal) z. Comparison -- Budget

Multi-dimensional Data z. Measure - sales (actual, plan, variance) Re gi on Dimensions: Product,

Conceptual Model for OLAP z. Numeric measures to be analyzed ye. g. Sales (Rs),

Operations z. Rollup: summarize data ye. g. , given sales data, summarize sales for

More Cube Operations z. Slice and dice: select and project ye. g. : Sales

More OLAP Operations z. Hypothesis driven search: E. g. factors affecting defaulters yview defaulting

MOLAP vs ROLAP z. MOLAP: Multidimensional array OLAP z. ROLAP: Relational OLAP 54

SQL Extensions z. Cube operator ygroup by on all subsets of a set of

OLAP: 3 Tier DSS Data Warehouse Database Layer Store atomic data in industry standard

Strengths of OLAP z It is a powerful visualization tool z It provides fast,

Brief History z Express and System W DSS z Online Analytical Processing - coined

OLAP and Executive Information Systems z Andyne Computing -- Pablo z Arbor Software --

Microsoft OLAP strategy z Plato: OLAP server: powerful, integrating various operational sources z OLE-DB

Why Data Mining z Credit ratings/targeted marketing: y Given a database of 100, 000

Data mining z. Process of semi-automatically analyzing large databases to find interesting and useful

Some basic operations z. Predictive: y. Regression y. Classification z. Descriptive: y. Clustering /

Classification z. Given old data about customers and payments, predict new applicant’s loan eligibility.

Classification methods Goal: Predict class Ci = f(x 1, x 2, . . Xn)

Decision trees z. Tree where internal nodes are simple decision rules on one or

Pros and Cons of decision trees • Pros + Reasonable training time + Fast

Neural network z. Set of nodes connected by directed weighted edges Basic NN unit

Pros and Cons of Neural Network • Pros + Can learn more complicated class

Bayesian learning z. Assume a probability model on generation of data. z. Apply bayes

Clustering z Unsupervised learning when old data with class labels not available e. g.

Association rules T Milk, cereal z Given set T of groups of items Tea,

Variants z. High confidence may not imply high correlation z. Use correlations. Find expected

Prevalent Interesting z Analysts already know about prevalent rules z Interesting rules are those

What makes a rule surprising? z. Does not match prior expectation y. Correlation between

Application Areas Industry Finance Insurance Telecommunication Transport Consumer goods Data Service providers Utilities Application

Data Mining in Use z The US Government uses Data Mining to track fraud

Why Now? z. Data is being produced z. Data is being warehoused z. The

Data Mining works with Warehouse Data z Data Warehousing provides the Enterprise with a

Mining market z Around 20 to 30 mining tool vendors z Major players: y.

OLAP Mining integration z. OLAP (On Line Analytical Processing) y. Fast interactive exploration of

State of art in mining OLAP integration z Decision trees [Information discovery, Cognos] yfind

Vertical integration: the web Mining on z Web log analysis for site design: ywhat

Slides: 84

Download presentation

Data Wharehousing, OLAP and Data Mining 1

Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri 2

Overview z. Part 1: 2: 3: 4: Data Warehouses OLAP Data Mining Query Processing and Optimization 3

Part 1: Data Warehouses 4

Data, Data everywhere yet. . . z I can’t find the data I need y data is scattered over the network y many versions, subtle differences z I can’t get the data I need y need an expert to get the data z I can’t understand the data I found y available data poorly documented z I can’t use the data I found y results are unexpected y data needs to be transformed from one form to other 5

What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand use in a business context. [Barry Devlin] 6

Why Data Warehousing? Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom-otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? 7

Decision Support z. Used to manage and control business z. Data is historical or point-in-time z. Optimized for inquiry rather than update z. Use of the system is loosely defined and can be ad-hoc z. Used by managers and end-users to understand the business and make judgements 8

Evolution of Decision Support z 60’s: Batch reports yhard to find analyze information yinflexible and expensive, reprogram every request z 70’s: Terminal based DSS and EIS z 80’s: Desktop data access and analysis tools yquery tools, spreadsheets, GUIs yeasy to use, but access only operational db z 90’s: Data warehousing with integrated OLAP engines and tools 9

What are the users saying. . . z Data should be integrated across the enterprise z Summary data had a real value to the organization z Historical data held the key to understanding data over time z What-if capabilities are required 10

Data Warehousing -It is a process z Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible z A decision support database maintained separately from the organization’s operational database 11

Traditional RDBMS used for OLTP z. Database Systems have been used traditionally for OLTP yclerical data processing tasks ydetailed, up to date data ystructured repetitive tasks yread/update a few records yisolation, recovery and integrity are critical z. Will call these operational systems 12

OLTP vs Data Warehouse z OLTP y. Application Oriented y. Used to run business y. Clerical User y. Detailed data y. Current up to date y. Isolated Data y. Repetitive access by small transactions y. Read/Update access z Warehouse (DSS) y. Subject Oriented y. Used to analyze business y. Manager/Analyst y. Summarized and refined y. Snapshot data y. Integrated Data y. Ad-hoc access using large queries y. Mostly read access (batch update) 13

Data Warehouse Architecture Relational Databases Legacy Data Purchased Data Optimized Loader Extraction Cleansing Data Warehouse Engine Analyze Query Metadata Repository 14

From the Data Warehouse to Data Marts Information Less Individually Structured History Normalized Detailed Departmentally Structured Organizationally Structured Data Warehouse More Data 15

Users have different views of Data OLAP Tourists: Browse information harvested by farmers Farmers: Harvest information from known access paths Organizationally structured Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data 16

Wal*Mart Case Study z. Founded by Sam Walton z. One the largest Super Market Chains in the US z. Wal*Mart: 2000+ Retail Stores z. SAM's Clubs 100+Wholesalers Stores x. This case study is from Felipe Carino’s (NCR Teradata) presentation made at Stanford Database Seminar 17

Old Retail Paradigm z Wal*Mart y. Inventory Management y. Merchandise Accounts Payable y. Purchasing y. Supplier Promotions: National, Region, Store Level z Suppliers y. Accept Orders y. Promote Products y. Provide special Incentives y. Monitor and Track The Incentives y. Bill and Collect Receivables y. Estimate Retailer Demands 18

New (Just-In-Time) Retail Paradigm z No more deals z Shelf-Pass Through (POS Application) y One Unit Price x. Suppliers paid once a week on ACTUAL items sold y Wal*Mart Manager x. Daily Inventory Restock x. Suppliers (sometimes Same. Day) ship to Wal*Mart z Warehouse-Pass Through y Stock some Large Items x. Delivery may come from supplier y Distribution Center x. Supplier’s merchandise unloaded directly onto Wal*Mart Trucks 19

Information as a Strategic Weapon z Daily Summary of all Sales Information z Regional Analysis of all Stores in a logical area z Specific Product Sales z Specific Supplies Sales z Trend Analysis, etc. z Wal*Mart uses information when negotiating with y. Suppliers y. Advertisers etc. 20

Schema Design z. Database organization ymust look like business ymust be recognizable by business user yapproachable by business user y. Must be simple z. Schema Types y. Star Schema y. Fact Constellation Schema y. Snowflake schema 21

Star Schema z A single fact table and for each dimension one dimension table z Does not capture hierarchies directly T i m e c u s t date, custno, prodno, cityname, sales f a c t p r o d c i t y 22

Dimension Tables z. Dimension tables y. Define business in terms already familiar to users y. Wide rows with lots of descriptive text y. Small tables (about a million rows) y. Joined to fact table by a foreign key yheavily indexed ytypical dimensions xtime periods, geographic region (markets, cities), products, customers, salesperson, etc. 23

Fact Table z. Central table y. Typical example: individual sales records ymostly raw numeric items ynarrow rows, a few columns at most ylarge number of rows (millions to a billion) y. Access via dimensions 24

Snowflake schema z Represent dimensional hierarchy directly by normalizing tables. z Easy to maintain and saves storage T i m e c u s t p r o d date, custno, prodno, cityname, . . . f a c t c i t y r e g i o 25 n

Fact Constellation z. Fact Constellation y. Multiple fact tables that share many dimension tables y. Booking and Checkout may share many dimension tables in the hotel industry Hotels Travel Agents Booking Promotion Checkout Room Type Customer 26

Data Granularity in Warehouse z. Summarized data stored yreduce storage costs yreduce cpu usage yincreases performance since smaller number of records to be processed ydesign around traditional high level reporting needs ytradeoff with volume of data to be stored and detailed usage of data 27

Granularity in Warehouse z. Solution is to have dual level of granularity y. Store summary data on disks x 95% of DSS processing done against this data y. Store detail on tapes x 5% of DSS processing against this data 28

Levels of Granularity Banking Example Operational account activity date amount teller location account bal account month # trans withdrawals monthly account deposits register -- up to average bal 10 years 60 days of activity Not all fields need be archived amount activity date amount account bal 29

Data Integration Across Sources Savings Same data different name Loans Different data Same name Trust Data found here nowhere else Credit card Different keys same data 30

Data Transformation Operational/ Source Data Sequential Data Accessing Transformation Reconciling Legacy Capturing Conditioning Relational Extracting Loading Householding Validating External Filtering Scoring z. Data transformation is the foundation for achieving single version of the truth z. Major concern for IT z. Data warehouse can fail if appropriate data transformation strategy is not developed 31

Data Integrity Problems z Same person, different spellings y. Agarwal, Agrawal, Aggarwal etc. . . z Multiple ways to denote company name y. Persistent Systems, PSPL, Persistent Pvt. LTD. z Use of different names ymumbai, bombay z Different account numbers generated by different applications for the same customer z Required fields left blank z Invalid product codes collected at point of sale ymanual entry leads to mistakes y“in case of a problem use 9999999” 32

Data Transformation Terms z. Extracting z. Conditioning z. Scrubbing z. Merging z. Householding z. Enrichment z. Scoring z. Loading z. Validating z. Delta Updating 33

Data Transformation Terms z. Householding y. Identifying all members of a household (living at the same address) y. Ensures only one mail is sent to a household y. Can result in substantial savings: 1 million catalogues at $50 each costs $50 million. A 2% savings would save $1 million 34

Refresh z. Propagate updates on source data to the warehouse z. Issues: ywhen to refresh yhow to refresh -- incremental refresh techniques 35

When to Refresh? zperiodically (e. g. , every night, every week) or after significant events zon every update: not warranted unless warehouse data require current data (up to the minute stock quotes) zrefresh policy set by administrator based on user needs and traffic zpossibly different policies for different sources 36

Refresh techniques z. Incremental techniques ydetect changes on base tables: replication servers (e. g. , Sybase, Oracle, IBM Data Propagator) xsnapshots (Oracle) xtransaction shipping (Sybase) ycompute changes to derived and summary tables ymaintain transactional correctness for incremental load 37

How To Detect Changes z. Create a snapshot log table to record ids of updated rows of source data and timestamp z. Detect changes by: y. Defining after row triggers to update snapshot log when source table changes y. Using regular transaction log to detect changes to source data 38

Querying Data Warehouses z. SQL Extensions z. Multidimensional modeling of data y. OLAP y. More on OLAP later … 39

SQL Extensions z. Extended family of aggregate functions yrank (top 10 customers) ypercentile (top 30% of customers) ymedian, mode y. Object Relational Systems allow addition of new aggregate functions z. Reporting features yrunning total, cumulative totals 40

Reporting Tools z Andyne Computing -- GQL z Brio -- Brio. Query z Business Objects -- Business Objects z Cognos -- Impromptu z Information Builders Inc. -- Focus for Windows z Oracle -- Discoverer 2000 z Platinum Technology -- SQL*Assist, Pro. Reports z Power. Soft -- Info. Maker z SAS Institute -- SAS/Assist z Software AG -- Esperant z Sterling Software -- VISION: Data 41

Decision support tools Direct Query Reporting tools Crystal reports Merge Clean Summarize Detailed transactional data OLAP Essbase Mining tools Intelligent Miner Relational DBMS+ e. g. Redbrick Data warehouse Operational data Bombay branch Delhi branch Oracle GIS data Calcutta branch IMS Census data SAS 42

Deploying Data Warehouses z What business information keeps you in business today? What business information can put you out of business tomorrow? z What business information should be a mouse click away? z What business conditions are the driving the need for business information? 43

Cultural Considerations z Not just a technology project z New way of using information to support daily activities and decision making z Care must be taken to prepare organization for change z Must have organizational backing and support 44

User Training z. Users must have a higher level of IT proficiency than for operational systems z. Training to help users analyze data in the warehouse effectively 45

Warehouse Products z Computer Associates -- CA-Ingres z Hewlett-Packard -- Allbase/SQL z Informix -- Informix, Informix XPS z Microsoft -- SQL Server z Oracle – Oracle z Red Brick -- Red Brick Warehouse z SAS Institute -- SAS z Software AG -- ADABAS z Sybase -- SQL Server, IQ, MPP 46

Part 2: OLAP 47

Nature of OLAP Analysis z. Aggregation -- (total sales, percent-tototal) z. Comparison -- Budget vs. Expenses z. Ranking -- Top 10, quartile analysis z. Access to detailed and aggregate data z. Complex criteria specification z. Visualization z Need interactive response to aggregate queries 48

Multi-dimensional Data z. Measure - sales (actual, plan, variance) Re gi on Dimensions: Product, Region, Time Hierarchical summarization paths Product W S N Juice Cola Milk Cream Toothpaste Soap 1 2 34 5 6 7 Month Product Industry Region Country Time Year Category Region Quarter Product City Office Month Day week 49

Conceptual Model for OLAP z. Numeric measures to be analyzed ye. g. Sales (Rs), sales (volume), budget, revenue, inventory z. Dimensions yother attributes of data, define the space ye. g. , store, product, date-of-sale yhierarchies on dimensions xe. g. branch -> city -> state 50

Operations z. Rollup: summarize data ye. g. , given sales data, summarize sales for last year by product category and region z. Drill down: get more details ye. g. , given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region 51

More Cube Operations z. Slice and dice: select and project ye. g. : Sales of soft-drinks in Andhra over the last quarter z. Pivot: change the view of data y Q 1 Q 2 L S Total Red Blue Total L S Total 52

More OLAP Operations z. Hypothesis driven search: E. g. factors affecting defaulters yview defaulting rate on age aggregated over other dimensions yfor particular age segment detail along profession z Need interactive response to aggregate queries y=> precompute various aggregates 53

MOLAP vs ROLAP z. MOLAP: Multidimensional array OLAP z. ROLAP: Relational OLAP 54

SQL Extensions z. Cube operator ygroup by on all subsets of a set of attributes (month, city) yredundant scan and sorting of data can be avoided z. Various other non-standard SQL extensions by vendors 55

OLAP: 3 Tier DSS Data Warehouse Database Layer Store atomic data in industry standard Data Warehouse. OLAP Engine Decision Support Client Application Logic Layer Presentation Layer Generate SQL execution plans in the OLAP engine to obtain OLAP functionality. Obtain multidimensional reports from the DSS Client. 56

Strengths of OLAP z It is a powerful visualization tool z It provides fast, interactive response times z It is good for analyzing time series z It can be useful to find some clusters and outliners z Many vendors offer OLAP tools 57

Brief History z Express and System W DSS z Online Analytical Processing - coined by EF Codd in 1994 - white paper by Arbor Software z Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System z MOLAP: Multidimensional OLAP (Hyperion (Arbor Essbase), Oracle Express) z ROLAP: Relational OLAP (Informix Meta. Cube, Microstrategy DSS Agent) 58

OLAP and Executive Information Systems z Andyne Computing -- Pablo z Arbor Software -- Essbase z Cognos -- Power. Play z Comshare -- Commander OLAP z Holistic Systems -- Holos z Information Advantage -AXSYS, Web. OLAP z Informix -- Metacube z Microstrategies -DSS/Agent z Oracle -- Express z Pilot -- Light. Ship z Planning Sciences -Gentium z Platinum Technology -Prodea. Beacon, Forest & Trees z SAS Institute -- SAS/EIS, OLAP++ z Speedware -- Media 59

Microsoft OLAP strategy z Plato: OLAP server: powerful, integrating various operational sources z OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP z Pivot-table services: integrate with Office 2000 y. Every desktop will have OLAP capability. z Client side caching and calculations z Partitioned and virtual cube z Hybrid relational and multidimensional storage 60

Part 3: Data Mining 61

Why Data Mining z Credit ratings/targeted marketing: y Given a database of 100, 000 names, which persons are the least likely to default on their credit cards? y Identify likely responders to sales promotions z Fraud detection y Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer? z Customer relationship management: y Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor? : Data Mining helps extract such information 62

Data mining z. Process of semi-automatically analyzing large databases to find interesting and useful patterns z. Overlaps with machine learning, statistics, artificial intelligence and databases but ymore scalable in number of features and instances ymore automated to handle heterogeneous data 63

Some basic operations z. Predictive: y. Regression y. Classification z. Descriptive: y. Clustering / similarity matching y. Association rules and variants y. Deviation detection 64

Classification z. Given old data about customers and payments, predict new applicant’s loan eligibility. Previous customers Age Salary Profession Location Customer type Classifier Decision rules Salary > 5 L Prof. = Exec Good/ bad New applicant’s data 65

Classification methods Goal: Predict class Ci = f(x 1, x 2, . . Xn) z. Regression: (linear or any other polynomial) ya*x 1 + b*x 2 + c = Ci. z. Nearest neighour z. Decision tree classifier: divide decision space into piecewise constant regions. z. Probabilistic/generative models z. Neural networks: partition by non-linear boundaries 66

Decision trees z. Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. Salary < 1 M Prof = teacher Good Bad Age < 30 Bad Good 67

Pros and Cons of decision trees • Pros + Reasonable training time + Fast application + Easy to interpret + Easy to implement + Can handle large number of features • Cons – Cannot handle complicated relationship between features – simple decision boundaries – problems with lots of missing data More information: http: //www. stat. wisc. edu/~limt/treeprogs. html 68

Neural network z. Set of nodes connected by directed weighted edges Basic NN unit x 1 x 2 x 3 w 1 w 2 w 3 A more typical NN x 1 x 2 x 3 Output nodes Hidden nodes 69

Pros and Cons of Neural Network • Pros + Can learn more complicated class boundaries + Fast application + Can handle large number of features • Cons – Slow training time – Hard to interpret – Hard to implement: trial and error for choosing number of nodes Conclusion: Use neural nets only if decision trees/NN fail. 70

Bayesian learning z. Assume a probability model on generation of data. z. Apply bayes theorem to find most likely class as: z. Naïve bayes: Assume attributes conditionally independent given class value 71

Clustering z Unsupervised learning when old data with class labels not available e. g. when introducing a new product. z Group/cluster existing customers based on time series of payment history such that similar customers in same cluster. z Key requirement: Need a good measure of similarity between instances. z Identify micro-markets and develop policies for each 72

Association rules T Milk, cereal z Given set T of groups of items Tea, milk z Example: set of item sets purchased Tea, rice, bread z Goal: find all rules on itemsets of the form a-->b such that y support of a and b > user threshold s yconditional probability (confidence) of b given a > user threshold c z Example: Milk --> bread z Purchase of product A --> service B cereal 73

Variants z. High confidence may not imply high correlation z. Use correlations. Find expected support and large departures from that interesting. . ysee statistical literature on contingency tables. z. Still too many rules, need to prune. . . 74

Prevalent Interesting z Analysts already know about prevalent rules z Interesting rules are those that deviate from prior expectation z Mining’s payoff is in finding surprising phenomena Zzzz. . . 1995 Milk and cereal sell together! 1998 Milk and cereal sell together! 75

What makes a rule surprising? z. Does not match prior expectation y. Correlation between milk and cereal remains roughly constant over time z. Cannot be trivially derived from simpler rules y. Milk 10%, cereal 10% y. Milk and cereal 10% … surprising y. Eggs 10% y. Milk, cereal and eggs 0. 1% … surprising! y. Expected 1% 76

Application Areas Industry Finance Insurance Telecommunication Transport Consumer goods Data Service providers Utilities Application Credit Card Analysis Claims, Fraud Analysis Call record analysis Logistics management promotion analysis Value added data Power usage analysis 77

Data Mining in Use z The US Government uses Data Mining to track fraud z A Supermarket becomes an information broker z Basketball teams use it to track game strategy z Cross Selling z Target Marketing z Holding on to Good Customers z Weeding out Bad Customers 78

Why Now? z. Data is being produced z. Data is being warehoused z. The computing power is available z. The computing power is affordable z. The competitive pressures are strong z. Commercial products are available 79

Data Mining works with Warehouse Data z Data Warehousing provides the Enterprise with a memory z Data Mining provides the Enterprise with intelligence 80

Mining market z Around 20 to 30 mining tool vendors z Major players: y. Clementine, y. IBM’s Intelligent Miner, y. SGI’s Mine. Set, y. SAS’s Enterprise Miner. z All pretty much the same set of tools z Many embedded products: fraud detection, electronic commerce applications 81

OLAP Mining integration z. OLAP (On Line Analytical Processing) y. Fast interactive exploration of multidim. aggregates. y. Heavy reliance on manual operations for analysis: y. Tedious and error-prone on large multidimensional data z Ideal platform for vertical integration of mining but needs to be interactive instead of batch. 82

State of art in mining OLAP integration z Decision trees [Information discovery, Cognos] yfind factors influencing high profits z Clustering [Pilot software] ysegment customers to define hierarchy on that dimension z Time series analysis: [Seagate’s Holos] y. Query for various shapes along time: eg. spikes, outliers etc z. Multi-level Associations [Han et al. ] yfind association between members of dimensions 83

Vertical integration: the web Mining on z Web log analysis for site design: ywhat are popular pages, ywhat links are hard to find. z Electronic stores sales enhancements: yrecommendations, advertisement: y. Collaborative filtering: Net perception, Wisewire y. Inventory control: what was a shopper looking for and could not find. . 84