Course outlines: Data Mining and Data Warehousing

  • Slides: 20

Course outlines

Data Mining and Data Warehousing: Concepts and Techniques

  • Motivation
  • Evolution of Database Technology - overview
  • Why Data Mining? - Potential Applications
  • What Is Data Mining? Data Mining: A KDD Process
  • Data Mining: On What Kind of Data?

Next: What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP

Motivation

  • Data explosion problem: automated data collection tools and database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
  • Data are collected from everywhere and in huge amounts.
  • We are data rich but information poor.
  • How to make good use of your data?

We are Data Rich but Information Poor

  • Databases are too big.
  • Data mining can help discover knowledge.
  • "Terrorbytes"

Data warehousing and data mining - Overview

  • On-line analytical processing (OLAP)
  • Extraction of interesting knowledge (rules, patterns, …) from data in large databases.
  • Bring together scattered information from multiple sources so as to provide a consistent database source for decision support queries.
  • Provide architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

Evolution of Database Technology - overview

  • 1960s: Data collection, database creation, IMS and network DBMS.
  • 1970s: Relational data model, relational DBMS implementation.
  • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
  • 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Why Data Mining? - Potential Applications

  • Database analysis and decision support
      • Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
      • Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
      • Fraud detection and management
  • Other applications
      • Text mining (news group, email, documents) and Web analysis.
      • Intelligent query answering

Market Analysis and Management

  • Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
  • Target marketing: find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  • Determine customer purchasing patterns over time:
      • Conversion of a single to a joint bank account: marriage, etc.
  • Cross-market analysis:
      • Associations/correlations between product sales
      • Prediction based on the association information.
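The "associations between product sales" idea can be illustrated with the two measures commonly used in market basket analysis, support and confidence. The following is only a minimal sketch: the transactions, item names, and the helper functions `pair_support` and `confidence` are hypothetical and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that every pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often baskets containing `antecedent` also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
print(support[("bread", "milk")])                 # share of baskets with both items
print(confidence(transactions, "butter", "milk"))  # how often butter buyers also buy milk
```

A pair with high support and high confidence is a candidate for cross-selling, although, as a later slide warns, such an association says nothing about causality.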

Market Analysis and Management (2)

  • Customer profiling
      • Data mining can tell you what types of customers buy what products (clustering or classification).
  • Identifying customer requirements
      • Identifying the best products for different customers
      • Use prediction to find what factors will attract new customers
  • Provides summary information
      • Various multidimensional summary reports
      • Statistical summary information (data central tendency and variation)
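As a small illustration of "statistical summary information", the sketch below computes a few measures of central tendency and variation with Python's standard `statistics` module. The spending figures and the `summarize` helper are made up for the example.

```python
import statistics

# Hypothetical monthly spending (arbitrary currency units) for one customer segment.
spending = [120.0, 95.0, 130.0, 110.0, 400.0, 105.0]

def summarize(values):
    """Return basic central-tendency and variation measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }

summary = summarize(spending)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The single large value (400.0) pulls the mean well above the median, which is one reason a summary report usually shows both.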

Corporate Analysis and Risk Management

  • Finance planning and asset evaluation:
      • Cash flow analysis and prediction
      • Contingent claim analysis to evaluate assets
      • Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
  • Resource planning: summarize and compare the resources and spending
  • Competition:
      • Monitor competitors and market directions (CI: competitive intelligence).
      • Group customers into classes and apply a class-based pricing procedure.
      • Set pricing strategy in a highly competitive market (e.g., REPSOL gas station chain in Spain).

Fraud Detection and Management

  • Applications:
      • Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  • Approach:
      • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
  • Examples:
      • Auto insurance: detect a group of people who stage accidents to collect on insurance
      • Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
      • Medical insurance: detect professional patients, rings of doctors and rings of references

Fraud Detection and Management (2)

  • More examples:
      • Detecting inappropriate medical treatment:
          • The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
      • Detecting telephone fraud:
          • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
          • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
      • Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.
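"Analyze patterns that deviate from an expected norm" can be sketched with a simple z-score test on one attribute of the call model, the call duration. This is only an illustration under stated assumptions: the durations, the threshold of 3 standard deviations, and the function `flag_unusual_calls` are hypothetical, and a real call model would also use destination and time of day or week.

```python
import statistics

def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Flag durations more than `threshold` sample standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # With no variation in the history, any different duration is unusual.
        return [d for d in new_calls if d != mean]
    return [d for d in new_calls if abs((d - mean) / stdev) > threshold]

# Hypothetical call durations in minutes for one subscriber.
history_minutes = [3, 5, 4, 6, 5, 4, 5, 3, 6, 4]
new_minutes = [5, 4, 95, 6]

print(flag_unusual_calls(history_minutes, new_minutes))
```

A flagged call is only a candidate for review; unusual does not necessarily mean fraudulent.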

Other Applications

  • Sports
      • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.
  • Astronomy
      • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
  • Internet Web Surf-Aid
      • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site

Data Mining Should Not be Used Blindly!

  • Data mining finds regularities from history, but history is not the same as the future.
  • Association does not dictate trend nor causality!?
      • Drinking diet drinks leads to obesity!
      • David Heckerman's counter-example (1997): barbecue sauce, hot dogs and hamburgers.
  • Some abnormal data could be caused by humans!
      • 37 °C? Why not registered by doctors?

What Is Data Mining? (1/2)

  • Data mining (part of knowledge discovery in databases):
      • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases
  • Alternative names and their "inside stories":
      • Data mining: a misnomer?
      • Knowledge Discovery in Databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.
  • What is not data mining?
      • (Deductive) query processing.
      • Expert systems or small statistical programs

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

Steps of a KDD Process

[Figure: KDD process diagram]

  • Learning the application domain: relevant prior knowledge and goals of application

Data warehousing
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of effort!)
  • Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.

Data mining
  • Choosing functions of data mining: summarization, classification, regression, association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Interpretation: analysis of results (visualization, transformation, removing redundant patterns, etc.)
  • Use of discovered knowledge
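The steps above can be sketched as a chain of small functions. This is a toy illustration, not a real KDD system: the records, field names, the choice of summarization as the mining function, and the threshold passed to `evaluate` are all assumptions made for the example.

```python
import statistics

# Hypothetical raw records: (customer_id, region, monthly_spend); None marks a missing value.
raw_records = [
    (1, "north", 120.0),
    (2, "north", None),
    (3, "south", 80.0),
    (4, "north", 140.0),
    (5, "south", 100.0),
    (3, "south", 80.0),  # exact duplicate of an earlier row
]

def clean(records):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

def select(records):
    """Selection and projection: keep only the task-relevant fields (region, spend)."""
    return [(region, spend) for _, region, spend in records]

def mine(records):
    """Data mining (summarization): average spend per region."""
    by_region = {}
    for region, spend in records:
        by_region.setdefault(region, []).append(spend)
    return {region: statistics.mean(values) for region, values in by_region.items()}

def evaluate(patterns, min_spend):
    """Pattern evaluation: keep only regions whose average reaches a chosen threshold."""
    return {region: avg for region, avg in patterns.items() if avg >= min_spend}

patterns = mine(select(clean(raw_records)))
interesting = evaluate(patterns, min_spend=100.0)
print(patterns)
print(interesting)
```

Keeping each stage as its own function mirrors the process diagram: any stage can be replaced (for example, swapping summarization for clustering) without touching the others.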

Architecture: Typical Data Mining System

[Figure: layered architecture, from top to bottom]

  • Graphical user interface
  • Pattern evaluation
  • Data mining engine
  • Database or data warehouse server
  • Data cleaning & data integration; filtering
  • Databases; data warehouse

A knowledge base sits alongside these layers.

Data Mining and Business Intelligence

[Figure: layered diagram; potential to support business decisions increases toward the top]

  • Making Decisions
  • Data Presentation: Visualization Techniques
  • Data Mining: Information Discovery
  • Data Exploration: Statistical Analysis, Querying and Reporting
  • Data Warehouses / Data Marts: OLAP, MDA
  • Data Sources: Paper, Files, Information Providers, Database Systems

Roles shown beside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA.

Data Mining: Confluence of Multiple Disciplines

  • Database systems, data warehouse and OLAP
  • Statistics
  • Machine learning
  • Visualization
  • Information science
  • High performance computing
  • Other disciplines: neural networks, mathematical modeling, information retrieval, pattern recognition, etc.

Data Mining: On What Kind of Data?

Data mining is performed on data coming from:

  • Relational databases
  • Transactional databases
  • Advanced DB systems and information repositories
      • Object-oriented and object-relational databases
      • Spatial databases
      • Time-series data and temporal data
      • Text databases and multimedia databases
      • Heterogeneous and legacy databases
      • WWW

… and accumulated in a data warehouse for long periods of time (several months or sometimes years).