Data Mining and Data Warehousing By Dr NK
Data Mining and Data Warehousing By Dr. NK SAKTHIVEL, Professor/ So. C SASTRA 10/2/2020 Data Mining: Concepts and Techniques 1
Evolution of Database Technology
• 1960s and earlier: data collection, database creation, and primitive file processing
  – Hierarchical model
• 1970s to early 1980s: database management systems
  – Network DBMS
  – Relational data model, relational DBMS implementation
  – Data modeling tools
  – Indexing: B+ tree and hashing
  – Query languages
  – Transaction management: recovery, concurrency control
  – On-line transaction processing (OLTP)
Evolution of Database Technology
• Mid-1980s:
  – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s to 2000s: data analysis and understanding
  – Data mining and data warehousing, multimedia databases, and Web databases
Database Processing vs. Data Mining Processing
• Database processing
  – Query: well defined; expressed in SQL
  – Data: operational data
  – Output: precise; a subset of the database
• Data mining processing
  – Query: poorly defined; no precise query language
  – Data: not operational data
  – Output: fuzzy; not a subset of the database
Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)
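The contrast between the two kinds of query can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `purchases` table and made-up rows; the SQL query returns a subset of the stored data, while the mining-style question computes a derived summary that is not stored anywhere.

```python
import sqlite3
from collections import Counter

# Hypothetical purchase records: (customer, basket_id, item).
rows = [
    ("Smith", 1, "milk"), ("Smith", 1, "bread"),
    ("Jones", 2, "milk"), ("Jones", 2, "bread"), ("Jones", 2, "eggs"),
    ("Lee", 3, "beer"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, basket_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# Database query: well defined, and the answer is a subset of the stored data.
milk_buyers = [r[0] for r in conn.execute(
    "SELECT DISTINCT customer FROM purchases WHERE item = 'milk' ORDER BY customer")]

# Mining-style question: which items tend to appear in baskets that contain milk?
baskets = {}
for _, basket_id, item in rows:
    baskets.setdefault(basket_id, set()).add(item)
co_purchased = Counter(
    item
    for items in baskets.values() if "milk" in items
    for item in items if item != "milk"
)
```

Here `milk_buyers` lists stored customers, whereas `co_purchased` is a count of co-occurrences that had to be discovered from the baskets.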
What Is Data Mining?
• Data mining (knowledge discovery in databases)
  – Extraction of interesting (implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names and their "inside stories"
  – Data mining: a misnomer?
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs
What Is Data Mining?
• Gold mining / rock mining / sand mining
• Data mining / knowledge mining
Why Data Mining? Potential Applications
• Database analysis and decision support
  – Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (news group, email, documents) and Web analysis
  – Intelligent query answering
Market Analysis and Management (1)
• Where are the data sources for analysis?
  – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  – Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  – Associations/correlations between product sales
  – Prediction based on the association information
Market Analysis and Management (2)
• Customer profiling
  – Data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  – Identifying the best products for different customers
  – Use prediction to find what factors will attract new customers
• Provides summary information
  – Various multidimensional summary reports
  – Statistical summary information (data central tendency and variation)
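The "central tendency and variation" summary mentioned above can be sketched with Python's standard `statistics` module. The spending figures are hypothetical and only illustrate the kind of summary meant.

```python
import statistics

# Hypothetical monthly spending per customer (any currency unit).
spending = [10, 20, 20, 30, 70]

# Central tendency.
mean_spend = statistics.mean(spending)
median_spend = statistics.median(spending)

# Variation: population standard deviation of the same values.
spread = statistics.pstdev(spending)
```

The mean is pulled upward by the single large value while the median is not, which is why both are usually reported together.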
Corporate Analysis and Risk Management
• Finance planning and asset evaluation
  – Cash flow analysis and prediction
  – Contingent claim analysis to evaluate assets
  – Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
  – Summarize and compare the resources and spending
• Competition
  – Monitor competitors and market directions
  – Group customers into classes and a class-based pricing procedure
  – Set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
• Applications
  – Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
  – Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
  – Auto insurance: detect a group of people who stage accidents to collect on insurance
  – Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  – Medical insurance: detect professional patients and rings of doctors and rings of references
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
  – Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
• Detecting telephone fraud
  – Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  – British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
• Retail
  – Analysts estimate that 38% of retail shrink is due to dishonest employees.
Other Applications
• Sports
  – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Astronomy
  – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  – IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain
  – Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
  – Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  – Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  – Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
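The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, the function names (`select`, `clean`, `transform`, `mine`, `evaluate`) and the counting-based "mining" step are all hypothetical stand-ins, not a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records: (region, age, item); None marks a missing value.
raw = [
    ("north", 25, "PC"), ("north", None, "PC"), ("south", 31, "phone"),
    ("north", 42, "PC"), ("north", -18, "phone"), ("north", 37, "phone"),
    ("north", 28, "PC"),
]

def select(records, region):
    """Create the target data set: keep only task-relevant records."""
    return [r for r in records if r[0] == region]

def clean(records):
    """Drop records with missing or impossible ages."""
    return [r for r in records if r[1] is not None and r[1] >= 0]

def transform(records):
    """Reduce each record to (age band, item), e.g. 25 -> '20..29'."""
    return [(f"{age // 10 * 10}..{age // 10 * 10 + 9}", item)
            for _, age, item in records]

def mine(records):
    """Toy mining step: count how often each (age band, item) pair occurs."""
    return Counter(records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(mine(transform(clean(select(raw, "north")))), min_count=2)
```

With this data only the pair ("20..29", "PC") survives evaluation; most of the code is selection, cleaning and transformation, which mirrors the remark that preprocessing takes a large share of the effort.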
Data Mining and Business Intelligence
[Figure: pyramid; the potential to support business decisions increases toward the top. Layers, top to bottom:]
• Making decisions
• Data presentation: visualization techniques
• Data mining: information discovery
• Data exploration: statistical analysis, querying and reporting
• Data warehouses / data marts: OLAP, MDA
• Data sources: paper, files, information providers, database systems, OLTP
[Roles shown beside the pyramid: End User, Business Analyst, DBA]
Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom:]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration; filtering
• Databases; data warehouse
[A knowledge base is shown beside the layers.]
Data Mining: On What Kind of Data?
• Relational databases: interrelated operational data
• Data warehouses: information collected from different databases
• Transactional databases: files/tables of transactions
• Advanced DB systems and their applications
  – Object-oriented and object-relational databases
  – Spatial databases: e.g., data from satellites
  – Time-series databases: sequences over time (e.g., market data)
  – Temporal databases: information with timestamps
  – Text databases and multimedia databases: objects and images
  – Heterogeneous and legacy databases: e.g., network databases
  – WWW
Data Mining Functionalities
[Figure: two groups of functionalities, one labeled "Perform Inference for Prediction" and one labeled "General Properties"]
Data Mining Functionalities (1)
• Concept description: characterization and discrimination
  – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions (further precision/accuracy is possible)
• Association (correlation and causality)
  – Link analysis or association
  – Multi-dimensional vs. single-dimensional association
  – X is a customer and T is a transaction
  – age(X, "20..29") ^ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
  – Support is the percentage of transactions that satisfy the whole rule; confidence is the degree of certainty of the detected association
  – contains(T, "computer") → contains(T, "software") [1%, 50%]
  – Bread with pretzels is 60% and bread with jelly is 70%
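Support and confidence can be computed directly from a list of transactions. The sketch below assumes four hypothetical transactions; the helper names `support` and `confidence` are illustrative, and the resulting percentages are not the ones quoted on the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

# Rule: contains(T, "computer") -> contains(T, "software")
s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
```

One of four transactions contains both items, so the support is 25%; of the two transactions with a computer, one also has software, so the confidence is 50%.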
Data Mining Functionalities (2)
• Classification and prediction (supervised)
  – Finding groups/classes/models (functions) that describe and distinguish classes or concepts for future prediction
  – E.g., classify countries based on climate, or classify cars based on mileage
  – Presentation: decision tree, classification rule, neural network
  – Prediction: predict some unknown or missing numerical values
• Pattern matching
  – A credit card company must decide whether to authorize a credit card purchase:
    i. Authorize
    ii. Ask for further identification before authorizing
    iii. Do not authorize
    iv. Do not authorize and contact police
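As a rough illustration of supervised classification, the sketch below labels a car by the class of its nearest labeled neighbor. The nearest-neighbor rule is a stand-in chosen for brevity (the slide names decision trees, rules and neural networks), and the mileage figures and class labels are hypothetical.

```python
def nearest_neighbor(train, x):
    """Return the label of the training example whose feature is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Hypothetical labeled cars: (mileage in km per litre, class label).
train = [(8, "low"), (10, "low"), (18, "high"), (22, "high")]

label_a = nearest_neighbor(train, 11)  # closest training value is 10
label_b = nearest_neighbor(train, 19)  # closest training value is 18
```

The class labels are known in advance for the training data, which is what makes the task supervised.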
Data Mining Functionalities (2)
• Cluster analysis (unsupervised)
  – Used in pattern analysis, data analysis, image processing and market research
  – By clustering, one can identify dense and sparse regions and discover overall distribution patterns
  – Early in childhood, one learns how to distinguish between cats and dogs or animals and plants
  – Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  – Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
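A minimal one-dimensional k-means sketch shows how groups form without class labels. k-means is only one possible clustering method and is used here as an assumption; the house prices and starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in some unit); no class labels are given.
prices = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(prices, centers=[1, 12])
```

Points within a cluster end up close to their own center (high intra-class similarity) and far from the other center (low inter-class similarity).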
Data Mining Functionalities (3)
• Outlier analysis (used to identify errors)
  – Outlier: a data object that does not comply with the general behavior of the data
  – For example, a program output such as:
    – Age: -18
    – Salary of the chief executive officer of a company
  – Data mining algorithms try to minimize the influence of outliers or eliminate them altogether
  – An outlier can be considered noise or an exception, but it is quite useful in fraud detection and rare events analysis
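Both kinds of outlier in the example can be flagged with simple checks. The sketch below assumes a plausible age range for the domain check and uses the median absolute deviation (MAD) with a cut-off of 5 for the salary check; the data, the range and the cut-off are illustrative assumptions.

```python
from statistics import median

def invalid_ages(ages, low=0, high=120):
    """Domain check: ages outside an assumed plausible range are errors."""
    return [a for a in ages if a < low or a > high]

def mad_outliers(values, k=5.0):
    """Flag values more than k median-absolute-deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) > k * mad]

# Hypothetical data: one impossible age, one CEO-sized salary (in thousands).
ages = [25, 31, -18, 47]
salaries = [40, 42, 45, 47, 50, 900]

bad_ages = invalid_ages(ages)
unusual_salaries = mad_outliers(salaries)
```

The negative age is an error to be removed, while the large salary is a genuine but exceptional value, the kind of exception that matters in fraud or rare-event analysis.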
Data Mining Functionalities (3)
• Evolution analysis
  – Describes and models regularities or trends for objects whose behavior changes over time
  – Includes association, classification, or clustering of time-related data. However, its distinct features are:
    – Time-series data analysis (e.g., stock market analysis over the last several years to invest in shares of a high-tech industrial company)
    – Sequence or periodicity pattern matching
    – Similarity-based data analysis
Are All of the Patterns Interesting?
• A data mining system can generate thousands of patterns or rules
1. What makes patterns interesting?
2. Can a data mining system generate all of the interesting patterns?
3. Can a data mining system generate only interesting patterns?
Are All of the Patterns Interesting?
• What makes patterns interesting?
  – Interesting means knowledge
  – A hypothesis becomes confirmed
  – Support and confidence
• Can a data mining system generate all of the interesting patterns?
  – Needs completeness of the DM algorithm
  – Depends on the association rule
Mining Association Rules: An Example
• Min. support: 50%; min. confidence: 50%
• For rule A → C:
  – support = support({A, C}) = 50%
  – confidence = support({A, C}) / support({A}) = 66.6%
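The slide's transaction table did not survive extraction, so the four transactions below are an assumption, chosen so that rule A → C comes out at 50% support and 66.6% confidence. The brute-force enumeration is a sketch for tiny data, not an efficient algorithm such as Apriori.

```python
from itertools import combinations

# Assumed transactions (not from the slide), picked to reproduce its numbers.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.5

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Enumerate all 1- and 2-itemsets and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c): support(c)
    for size in (1, 2)
    for c in combinations(items, size)
    if support(c) >= MIN_SUPPORT
}

# Derive rules lhs -> rhs from each frequent pair.
rules = {}
for itemset, s in frequent.items():
    if len(itemset) == 2:
        for lhs in itemset:
            (rhs,) = itemset - {lhs}
            conf = s / support({lhs})
            if conf >= MIN_CONFIDENCE:
                rules[(lhs, rhs)] = (s, conf)
```

Under these assumed transactions, {A, C} is the only frequent pair; A → C has confidence 0.5 / 0.75 (about 66.6%) and C → A has confidence 100%.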
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on the disciplines below]
• Database technology
• Statistics
• Machine learning
• Information science
• Visualization
• Fuzzy/Neural/GA
• Other disciplines
Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
Categorization of Data Mining
• Classification according to the kinds of database mined
• Classification according to the kinds of knowledge mined
• Classification according to the kinds of technique utilized
• Classification according to the kinds of application adapted
Data Mining: Classification Schemes
• General functionality
  – Descriptive data mining
  – Predictive data mining
• Different views, different classifications
  – Kinds of databases to be mined
  – Kinds of knowledge to be discovered
  – Kinds of techniques utilized
  – Kinds of applications adapted
Ex: Time Series Analysis
• Example: stock market
• Predict future values
• Determine similar patterns over time
• Classify behavior
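"Predict future values" can be illustrated with the simplest possible forecaster, a moving average. This is a sketch under assumptions: the closing prices are hypothetical, and a moving average is only a baseline, not a recommended stock-market model.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15]
next_price = moving_average_forecast(prices)  # mean of 11, 13, 15
```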
Related Concepts Outline
Goal: Examine some areas which are related to data mining.
• Database/On-Line Transaction Processing systems
• Fuzzy sets and logic
• Information retrieval (Web search engines)
• On-Line Analytic Processing / DSS
• Statistics
• Machine learning
• Pattern matching
Major Issues in Data Mining
• Regarding:
  – Mining methodology and user interaction
  – Performance
  – Diverse data types
Major Issues in Data Mining: Methodology and User Interaction
• The kinds of knowledge mined
  – Different users have different knowledge
  – Hence, a spectrum of data analysis and knowledge discovery tasks with data mining functionalities
  – These tasks may use the same database with different techniques, which leads to different discoveries
• The ability to mine knowledge
  – Interactive mining, refining knowledge with OLAP
  – Incorporation of background knowledge with deduction rules
  – The use of domain knowledge and ad hoc mining, which is used for efficient data mining
  – Knowledge visualization such as graphs, charts, curves
Major Issues in Data Mining: Performance Issues
• Includes efficiency, scalability and parallelism of DM algorithms
• Efficiency and scalability of data mining algorithms
  – DM algorithms should be efficient and scalable
  – The running time should be predictable for large databases
  – Computational complexity should be analyzed
• Parallel, distributed and incremental DM algorithms
  – The algorithm divides the data into partitions, which can then be processed in parallel
  – The results can be merged later
  – If needed, the database can be updated (incrementally) before computation
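The partition, process-in-parallel, merge idea can be sketched with item counting, a building block of many mining algorithms. The thread pool, the helper names and the transactions are illustrative assumptions; a real system would distribute much larger partitions across processes or machines. The last line shows one way an incremental update might fold in new data without recounting the old.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Count item occurrences within one partition of the transactions."""
    return Counter(item for transaction in partition for item in transaction)

def partitioned_counts(transactions, n_parts=2):
    """Split the data, count each partition in parallel, then merge the results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = list(pool.map(count_items, parts))
    return sum(partial, Counter())

# Hypothetical transactions.
transactions = [["milk", "bread"], ["milk"], ["bread", "jelly"], ["milk", "jelly"]]
counts = partitioned_counts(transactions)

# Incremental update: merge counts for new transactions only.
counts += count_items([["milk", "pretzels"]])
```

Counting is a good fit for this pattern because partial counts merge by simple addition, whatever the partitioning.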
Major Issues in Data Mining: Diversity of Database Types
• Relational databases
• Data warehouses
• Complex data objects
• Hypertext and multimedia data
• Spatial data
• Temporal data, etc.
• The Internet, with heterogeneous databases
Hence, we need efficient and effective data mining systems for different databases and different goals.
Summary
• Database technology for database management systems
• Data mining for discovering interesting patterns
• Knowledge discovery process: includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation
• Data warehouse: an organized store of data for decision making
• Data mining functionalities: description, association, classification, prediction, clustering, trend analysis, deviation analysis, similarity analysis
• Data mining systems
Exercises
What is data mining? In your answer, address the following:
• Is it another hype?
• Is it a simple transformation of technology developed from databases, statistics, and machine learning?
• Explain how the evolution of database technology led to data mining.
• Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Exercises
What is data mining? In your answer, address the following:
Data mining refers to the process or method that extracts or "mines" interesting knowledge or patterns from large amounts of data.
• Is it another hype?
  – Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.
Exercises
Is it a simple transformation of technology developed from databases, statistics, and machine learning?
• No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning.
• Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.
Exercises
Explain how the evolution of database technology led to data mining.
• Database technology began with the development of data collection and database creation mechanisms that led to the development of effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity.
Exercises
Describe the steps involved in data mining when viewed as a process of knowledge discovery.
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
Exercises
1. Distinguish between data warehouse and database
2. Applications of Data Mining
3. Architecture and KDD Process of Data Mining
4. Functionalities
5. Challenges and Issues
DATA WAREHOUSE
Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• Further development of data cube technology
• From data warehousing to data mining
What is Data Warehouse?
• Defined in many different ways
  – A decision support database that is maintained separately from the organization's operational database
  – Supports information processing by providing a solid platform of consolidated, historical data for analysis
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data Warehouse: Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – Relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  – When data is moved to the warehouse, it is converted
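The conversion step can be sketched with two hypothetical sources that disagree on naming conventions, encoding and units. The field names (`cust_name`, `customer`, `sex`, `height_in`) and the target schema are assumptions made for illustration.

```python
def from_source_a(rec):
    """Source A (assumed): name in 'cust_name', gender as 'M'/'F', height in cm."""
    return {"name": rec["cust_name"],
            "gender": rec["gender"],
            "height_cm": rec["height_cm"]}

def from_source_b(rec):
    """Source B (assumed): name in 'customer', sex as 1/0, height in inches."""
    return {"name": rec["customer"],
            "gender": "M" if rec["sex"] == 1 else "F",
            "height_cm": round(rec["height_in"] * 2.54, 1)}

# Records are converted to one naming, encoding and unit convention on load.
warehouse_rows = [
    from_source_a({"cust_name": "Smith", "gender": "F", "height_cm": 165.0}),
    from_source_b({"customer": "Jones", "sex": 1, "height_in": 70}),
]
```

After conversion every row has the same keys, the same gender encoding and the same unit, so analysis code never needs to know which source a row came from.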
Data Warehouse: Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational database: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a "time element"
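The difference in key structure can be sketched with two dictionaries. The customer id, dates and balances are hypothetical; the point is only that the warehouse key carries a time element while the operational key does not.

```python
from datetime import date

operational = {}  # key: customer id -> current value only
warehouse = {}    # key: (customer id, date) -> value as of that date

def record_balance(customer_id, day, balance):
    """Overwrite the operational value; add a dated row to the warehouse."""
    operational[customer_id] = balance
    warehouse[(customer_id, day)] = balance

record_balance("C1", date(2020, 1, 31), 100)
record_balance("C1", date(2020, 2, 29), 250)
```

The operational store ends up with one current balance, while the warehouse keeps both dated balances and can answer questions about history.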
Data Warehouse: Non-Volatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only two operations in data accessing: initial loading of data and access of data
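A store that offers only loading and reading can be sketched as a small class. The class name `Warehouse` and its two methods are hypothetical; the sketch simply omits any update or delete operation, which is the property the slide describes.

```python
class Warehouse:
    """Toy non-volatile store: rows can be loaded and read, never updated."""

    def __init__(self):
        self._rows = ()  # immutable tuple: no in-place modification

    def load(self, rows):
        """Loading: append rows transformed from the operational environment."""
        self._rows += tuple(rows)

    def query(self, predicate):
        """Access: return the rows that satisfy `predicate`."""
        return [r for r in self._rows if predicate(r)]

# Hypothetical sales rows: (year, product, amount).
wh = Warehouse()
wh.load([(2019, "PC", 120), (2020, "PC", 150), (2020, "phone", 90)])
sales_2020 = wh.query(lambda row: row[0] == 2020)
```

Because there is no update path, there is nothing for concurrent writers to conflict over, which is why recovery and concurrency control are not needed here.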