Data Mining: Concepts and Techniques. Introduction

Introduction

  • Motivation: Why data mining?
  • What is data mining?
  • Data mining: On what kind of data?
  • Data mining functionality
  • Are all the patterns interesting?
  • Classification of data mining systems
  • Major issues in data mining

Why Data Mining?

  • The explosive growth of data: from terabytes to petabytes
      • Data collection and data availability
          • Automated data collection tools, database systems, Web, computerized society
      • Major sources of abundant data
          • Business: Web, e-commerce, transactions, stocks, …
          • Science: Remote sensing, bioinformatics, scientific simulation, …
          • Society and everyone: news, digital cameras, …
  • We are drowning in data, but starving for knowledge!
  • “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Evolution of Database Technology

  • 1960s:
      • Data collection, database creation, IMS and network DBMS
  • 1970s:
      • Relational data model, relational DBMS implementation
  • 1980s:
      • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
      • Application-oriented DBMS (spatial, scientific, engineering, etc.)
  • 1990s:
      • Data mining, data warehousing, multimedia databases, and Web databases
  • 2000s:
      • Stream data management and mining
      • Data mining and its applications
      • Web technology (XML, data integration) and global information systems

What Is Data Mining?

  • Data mining (knowledge discovery from data)
      • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  • Alternative name
      • Knowledge discovery in databases (KDD)
  • Watch out: Is everything “data mining”?
      • Query processing
      • Expert systems or statistical programs

Why Data Mining? Potential Applications

  • Data analysis and decision support
      • Market analysis and management
          • Target marketing, customer relationship management (CRM), market basket analysis, market segmentation
      • Risk analysis and management
          • Forecasting, customer retention, quality control, competitive analysis
      • Fraud detection and detection of unusual patterns (outliers)

Why Data Mining? Potential Applications

  • Other applications
      • Text mining (news groups, email, documents) and Web mining
      • Stream data mining
      • Bioinformatics and bio-data analysis

Market Analysis and Management

  • Where does the data come from?
      • Credit card transactions, discount coupons, customer complaint calls
  • Target marketing
      • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
      • Determine customer purchasing patterns over time

Market Analysis and Management

  • Cross-market analysis
      • Associations/correlations between product sales, and prediction based on such associations
  • Customer profiling
      • What types of customers buy what products
  • Customer requirement analysis
      • Identifying the best products for different customers
      • Predicting what factors will attract new customers

Fraud Detection & Mining Unusual Patterns

  • Approaches: Clustering & model construction for frauds, outlier analysis
  • Applications: Health care, retail, credit card service, telecomm.
      • Medical insurance
          • Professional patients, and rings of doctors
          • Unnecessary or correlated screening tests
      • Telecommunications
          • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
      • Retail industry
          • Analysts estimate that 38% of retail shrink is due to dishonest employees
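The "deviate from an expected norm" idea in the phone call model can be sketched in a few lines. This is only an illustration: the z-score rule, the threshold of 2.0, the function name `deviating_calls`, and the call durations are all assumptions made for the example, not something the slides specify.

```python
from statistics import mean, pstdev
from typing import List


def deviating_calls(durations: List[float], threshold: float = 2.0) -> List[float]:
    """Return the durations lying more than `threshold` standard deviations from the mean."""
    if len(durations) < 2:
        return []  # too little data to define a norm
    mu = mean(durations)
    sigma = pstdev(durations)  # population standard deviation
    if sigma == 0:
        return []  # every call looks the same, nothing deviates
    return [d for d in durations if abs(d - mu) / sigma > threshold]


# Hypothetical call durations in minutes, invented for illustration.
durations_min = [10, 12, 11, 13, 12, 95]
flagged = deviating_calls(durations_min)
```

A real phone call model would also use the destination and the time of day or week, and would likely need a more robust notion of "norm" than a single mean.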

Other Applications

  • Internet Web Surf-Aid
      • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process

  • Data mining: the core of the knowledge discovery process

[Figure: the KDD process as a pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Steps of a KDD Process

  • Learning the application domain
      • Relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of effort!)
  • Data reduction and transformation
      • Find useful features, dimensionality/variable reduction
  • Choosing functions of data mining
      • Summarization, classification, regression, association, clustering
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
      • Visualization, transformation, removing redundant patterns, etc.
  • Use of discovered knowledge
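The steps above can be read as a chain of small functions, each feeding the next. The sketch below is a toy arrangement under stated assumptions: the function names, the record fields (`income`, `spend`), the data, and the use of a plain mean as the "mining" step are all hypothetical stand-ins chosen to keep the example short.

```python
from statistics import mean
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def select(records: List[Record], fields: List[str]) -> List[Record]:
    """Data selection: keep only the task-relevant fields."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records: List[Record], field: str) -> List[Record]:
    """Transformation: rescale one field to the [0, 1] range."""
    if not records:
        return []
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


def mine(records: List[Record], field: str) -> Dict[str, object]:
    """Mining step (a stand-in): summarize the field by its mean."""
    return {"field": field, "mean": mean(r[field] for r in records)}


def evaluate(pattern: Dict[str, object], min_mean: float) -> bool:
    """Pattern evaluation: keep the pattern only if it passes a threshold."""
    return pattern["mean"] >= min_mean


# Hypothetical raw records, invented for illustration.
raw = [
    {"id": 1.0, "income": 20.0, "spend": 5.0},
    {"id": 2.0, "income": None, "spend": 7.0},
    {"id": 3.0, "income": 40.0, "spend": 9.0},
    {"id": 4.0, "income": 60.0, "spend": 1.0},
]
cleaned = clean(select(raw, ["income", "spend"]))
pattern = mine(transform(cleaned, "income"), "income")
keep = evaluate(pattern, min_mean=0.4)
```

In practice each stage is far larger than the mining call itself, which is in line with the slide's remark that cleaning and preprocessing may take most of the effort.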

Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases, Data Warehouse. A Knowledge-base is shown alongside the layers.]

Data Mining: On What Kinds of Data?

  • Relational database
  • Data warehouse
  • Transactional database
  • Advanced database and information repository
      • Spatial and temporal data
      • Time-series data
      • Stream data
      • Multimedia database
      • Text databases & WWW

Data Mining Functionalities

  • Concept description: Characterization and discrimination
      • Generalize, summarize, and contrast data characteristics
  • Association (correlation and causality)
      • Diaper → Beer [0.5%, 75%]
  • Classification and prediction
      • Construct models (functions) that describe and distinguish classes or concepts for future prediction
      • Presentation: decision-tree, classification rule, neural network
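Reading the bracketed pair on the Diaper → Beer rule as [support, confidence], the usual convention for association rules, both numbers can be computed directly from a list of transactions. The sketch below is a minimal illustration: the function names and the four toy baskets are assumptions made for the example and do not reproduce the 0.5% / 75% figures on the slide.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    a = set(antecedent)
    both = a | set(consequent)
    sup_a = support(transactions, a)
    if sup_a == 0.0:
        return 0.0  # the rule never fires, so its confidence is undefined; report 0
    return support(transactions, both) / sup_a


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
rule_support = support(baskets, {"diaper", "beer"})            # 2 of 4 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})    # 2 of 3 diaper baskets
```

Support says how often the rule applies at all; confidence says how often the consequent follows when the antecedent is present.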

Data Mining Functionalities

  • Cluster analysis
      • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
      • Maximizing intra-class similarity & minimizing interclass similarity
  • Outlier analysis
      • Outlier: a data object that does not comply with the general behavior of the data
      • Useful in fraud detection, rare events analysis
  • Trend and evolution analysis
      • Trend and deviation: regression analysis
      • Sequential pattern mining, periodicity analysis
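Grouping unlabeled data so that members of a group are similar to each other can be illustrated with a one-dimensional k-means loop. The slides do not name a clustering algorithm; k-means is picked here only because it is short, and the function name `kmeans_1d`, the house prices, and the starting centers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def kmeans_1d(
    values: Sequence[float],
    centers: Sequence[float],
    iterations: int = 10,
) -> Tuple[List[float], List[List[float]]]:
    """Alternate between assigning values to the nearest center and re-centering."""
    centers = list(centers)
    clusters: List[List[float]] = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands), invented for illustration.
prices = [100, 110, 120, 300, 310, 320]
final_centers, groups = kmeans_1d(prices, centers=[100, 320])
```

No class labels are supplied anywhere: the two groups emerge from the distances alone, which is the sense in which clustering forms "new classes".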

Are All the “Discovered” Patterns Interesting?

  • Data mining may generate thousands of patterns: not all of them are interesting
      • Suggested approach: Human-centered, query-based, focused mining
  • Interestingness measures
      • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  • Objective vs. subjective interestingness measures
      • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
      • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Algorithm, Visualization, and Other Disciplines]

Data Mining: Classification Schemes

  • Different views, different classifications
      • Kinds of data to be mined
      • Kinds of knowledge to be discovered
      • Kinds of techniques utilized
      • Kinds of applications adapted

Multi-Dimensional View of Data Mining

  • Data to be mined
      • Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, WWW
  • Knowledge to be mined
      • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
      • Multiple/integrated functions and mining at multiple levels

Multi-Dimensional View of Data Mining

  • Techniques utilized
      • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
  • Applications adapted
      • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing

  • Coupling of data mining systems, DBMS, and data warehouse systems
  • On-line analytical mining of data
      • Integration of mining and OLAP technologies
  • Interactive mining of multi-level knowledge
      • Necessity of mining knowledge and patterns at different levels of abstraction
  • Integration of multiple mining functions
      • Characterized classification, first clustering and then association
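Looking at the same facts at different levels of abstraction is what an OLAP roll-up does, and a small aggregation function is enough to show the idea. This is a sketch under assumptions: the `roll_up` helper, the (country, city, product, amount) layout, and the sales figures are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical sales facts: (country, city, product, amount).
sales: List[Tuple[str, str, str, float]] = [
    ("Canada", "Vancouver", "beer", 120.0),
    ("Canada", "Toronto", "beer", 80.0),
    ("Canada", "Vancouver", "diaper", 50.0),
    ("USA", "Chicago", "beer", 200.0),
]


def roll_up(
    facts: Sequence[Tuple[str, str, str, float]],
    key_indexes: Sequence[int],
) -> Dict[Tuple[str, ...], float]:
    """Sum the amount (last column) grouped by the chosen dimension columns."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for fact in facts:
        key = tuple(fact[i] for i in key_indexes)
        totals[key] += fact[-1]
    return dict(totals)


by_city = roll_up(sales, (0, 1))    # finer level: country and city
by_country = roll_up(sales, (0,))   # coarser level: country only
```

A mining function could then be run on either view, for example looking for associations per city and then checking whether they still hold per country.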

Major Issues in Data Mining

  • Mining methodology
      • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
      • Performance: efficiency, effectiveness, and scalability
      • Pattern evaluation: the interestingness problem
      • Incorporation of background knowledge
      • Handling noise and incomplete data
      • Parallel, distributed and incremental mining methods
      • Integration of the discovered knowledge with existing knowledge: knowledge fusion

Major Issues in Data Mining

  • User interaction
      • Data mining query languages and ad-hoc mining
      • Expression and visualization of data mining results
      • Interactive mining of knowledge at multiple levels of abstraction
  • Applications and social impacts
      • Domain-specific data mining & invisible data mining
      • Protection of data security, integrity, and privacy

Summary

  • Data mining: discovering interesting patterns from large amounts of data
  • A natural evolution of database technology, in great demand, with wide applications
  • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  • Mining can be performed in a variety of information repositories
  • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
  • Data mining systems and architectures
  • Major issues in data mining

Where to Find References?

  • More conferences on data mining
      • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
  • Data mining and KDD
      • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
      • Journals: Data Mining and Knowledge Discovery, KDD Explorations
  • Database systems
      • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
      • Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
  • AI & Machine Learning
      • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
      • Journals: Machine Learning, Artificial Intelligence, etc.
  • Statistics
      • Conferences: Joint Stat. Meeting, etc.
      • Journals: Annals of Statistics, etc.
  • Visualization
      • Conference proceedings: CHI, ACM-SIGGraph, etc.
      • Journals: IEEE Trans. Visualization and Computer Graphics, etc.