Data Mining Page 1

Syllabus

Week     Material
1        Introduction
2        Data Warehouse & OLAP
3        Data Preprocessing
4        Data Mining Languages
5        Concept Description
6        Statistics
7-8      Association Rules
9-10     Classification
11-12    Cluster Analysis
13-14    Mining Complex Data
15       Applications

• Midterm 3/2/04 • Project due 4/29/04 • Final 5/6/04 • No late submissions are allowed Page 2

Textbook and Other Reading Materials • Textbook: Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2001 • Other texts that I may use from time to time: – Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, Pearson Education, Inc., 2003 – Principles of Data Mining by David Hand, Heikki Mannila, and Padhraic Smyth, MIT Press, 2001 • Papers: VLDB, SIGMOD, and SIGKDD Proceedings Page 3

Introduction • Motivation. • What is data mining? • Data mining functionality • Are all the patterns interesting? • Classification of data mining systems Page 4

Motivation: • The huge number of databases and web pages makes information extraction next to impossible (remember the favored statement: I will bury them in data!) • Inability of many other disciplines (statistics, AI, information retrieval) to provide scalable algorithms for extracting information and/or rules from databases • Necessity to find relationships among data Page 5

Appetizer • Consider a file consisting of 24471 records. The file contains at least two condition attributes, A and D:

A/D        0      1    total
0       9272    232     9504
1      14695    272    14967
Total  23967    504    24471

Page 6

Appetizer (con’t) • Probability that a person has A: P(A) = 0.6; probability that a person has D: P(D) = 0.02 • Conditional probability that a person has D given that it has A: P(D|A) = P(AD)/P(A) = (272/24471)/0.6 = 0.02 • P(A|D) = P(AD)/P(D) = (272/24471)/0.02 = 0.56 (0.54 when the unrounded P(D) = 504/24471 is used) • What can we say about dependencies between A and D? Page 7
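As a minimal sketch, the probabilities on this slide can be recomputed directly from the A/D contingency table. The variable names are illustrative and not part of the original slides. Comparing P(AD) with P(A)*P(D) is one simple way to approach the dependency question: under independence the two would be equal.

```python
# Counts from the A/D contingency table; keys are (A, D).
counts = {
    (0, 0): 9272, (0, 1): 232,
    (1, 0): 14695, (1, 1): 272,
}
n = sum(counts.values())

p_a = (counts[(1, 0)] + counts[(1, 1)]) / n   # P(A)
p_d = (counts[(0, 1)] + counts[(1, 1)]) / n   # P(D)
p_ad = counts[(1, 1)] / n                     # P(AD)

p_d_given_a = p_ad / p_a                      # P(D|A)
p_a_given_d = p_ad / p_d                      # P(A|D)

# Equals 1 when A and D are independent; values away from 1 suggest dependence.
ratio = p_ad / (p_a * p_d)

print(f"P(A)={p_a:.3f}  P(D)={p_d:.3f}  P(AD)={p_ad:.4f}")
print(f"P(D|A)={p_d_given_a:.3f}  P(A|D)={p_a_given_d:.3f}  ratio={ratio:.3f}")
```

Working from unrounded counts avoids the small drift that appears when the rounded values 0.6 and 0.02 are plugged into the conditional probabilities.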

Appetizer (3) • So far we have not asked anything that statistics would not have asked. So is data mining just another word for statistics? • We hope that the response will be a resounding NO • The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random • The second difference is the size of the data set • The third difference is that statistical samples do not contain “dirty” data Page 8

STATISTICS is NOT DATA MINING • Originally, data mining was a statistician's term for overusing data to draw possibly wrong inferences. • A famous example of wrong inferences is in parapsychology, on ESP (extrasensory perception) • If too many conclusions are drawn from the data, then some of them will certainly turn out to be true by chance alone. • Data mining is the discovery of UNEXPECTED data correlations Page 9

What Is Data Mining? • Data mining (knowledge discovery in databases): – Extraction of interesting information or patterns from data in large databases • Alternative names and their “inside stories”: – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? – (Deductive) query processing – Expert systems or small ML/statistical programs – Statistics – Artificial Intelligence Page 10

Data Mining: Process • Data mining: the core of the knowledge discovery process. [Figure: KDD process diagram with the labels Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation] Page 11

What Is Data Mining – Steps in the DM Process • Data cleaning, noise removal • Data integration: data warehousing techniques, OLAP • Data relevancy decision • Data transformation (data cube, aggregation and summarization) • Pattern evaluation • Results presentation Page 12

What is DM: Potential Applications • Database analysis and decision support – Market analysis and management • target marketing, customer relation management, market basket analysis, cross selling, market segmentation – Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis – Fraud detection and management • Other Applications – Text mining (news group, email, documents) and Web analysis. – Intelligent query answering Page 13

Market Analysis and Management (1) • Where are the data sources for analysis? – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time – Conversion of single to a joint bank account: marriage, etc. • Cross-market analysis – Associations/co-relations between product sales – Prediction based on the association information Page 14

Market Analysis and Management (2) • Customer profiling – data mining can tell you what types of customers buy what products (clustering or classification) • Identifying customer requirements – identifying the best products for different customers – use prediction to find what factors will attract new customers • Provides summary information – various multidimensional summary reports – statistical summary information (data central tendency and variation) Page 15

Corporate Analysis and Risk Management • Finance planning and asset evaluation – cash flow analysis and prediction – contingent claim analysis to evaluate assets – cross-sectional and time series analysis (financial-ratio, trend analysis, etc. ) • Resource planning: – summarize and compare the resources and spending • Competition: – monitor competitors and market directions – group customers into classes and a class-based pricing procedure – set pricing strategy in a highly competitive market Page 16

Fraud Detection and Management (1) • Applications – widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. • Approach – use historical data to build models of fraudulent behavior and use data mining to help identify similar instances • Examples – auto insurance: detect a group of people who stage accidents to collect on insurance – money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) – medical insurance: detect professional patients and ring of doctors and ring of references Page 17

Fraud Detection and Management (2) • Detecting inappropriate medical treatment • Detecting telephone fraud – Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. – British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. • Retail – Analysts estimate that 38% of retail shrink is due to dishonest employees. Page 18

Other Applications • Sports – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid – IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. Page 19

Architecture of a Typical Data Mining System [Figure: architecture diagram with the components Graphical user interface; Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Data cleaning & data integration; Filtering; Databases; Data Warehouse] Page 20

Data Mining System Architecture • Database, data warehouse, data files: the set of data to be mined. Data cleaning and data integration may be performed at this stage • Database or data warehouse server: responsible for fetching relevant data. How to define relevancy? • Knowledge Base: domain knowledge that drives the search for patterns. Concept hierarchy, user beliefs, interestingness constraints • Data Mining Engine: functional algorithms that perform the search for domain experts • Pattern Evaluation: uses the knowledge base and other methods to narrow the search for domain patterns • GUI: communicator between users and the data mining system Page 21

Data Mining: On What Kind of Data? • Relational databases – Universal relation vs. multirelational search • Data warehouses • Transactional databases • Advanced DB and information repositories – Object-oriented and object-relational databases – Spatial databases – Time-series data and temporal data – Text databases and multimedia databases – Heterogeneous and legacy databases – WWW Page 22

Data Mining: On What Kind of Data? • Attribute Types: – Categorical: an attribute that has a finite number of values – Ordinal: attributes that can be ordered by their values • Attribute Transformations: – Continuous: an attribute that may have an infinite but countable set of values. These attributes can always be ordered – Interval scale – Boolean • Nominal: attributes that cannot be ordered by their values – Operational: for example, a measurement of programming productivity as a*m*(n+m)*log(a+b)/(2b), where a is the number of unique operators, b is the number of unique operands, n is the total number of operator occurrences, and m is the total number of operand occurrences Page 23

Data Mining Tasks • Association (correlation and causality) – Multi-dimensional vs. single-dimensional association – age(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 2%, confidence = 60%] – contains(T, “computer”) -> contains(T, “software”) [1%, 75%] – What is support? The percentage of the tuples in the database that have age between 20 and 29, income between 20K and 29K, and buy a PC – What is confidence? The probability that a person who is between 20 and 29 with income between 20K and 29K buys a PC • Clustering (grouping data that are close together into the same cluster) • What does “close together” mean? Page 24
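The definitions of support and confidence for the age/income/PC rule can be sketched in a few lines. The ten records below are hypothetical toy data invented for illustration (they are not from the slides), and `antecedent` is an illustrative helper name.

```python
# Hypothetical toy records: (age, income in thousands, buys_pc). Not from the slides.
people = [
    (23, 25, True),
    (27, 22, True),
    (25, 28, False),
    (35, 24, True),
    (22, 40, False),
    (45, 60, True),
    (29, 20, True),
    (31, 27, False),
    (24, 29, False),
    (52, 35, False),
]

def antecedent(age, income):
    """Left-hand side of the rule: age in 20..29 and income in 20K..29K."""
    return 20 <= age <= 29 and 20 <= income <= 29

n_total = len(people)
n_antecedent = sum(1 for age, inc, _ in people if antecedent(age, inc))
n_both = sum(1 for age, inc, buys in people if antecedent(age, inc) and buys)

# Support: fraction of all tuples satisfying both sides of the rule.
support = n_both / n_total
# Confidence: fraction of tuples satisfying the left side that also satisfy the right side.
confidence = n_both / n_antecedent

print(f"support={support:.2f} confidence={confidence:.2f}")
```

Note that support is measured against the whole database, while confidence is measured only against the tuples that match the rule's left-hand side.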

Distances between data • Distance between data is a measure of dissimilarity between data: d(i, j) >= 0; d(i, j) = d(j, i); d(i, j) <= d(i, k) + d(k, j) • Euclidean distance between <x1, x2, …, xk> and <y1, y2, …, yk>: d(x, y) = sqrt(Sum_i (xi - yi)^2) • Standardize variables by finding the standard deviation of X and dividing each xi by it • Covariance(X, Y) = (1/k) * Sum_i (xi - mean(x)) * (yi - mean(y)) • Boolean variables and their distances Page 25
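The distance-related quantities on this slide can be sketched as plain Python functions. Two points are assumptions rather than statements from the slide: the standard deviation divides by k (to match the covariance formula), and the Boolean distance shown is the simple matching distance, which is only one of several possible choices.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation (divides by k; an assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def standardize(values):
    """Divide each value by the standard deviation, as the slide describes."""
    s = std_dev(values)
    return [v / s for v in values]

def covariance(x, y):
    """(1/k) * Sum_i (xi - mean(x)) * (yi - mean(y))."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

def simple_matching_distance(x, y):
    """Fraction of positions where two Boolean vectors disagree (assumed choice)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi) / len(x)

if __name__ == "__main__":
    print(euclidean((0, 0), (3, 4)))
    print(covariance([1, 2, 3], [2, 4, 6]))
    print(simple_matching_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Standardizing before computing distances keeps a variable with a large numeric range from dominating the result.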

Data Mining Tasks • Outlier analysis – Outlier: a data object that does not comply with the general behavior of the data – It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis – Trend and deviation: regression analysis – Sequential pattern mining, periodicity analysis – Similarity-based analysis • Other pattern-directed or statistical analyses Page 26

Are All the “Discovered” Patterns Interesting? • A data mining system/query may generate thousands of patterns, and not all of them are interesting. – Suggested approach: human-centered, query-based, focused mining • Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm • Objective vs. subjective interestingness measures: – Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. – Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. Page 27

Are All the “Discovered” Patterns Interesting? - Example

tea \ coffee     0     1   total
0                5    70      75
1                5    20      25

The conditional probability that if one buys coffee, one also buys tea is 20/90 = 2/9. The conditional probability that if one buys tea, she also buys coffee is 20/25 = 0.8. However, the probability that she buys coffee is 0.9. So, is it a significant inference that if a customer buys tea she also buys coffee? Are buying tea and buying coffee independent activities? Page 28
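A short sketch of the tea/coffee example makes the comparison explicit. The counts come from the table on this slide. The name `lift` for the ratio P(coffee|tea)/P(coffee) is a commonly used term that the slide itself does not use.

```python
# Counts from the tea/coffee table; keys are (tea, coffee).
counts = {(0, 0): 5, (0, 1): 70, (1, 0): 5, (1, 1): 20}
n = sum(counts.values())

n_tea = counts[(1, 0)] + counts[(1, 1)]
n_coffee = counts[(0, 1)] + counts[(1, 1)]
n_both = counts[(1, 1)]

p_coffee = n_coffee / n                 # P(coffee)
p_coffee_given_tea = n_both / n_tea     # P(coffee | tea)
p_tea_given_coffee = n_both / n_coffee  # P(tea | coffee)

# A ratio below 1 means tea buyers are less likely than the average
# customer to buy coffee, despite the high conditional probability.
lift = p_coffee_given_tea / p_coffee

print(f"P(coffee)={p_coffee:.2f}  P(coffee|tea)={p_coffee_given_tea:.2f}")
print(f"P(tea|coffee)={p_tea_given_coffee:.3f}  lift={lift:.3f}")
```

The rule "tea implies coffee" has high confidence only because coffee is bought by almost everyone. Knowing that a customer buys tea actually lowers the estimated chance that she buys coffee.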

How to measure Interestingness • RI = |XY| - |X||Y|/N • Support and confidence of X->Y: support = |XY|/N, confidence = |XY|/|X| • Chi^2: (|XY| - E(|XY|))^2 / E(|XY|) • J(X->Y) = P(Y) * [P(X|Y) * log(P(X|Y)/P(X)) + (1 - P(X|Y)) * log((1 - P(X|Y))/(1 - P(X)))] • Sufficiency(X->Y) = P(X|Y)/P(X|!Y); Necessity(X->Y) = P(!X|Y)/P(!X|!Y). Interestingness of Y->X is NC++ = 1 - N(X->Y)*P(Y) if N(…) is less than 1, or 0 otherwise Page 29
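The first four measures can be sketched as functions of the counts |XY|, |X|, |Y|, and N. Several details are assumptions, since the slide does not specify them: E(|XY|) is taken to be the count expected under independence, |X||Y|/N; the logarithm is the natural log; and the function names are illustrative. The example values use X = tea and Y = coffee, with counts 20, 25, 90, and 100.

```python
import math

def rule_interest(n_xy, n_x, n_y, n):
    """RI = |XY| - |X||Y|/N."""
    return n_xy - n_x * n_y / n

def support(n_xy, n):
    """Support of X->Y: |XY|/N."""
    return n_xy / n

def confidence(n_xy, n_x):
    """Confidence of X->Y: |XY|/|X|."""
    return n_xy / n_x

def chi2_term(n_xy, n_x, n_y, n):
    """(|XY| - E)^2 / E, with E = |X||Y|/N assumed (independence)."""
    expected = n_x * n_y / n
    return (n_xy - expected) ** 2 / expected

def j_measure(n_xy, n_x, n_y, n):
    """J(X->Y) as written on the slide; natural log is an assumption.

    Requires 0 < P(X) < 1, otherwise a division by zero occurs.
    """
    p_y = n_y / n
    p_x = n_x / n
    p_x_given_y = n_xy / n_y

    def term(p, q):
        # Convention: a zero-probability term contributes nothing.
        return 0.0 if p == 0 else p * math.log(p / q)

    return p_y * (term(p_x_given_y, p_x) + term(1 - p_x_given_y, 1 - p_x))

if __name__ == "__main__":
    n_xy, n_x, n_y, n = 20, 25, 90, 100  # X = tea, Y = coffee
    print("RI        ", rule_interest(n_xy, n_x, n_y, n))
    print("support   ", support(n_xy, n))
    print("confidence", confidence(n_xy, n_x))
    print("chi2 term ", chi2_term(n_xy, n_x, n_y, n))
    print("J         ", j_measure(n_xy, n_x, n_y, n))
```

For these counts RI is negative, which reflects that tea and coffee co-occur less often than independence would predict.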

Can We Find All and Only Interesting Patterns? • Find all the interesting patterns: Completeness – Can a data mining system find all the interesting patterns? – Association vs. classification vs. clustering • Search for only interesting patterns: Optimization – Can a data mining system find only the interesting patterns? – Approaches • First generate all the patterns and then filter out the uninteresting ones. • Generate only the interesting patterns: mining query optimization Page 30

A Multi-Dimensional View of Data Mining Classification • Databases to be mined – Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc. • Knowledge to be mined – Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. – Multiple/integrated functions and mining at multiple levels • Techniques utilized – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc. • Applications adapted – Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc. Page 31

OLAP Mining: An Integration of Data Mining and Data Warehousing • Data mining systems, DBMS, Data warehouse systems coupling – No coupling, loose-coupling, semi-tight-coupling, tight-coupling • On-line analytical mining data – integration of mining and OLAP technologies • Interactive mining multi-level knowledge – Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. • Integration of multiple mining functions – Characterized classification, first clustering and then association Page 32

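An OLAM Architecture [Figure: four-layer OLAM architecture. Layer 4, User Interface: User GUI API, with the mining query going in and the mining result coming out. Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, with the Data Cube API. Layer 2, MDDB: MDDB and Meta Data, with Filtering & Integration and the Database API. Layer 1, Data Repository: Databases and Data Warehouse, with Filtering, Data cleaning, and Data integration.] Page 33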

Major Issues in Data Mining (1) • Mining methodology and user interaction – Mining different kinds of knowledge in databases – Interactive mining of knowledge at multiple levels of abstraction – Incorporation of background knowledge – Data mining query languages and ad-hoc data mining – Expression and visualization of data mining results – Handling noise and incomplete data – Pattern evaluation: the interestingness problem • Performance and scalability – Efficiency and scalability of data mining algorithms – Parallel, distributed and incremental mining methods Page 34

Major Issues in Data Mining (2) • Issues relating to the diversity of data types – Handling relational and complex types of data – Mining information from heterogeneous databases and global information systems (WWW) • Issues related to applications and social impacts – Application of discovered knowledge • Domain-specific data mining tools • Intelligent query answering • Process control and decision making – Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem – Protection of data security, integrity, and privacy Page 35

Summary • Data mining: discovering interesting patterns from large amounts of data • A natural evolution of database technology, in great demand, with wide applications • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation • Mining can be performed in a variety of information repositories • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. • Classification of data mining systems • Major issues in data mining Page 36