DATA MINING Introductory and Advanced Topics Part I


DATA MINING Introductory and Advanced Topics Part I Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Modified by Elizabeth Leon © Prentice Hall 1

Data Mining Outline
• PART I
  – Introduction
  – Related Concepts
  – Data Mining Techniques
• PART II
  – Classification
  – Clustering
  – Association Rules
• PART III
  – Web Mining
  – Spatial Mining
  – Temporal Mining

Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining and KDD
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Huge DBs contain a wealth of info that is still not fully exploited (valuable info, "gold", may be lurking within the data)
• Users expect more sophisticated information
• How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): discovering useful info and knowledge from huge data repositories (patterns, associations, etc.)
• Data Mining: Intelligent methods for extracting knowledge (digging for gold).

KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

KDD Process Ex: Web Log
• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs
  – Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns
  – Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization
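The transformation, data mining, and interpretation steps of the web log example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the 30-minute session timeout, and the helper names `sessionize` and `count_page_pairs` are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned log records: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 60, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u1", 5000, "/home"), ("u1", 5030, "/products"),
]

SESSION_GAP = 1800  # assumed timeout: a gap of more than 30 minutes starts a new session


def sessionize(log, gap=SESSION_GAP):
    """Transformation step: sort by user and time, then group visits into sessions."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_page_pairs(sessions):
    """Data mining step: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(LOG)
pair_counts = count_page_pairs(sessions)
# Interpretation step: list the most frequently accessed page sequences first.
print(pair_counts.most_common(2))
```

A counter of page pairs is the simplest possible "data structure" for this step; longer sequences would need a richer structure such as a trie.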

Data Mining
is not…
• Searching for a phone number in a phone book
• Searching for keywords on Google
• Generating a histogram of salaries for different age groups
is…
• Finding groups of people with similar hobbies
• Are chances of getting cancer higher if you live near a power line?

Data Mining
• Fit data to a model
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning

Data Mining Algorithm
• Objective: Fit Data to a Model
  – Descriptive
  – Predictive
• Preference: Technique to choose the best model
• Search: Technique to search the data
  – “Query”

Database Processing vs. Data Mining Processing
Database Processing
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database
Data Mining Processing
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Query Examples
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)

[Figure: Data Mining Models and Tasks]

Basic Data Mining Tasks
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential patterns.

Basic Data Mining Tasks (cont’d)
• Discovering association relationships/correlations among a set of items in the form of rules: X → Y (DB tuples satisfying X are likely to satisfy Y)
Association Applications
– Maintenance services: 98% of the people who buy tires and car accessories also get maintenance services
– Web page recommendations (URL1 & URL3 → URL5): 60% of the web users who visit pages A and B bought item T1
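Percentages such as the 98% above are usually read as the confidence of a rule X → Y: the fraction of transactions containing X that also contain Y. The sketch below computes support and confidence over a small, hypothetical set of baskets; the data and the function names `support` and `confidence` are assumptions made for illustration, and the terms themselves are standard association-rule vocabulary rather than wording from this slide.

```python
# Hypothetical market baskets, one set of items per transaction.
TRANSACTIONS = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"tires", "service"},
    {"accessories"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


rule_support = support({"tires", "accessories", "service"}, TRANSACTIONS)
rule_confidence = confidence({"tires", "accessories"}, {"service"}, TRANSACTIONS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```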

Basic Data Mining Tasks (cont’d)
• Regression is used to map a data item to a real valued prediction variable.
• Classification maps data into predefined groups or classes
  – Supervised learning
  – Pattern recognition
  – Prediction

Classification Process (Step 1): Model Construction (2-class problem)
Training Data → Classification Algorithms → Classifier (Model)
Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (Step 2): Use the Model in Prediction
Testing Data → Classifier
Unseen Data (Jeff, Professor, 4) → Classifier → Tenured?
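The rule from Step 1 can be applied to the unseen tuple in Step 2 with a few lines of Python. This is a minimal sketch: the function name `is_tenured` and the case-insensitive comparison of the rank are assumptions, while the rule itself is the one shown on the model-construction slide.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: rank is compared case-insensitively.
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen tuple from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
prediction = is_tenured(rank, years)
print(name, prediction)
```

Jeff is classified as tenured because the rank condition holds, even though the years condition does not.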

Basic Data Mining Tasks (cont’d)
Classification Applications
• Credit
• Medical diagnosis
• Text mining
• Recommendation of web pages
• Intrusion detection, fraud detection (security)

Basic Data Mining Tasks (cont’d)
• Clustering groups similar data together into clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning
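As a rough illustration of grouping similar data without predefined classes, the sketch below runs a one-dimensional k-means on a handful of ages. The choice of k-means, the data, the starting centers, and the name `kmeans_1d` are all assumptions for the example; the slide does not prescribe a particular clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning each value to its nearest center and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


ages = [21, 23, 25, 47, 50, 52]  # hypothetical customer ages
centers, clusters = kmeans_1d(ages, centers=[20.0, 60.0])
print(centers, clusters)
```

No class labels are supplied anywhere: the two groups emerge only from the distances between values, which is what makes the task unsupervised.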

Basic Data Mining Tasks (cont’d)
Clustering Applications
• Image processing (segmenting color images into regions)
• Indexing text and images
• WWW
  – Web page classification (used by search engines such as Google)
  – Grouping web log entries to discover patterns of similar access (web usage profiles)

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application

Social Implications of DM
• Privacy
• Profiling
• Unauthorized use

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use

Related Concepts Outline
Goal: Examine some areas which are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching

DB & OLTP Systems
• Schema
  – (ID, Name, Address, Salary, JobNo)
• Data Model
  – ER
  – Relational
• Transaction
• Query: SELECT Name FROM T WHERE Salary > 100000
• DM: Only imprecise queries
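The precise query on this slide can be run against an in-memory SQLite table using Python's standard `sqlite3` module. The sketch assumes column types and three made-up rows; only the schema column names, the table name T, and the SELECT statement come from the slide (an ORDER BY clause is added here so the output order is deterministic).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema from the slide; the column types are assumptions.
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "1 Main St", 120000, 10),
        (2, "Bob", "2 Oak Ave", 80000, 20),
        (3, "Cy", "3 Elm Rd", 150000, 10),
    ],
)
# A well-defined query returns a precise subset of the stored data.
names = [
    row[0]
    for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY ID")
]
conn.close()
print(names)
```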

Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real valued function with output in the range [0, 1].
• f(x): Probability x is in F.
• 1 - f(x): Probability x is not in F.
• EX:
  – T = {x | x is a person and x is tall}
  – Let f(x) be the probability that x is tall
  – Here f is the membership function
• DM: Prediction and classification are fuzzy.
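One possible membership function for the "tall" example is a linear ramp between two heights. The thresholds of 160 cm and 190 cm, the linear shape, and the name `tall_membership` are assumptions chosen only to make the idea concrete; any real valued function with output in [0, 1] would qualify.

```python
def tall_membership(height_cm: float, low: float = 160.0, high: float = 190.0) -> float:
    """Degree of membership in the set T of tall people, in the range [0, 1]."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two assumed thresholds.
    return (height_cm - low) / (high - low)


f_x = tall_membership(175.0)
print(f_x, 1 - f_x)  # membership in T and in its complement
```

A crisp (non-fuzzy) set would instead use a single cutoff and return only 0 or 1.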

[Figure: Fuzzy Sets]

Classification/Prediction is Fuzzy
[Figure: loan amount with Accept and Reject regions, simple vs. fuzzy]

Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about “data mining”.
• DM: Similarity measures; Mine text/Web data.

Information Retrieval (cont’d)
• Similarity: measure of how close a query is to a document.
• Documents which are “close enough” are retrieved.
• Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|
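The two metrics follow directly from set intersection. The sketch below uses hypothetical document identifiers; the function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}   # hypothetical ground truth
retrieved = {"d2", "d3", "d5"}        # hypothetical query result
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were found.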

IR Query Result Measures and Classification
[Figure: IR and Classification panels]

Dimensional Modeling
• View data in a hierarchical manner, more as business executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes; axis for modeling data.
• Facts: data stored
• Ex: Dimensions: products, locations, date. Facts: quantity, unit price
• DM: May view data as dimensional.

[Figure: Relational View of Data]

Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
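Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below assumes a two-level location hierarchy (city within state) and a handful of invented sales facts; the data and the helper name `aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical facts: (product, city, state, quantity).
SALES = [
    ("widget", "Dallas", "TX", 10),
    ("widget", "Austin", "TX", 5),
    ("widget", "Miami", "FL", 7),
    ("gadget", "Dallas", "TX", 3),
]


def aggregate(facts, key):
    """Sum the quantity fact, grouped by whatever level `key` extracts."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)


by_city = aggregate(SALES, key=lambda f: f[1])   # drill down: more specific level
by_state = aggregate(SALES, key=lambda f: f[2])  # roll up: more general level
print(by_city)
print(by_state)
```

In SQL the same two views would be GROUP BY queries over the city and state columns respectively.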

[Figure: Cube view of Data]

[Figure: Aggregation Hierarchies]

[Figure: Star Schema]

Data Warehousing
• “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
• Operational Data: Data used in day to day needs of company.
• Informational Data: Supports other functions such as planning and forecasting.
• Data mining tools often access data warehouses rather than operational data.
• DM: May access data in warehouse.

Operational vs. Informational

                Operational Data      Data Warehouse
Application     OLTP                  OLAP
Use             Precise Queries       Ad Hoc
Temporal        Snapshot              Historical
Modification    Dynamic               Static
Orientation     Application           Business
Data            Operational Values    Integrated
Size            Gigabits              Terabits
Level           Detailed              Summarized
Access          Often                 Less Often
Response        Few Seconds           Minutes
Data Schema     Relational            Star/Snowflake

OLAP
• Online Analytic Processing (OLAP): provides more complex queries than OLTP.
• Online Transaction Processing (OLTP): traditional database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
  – Slice: examine sub-cube.
  – Dice: rotate cube to look at another dimension.
  – Roll Up/Drill Down
• DM: May use OLAP queries.
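A slice can be pictured as fixing one dimension of the cube and keeping the remaining ones. The sketch below represents a tiny cube as a dictionary keyed by (product, location, date); the cell values, the dimension names, and the helper `slice_cube` are assumptions for illustration.

```python
# Hypothetical cube: (product, location, date) -> quantity sold.
CUBE = {
    ("widget", "Dallas", "2002-01"): 10,
    ("widget", "Miami", "2002-01"): 7,
    ("gadget", "Dallas", "2002-01"): 3,
    ("widget", "Dallas", "2002-02"): 4,
}
DIMENSIONS = ("product", "location", "date")


def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and return the remaining sub-cube."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: cell
        for key, cell in cube.items()
        if key[axis] == value
    }


january = slice_cube(CUBE, "date", "2002-01")
print(january)
```

The result has one fewer dimension than the original cube, which is why a slice of a three-dimensional cube is often drawn as a two-dimensional table.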

OLAP Operations
[Figure: single cell, multiple cells, slice, dice, roll up, drill down]

Statistics
• Simple descriptive models
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
  – Data can actually drive the creation of the model
  – Opposite of traditional statistical view.
• Data mining targeted to business user
• DM: Many data mining methods come from statistical techniques.

Machine Learning
• Machine Learning: area of AI that examines how to write programs that can learn.
• Often used in classification and prediction
• Supervised Learning: learns by example.
• Unsupervised Learning: learns without knowledge of correct answers.
• Machine learning often deals with small static datasets.
• DM: Uses many machine learning techniques.

Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of a predefined pattern in the data.
• Applications include speech recognition, information retrieval, time series analysis.
• DM: Type of classification.
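For a time series, one simple way to look for a predefined pattern is to encode the series as a string and search it. The sketch below assumes daily price moves encoded as "U" (up) and "D" (down) and looks for an up-then-down pattern using Python's standard `re` module; the encoding and the data are hypothetical.

```python
import re

# Hypothetical series of daily price moves: U = up, D = down.
moves = "UUDUDDU"

# Predefined pattern: an up day immediately followed by a down day.
pattern = re.compile(r"UD")

# Start index of every (non-overlapping) occurrence of the pattern.
positions = [match.start() for match in pattern.finditer(moves)]
print(positions)
```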

[Figure: DM vs. Related Topics]