BIT 4514 Database Technology for Business Fall 2019

Data Mining

What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Example Data Mining Questions
• Which of the approximately 30K VT students will buy a new car within the next year?
• Which of our customers would be interested in paying a premium for the latest technology?
• Would the residents of Virginia or North Carolina likely provide more revenue for our business if we chose to relocate?
• Which of our high-risk mortgage applicants are likely to file for bankruptcy?

Origins of Data Mining
• Ideas come from many disciplines, including machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to:
  – Enormity of data
  – High dimensionality of the data
  – Heterogeneous, distributed nature of data
[Figure: diagram relating Data Mining to Statistics, Machine Learning/AI, Pattern Recognition, and Database Systems]

Types of Data Mining Algorithms
• Supervised algorithms (Classification)
  – Learning by example
  – Use training data that has the correct answers (the class label attribute)
  – Create a model by running the algorithm on the training data
  – Identify a class label for incoming new data
• Unsupervised algorithms (Clustering)
  – Do not use training data
  – Classes may not be known in advance

Types of Data Mining Algorithms
• Supervised (Classification)
  – Decision Trees
  – Regression
  – Neural Networks
  – Genetic Algorithms
  – Support Vector Machines
  – K-Nearest Neighbors
  – Association Rules
  – Bayesian Classification
• Unsupervised
  – Clustering

Classification: Description
• Given a collection of records
  – Each record contains a set of attributes; one of the attributes is the dependent variable (the class)
• Find a model that predicts the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be assigned to a class as accurately as possible
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.
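
The split-and-validate procedure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, the 30% test fraction is an arbitrary choice, and `learn_majority_model` is a hypothetical stand-in for a real learning algorithm (it always predicts the most common training class).

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose class the model predicts correctly."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

def learn_majority_model(training_set):
    """Hypothetical learner: always predict the most common training class."""
    labels = [label for _, label in training_set]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda attributes: majority

# Made-up records: (attributes, class label), where the attribute is income in $1000s.
records = [((income,), "buy" if income > 50 else "no") for income in range(10, 110, 10)]

training_set, test_set = train_test_split(records)
model = learn_majority_model(training_set)   # build the model on the training set only
print(len(training_set), len(test_set), accuracy(model, test_set))
```

The point of the sketch is the separation of roles: the model never sees the test records while it is being built, so the accuracy figure estimates how it would do on previously unseen records.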

Classification Approach
[Figure: records with attribute columns labeled categorical, categorical, continuous, and class; Training Set → Learn Classifier → Model; Test Set]

Classification: Example 1
• Direct Marketing
  – Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    • Use the data for a similar product introduced before
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision (a binary attribute) forms the class attribute
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      – Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.

Classification: Example 2
• Fraud Detection
  – Goal: Predict fraudulent cases in credit card transactions.
  – Approach:
    • Use credit card transactions and the associated account-holder information as attributes.
      – When does a customer buy, what does he/she buy, how often does he/she pay on time, etc.
    • Label past transactions as fraud or fair transactions. This forms the class attribute.
    • Train the model
    • Use this model to detect fraud by observing credit card transactions on an account

Classification: Example 3
• Customer Attrition/Churn
  – Goal: Predict whether a customer is likely to be lost to a competitor.
  – Approach:
    • Use detailed records of transactions with each past and present customer to find attributes.
      – How often the customer calls, where he/she calls, what time of day he/she calls most, his/her financial status, marital status, etc.
    • Label the customers as loyal or disloyal
    • Develop a model for loyalty

Classification approach: k-Nearest Neighbor
• Basic idea:
  – Look at characteristics / attributes
  – “If it walks like a duck and quacks like a duck, then it’s probably a duck”
[Figure: compute the distance from the test record to the training records, then choose the k “nearest” records]

Nearest-Neighbor Classifier
• Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the stored training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote, or by weighting each vote by distance)
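
The three classification steps listed on this slide map directly onto a short function. The sketch below assumes Euclidean distance as the metric and a plain majority vote (the slide also allows distance-weighted voting); the stored records are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two records given as tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(stored_records, unknown, k, distance=euclidean):
    """Classify `unknown` from (attributes, label) pairs in `stored_records`."""
    # Step 1: compute the distance to every stored record and sort by it.
    by_distance = sorted(stored_records, key=lambda rec: distance(rec[0], unknown))
    # Step 2: keep the k nearest neighbors.
    nearest = by_distance[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up stored records: (attributes, class label)
stored = [
    ((1.0, 1.0), "duck"), ((1.2, 0.8), "duck"), ((0.9, 1.1), "duck"),
    ((5.0, 5.0), "goose"), ((5.2, 4.9), "goose"),
]
print(knn_classify(stored, (1.1, 1.0), k=3))  # prints "duck"
```

Note that there is no separate training step: the "model" is simply the stored records plus the choice of metric and k.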

Illustration of Nearest Neighbor
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the model is sensitive to noise
  – If k is too large, the neighborhood may include too many points from other classes
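
Both failure modes can be seen on a tiny one-dimensional data set. The points below are made up: class A sits near 0, class B sits near 10, and one noisy B record lies inside the A region. The query at 0.5 should intuitively be class A.

```python
from collections import Counter

def knn_label(points, query, k):
    """Majority label among the k points closest to `query` (1-D attributes)."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

points = [
    (0.0, "A"), (1.0, "A"), (1.5, "A"),
    (0.6, "B"),                      # noisy record inside the A region
    (9.0, "B"), (10.0, "B"), (11.0, "B"), (12.0, "B"), (13.0, "B"),
]

for k in (1, 3, 9):
    print(k, knn_label(points, 0.5, k))
# k=1 -> B  (the single nearest record is the noisy one)
# k=3 -> A  (the two real A neighbors outvote the noise)
# k=9 -> B  (the neighborhood now covers the whole data set, where B is the majority)
```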

Unsupervised Algorithm: Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another
  – Data points in separate clusters are less similar to one another
• Similarity measures:
  – Euclidean distance (if attributes are continuous)
  – Other problem-specific measures
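
The slide does not name a particular clustering algorithm. As one possible illustration, the sketch below uses k-means (an assumption, chosen because it works directly with Euclidean distance on continuous attributes). The data points and the starting centroids are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, iterations=10):
    """Group points around the given starting centroids; no class labels are used."""
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

On this data the first three points end up in one cluster and the last three in the other, matching the goal on the slide: points within a cluster are close to each other and far from points in the other cluster.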

Clustering: Example 1
• Market Segmentation
  – Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle-related information.
    • Find clusters of similar customers
    • Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters

Clustering: Example 2
• Document Clustering
  – Goal: Find groups of documents that are similar to each other based on the important terms appearing in them
  – Approach: Identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster
  – Gain: Information retrieval can use the clusters to relate a new document or search term to the clustered documents
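
A similarity measure based on term frequencies could look like the sketch below. It assumes cosine similarity over raw word counts, which is one common choice; the slide does not specify the measure. The three documents are made up.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase word occurs in the document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = term_frequencies("the database stores data and the database answers queries")
doc_b = term_frequencies("a query asks the database for data")
doc_c = term_frequencies("flood warning issued for the river")

# The two database documents score higher with each other than with the flood document.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # prints True
```

A clustering algorithm can then use this score in place of a distance between numeric records.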

Document Clustering
• Clustering points: Twitter feeds / blog comments
• Similarity measure: How many words are common in these “documents” (after some word filtering)
• Applications:
  (1) Identify issues with a product more quickly and in greater detail
  (2) Identify the occurrence of, and details about, a disaster event while it is occurring (used for flooding in Oklahoma and North Dakota)
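
The common-word measure on this slide is simple enough to sketch directly. The stop-word list and the three posts below are made up for illustration; real word filtering would likely be more thorough.

```python
# Hypothetical filter list: very common words that carry little meaning.
STOP_WORDS = {"the", "a", "is", "my", "of", "and", "to", "in", "on", "it"}

def filtered_words(text):
    """Lowercase the text, strip basic punctuation, and drop stop words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct words the two documents share after filtering."""
    return len(filtered_words(doc_a) & filtered_words(doc_b))

posts = [
    "The battery of my phone is draining fast!",
    "Phone battery draining after the update",
    "Flooding on Main Street, roads closed",
]
print(common_word_count(posts[0], posts[1]))  # prints 3 (battery, phone, draining)
print(common_word_count(posts[0], posts[2]))  # prints 0
```

Posts about the same product issue share several words and would fall into the same cluster, while the flooding post stays separate.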