CSE 881: Data Mining
Lecture 1: Introduction

What is Data Mining? Definition 1
• A field of study in computer science that focuses on how to automatically draw interesting insights from data
• Lies at the intersection of database systems, artificial intelligence, machine learning, statistics, and other related disciplines

What is Data Mining? Definition 2
• The process of automatically discovering useful information from large data repositories
  – What is data?
  – What are the data mining tasks?

What is Data?
• Collection of objects and their attributes
• An object is also known as a record, data point, sample, entity, or instance
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, feature, or observation
[Figure: example data table with the labels "Attributes" and "Objects"]

Types of Data

Primary vs Secondary Data Analysis
• David Hand (Data Mining: Statistics and More?)
  – Primary data analysis
    ◦ Data is generated with a particular question in mind through careful design of experiments
    ◦ Data is analyzed to prove or disprove a hypothesis
  – Secondary data analysis
    ◦ Data is collected without a specific question in mind
    ◦ Data is analyzed to model its underlying structure and find consistent and replicable patterns
• Data mining is mostly concerned with secondary data analysis

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• There is a distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    ◦ Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    ◦ Example: attribute values for ID and age are integers
    ◦ But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value

Properties of Attribute Values
• It is important to understand the properties of your attribute values
[Figure: two measurement scales for length. First caption: "This scale preserves only the ordering property of length." Second caption: "This scale preserves the ordering and additivity properties of length."]

Types of Attributes
• There are different types of attributes
  – Nominal
    ◦ Examples: ID numbers, eye color, zip codes
  – Ordinal
    ◦ Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
  – Interval
    ◦ Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
    ◦ Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + -
  – Multiplication: * /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties
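
The cumulative relationship between the four attribute types and the four properties can be sketched in code. This is only an illustration: the names `PROPERTIES`, `LEVELS`, `properties_of`, and `supports` are hypothetical helpers, not part of the lecture.

```python
# The four properties from the slide, in the order attribute types accumulate them.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Each attribute type possesses the first n properties in that list.
LEVELS = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def properties_of(attribute_type):
    """Return the properties that an attribute of the given type possesses."""
    return PROPERTIES[: LEVELS[attribute_type]]


def supports(attribute_type, prop):
    """Check whether a property (e.g. 'order') is meaningful for the type."""
    return prop in properties_of(attribute_type)


print(properties_of("ordinal"))                 # ['distinctness', 'order']
print(supports("interval", "multiplication"))   # False
print(supports("ratio", "multiplication"))      # True
```

A lookup like this makes it easy to check, before an analysis, whether an operation such as averaging (which needs addition) is meaningful for a given attribute.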

Difference Between Ratio and Interval
• Is it physically meaningful to say that a temperature of 10° is twice as hot as a temperature of 5° on
  – the Celsius scale?
  – the Fahrenheit scale?
  – the Kelvin scale?
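
One way to explore the question is to express the same two physical temperatures on all three scales and compare the ratios. The sketch below assumes the readings 10° and 5° are Celsius values (an assumption made for illustration); the function names are hypothetical.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin."""
    return c + 273.15


hot_c, cold_c = 10.0, 5.0  # assumed to be Celsius readings

ratio_celsius = hot_c / cold_c                                                   # 2.0
ratio_fahrenheit = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(cold_c)  # 50/41, about 1.22
ratio_kelvin = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cold_c)              # 283.15/278.15, about 1.018

print(f"Celsius ratio:    {ratio_celsius:.3f}")
print(f"Fahrenheit ratio: {ratio_fahrenheit:.3f}")
print(f"Kelvin ratio:     {ratio_kelvin:.3f}")
```

The ratio changes with the choice of scale for Celsius and Fahrenheit because their zero points are arbitrary, which is why they are interval attributes. Kelvin has a true zero, so ratios of Kelvin values are physically meaningful.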

The type of an attribute can affect the types of operations that can be performed on it

Qualitative and Quantitative Attributes
• Qualitative Attribute
  – Nominal
  – Ordinal
• Quantitative Attribute
  – Interval
  – Ratio

Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – In practice, real values can only be measured and represented using a finite number of digits on a computer
  – Continuous attributes are typically represented as floating-point variables

Asymmetric Binary Attributes
• Only presence (a non-zero attribute value) is regarded as important
• Examples:
  – Words present in documents
  – Items present in customer transactions
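
A small sketch may help show why only presence matters. The vocabulary and the two documents below are made up for illustration. Each document becomes a binary vector over the vocabulary, and the code counts positions where both documents contain a word versus positions where neither does.

```python
# Hypothetical vocabulary and documents (not from the lecture).
vocabulary = ["data", "mining", "cluster", "rule", "soccer", "goal", "recipe", "oven"]
doc_a = {"data", "mining", "cluster"}
doc_b = {"data", "rule"}

# One asymmetric binary attribute per word: 1 if the word is present, 0 otherwise.
vec_a = [1 if word in doc_a else 0 for word in vocabulary]
vec_b = [1 if word in doc_b else 0 for word in vocabulary]

# Count shared presences (1, 1) and shared absences (0, 0).
both_present = sum(a == 1 and b == 1 for a, b in zip(vec_a, vec_b))
both_absent = sum(a == 0 and b == 0 for a, b in zip(vec_a, vec_b))

print(vec_a)         # [1, 1, 1, 0, 0, 0, 0, 0]
print(vec_b)         # [1, 0, 0, 1, 0, 0, 0, 0]
print(both_present)  # 1
print(both_absent)   # 4
```

The shared absences outnumber the shared presences, and they would keep growing as the vocabulary grows, yet they say little about whether the documents are alike. That is the sense in which only the non-zero values carry information.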

Exercise
• State whether the attributes below are
  – Nominal, ordinal, interval, or ratio
  – Qualitative or quantitative
  – Discrete or continuous
• Time measured in terms of AM or PM
• Angles as measured in degrees between 0 and 360
• ISBN number for books
• Shirt size (small, medium, large, X-large)
• GPA for a particular course (0, 0.5, 1.0, …, 4.0)
• Overall grade for a particular course (from 0 to 100)

What are the Data Mining Tasks?
[Figure: diagram with the labels "Data", "Clustering", "Predictive Modeling", "Association Rules", and "Anomaly Detection"]

Predictive Modeling
• The task of predicting the value of an attribute based on the values of other attributes
  – The attribute to be predicted is called the target attribute (also known as dependent variable, response variable, or predictand)
  – The attributes used to make the prediction are called explanatory attributes (also known as independent variables or predictors)
• Examples
  – Predicting the future price of a stock
  – Predicting the annual rainfall at a location for the next 20 years
  – Predicting whether a customer will buy something at a website
  – Predicting who should be a friend of whom
  – Predicting which web page to display when a user enters a search query

Predictive Modeling: Regression
• The target attribute to be predicted is quantitative-valued
• Example: physiological data from a wearable device
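
As a minimal sketch of regression, the code below fits a straight line by ordinary least squares. The data are invented to resemble wearable readings (walking cadence as the explanatory attribute, heart rate as the target). The function names and the choice of a linear model are assumptions for illustration, not the method used in the lecture.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the quantitative target for a new explanatory value."""
    return slope * x + intercept


# Hypothetical readings: steps per minute and heart rate in beats per minute.
cadence = [0.0, 50.0, 100.0, 150.0]
heart_rate = [60.0, 80.0, 100.0, 120.0]

slope, intercept = fit_line(cadence, heart_rate)
print(round(slope, 3), round(intercept, 3))          # 0.4 60.0
print(round(predict(120.0, slope, intercept), 3))    # 108.0
```

The prediction is a number on a continuous scale, which is what distinguishes regression from the other predictive modeling tasks.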

Predictive Modeling: Classification
• The target attribute is nominal-valued
  – The target attribute is also known as the class
• Examples
  – Text categorization, image classification, medical diagnosis, spam detection, intrusion detection
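
To make the idea of a nominal target concrete, here is a small sketch of spam detection using a 1-nearest-neighbor rule. Both the technique and the toy data are illustrative choices, not taken from the lecture, and `classify_1nn` is a hypothetical helper.

```python
def classify_1nn(training, query):
    """Return the class label of the training example closest to the query."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda example: squared_distance(example[0], query))
    return label


# Hypothetical messages: (number of links, number of exclamation marks) -> class.
training = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((5, 4), "spam"),
    ((7, 6), "spam"),
]

print(classify_1nn(training, (6, 5)))  # spam
print(classify_1nn(training, (0, 1)))  # ham
```

The output is one of a fixed set of class labels rather than a number, so comparing predictions by magnitude or averaging them would not be meaningful.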

Predictive Modeling: Ranking
• The target attribute is ordinal-valued

Clustering
• Find groups of objects such that the objects in the same group are more similar to each other than to objects from other groups
• Example applications:
  – Document clustering, market segmentation, time series clustering
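
The grouping idea can be sketched with a one-dimensional version of k-means. K-means is one possible clustering technique, chosen here for illustration and not named in the lecture. The data points and starting centroids are made up, and `kmeans_1d` is a hypothetical helper.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])

print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No target attribute is supplied. The groups emerge only from how close the objects are to one another, which is what separates clustering from predictive modeling.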

Association Rule Mining
• Given a set of transactions, each of which contains a set of items
  – Extract association rules that predict the occurrence of an item based on the occurrences of other items
• Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
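
Rules like these are commonly evaluated with two measures: support (the fraction of transactions containing all items in the rule) and confidence (how often the consequent appears among transactions containing the antecedent). The sketch below computes both on a hypothetical set of five transactions. The transactions and function names are assumptions for illustration, since the lecture's own transaction table is not reproduced here.

```python
# Hypothetical market-basket transactions (not the lecture's data).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print(support({"Milk"}, {"Coke"}))                # 0.6
print(confidence({"Milk"}, {"Coke"}))             # 0.75
print(count({"Diaper", "Milk"} | {"Beer"}), "of", count({"Diaper", "Milk"}))  # 2 of 3
```

In this toy data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.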

Anomaly Detection
• Detect significant deviations from normal behavior
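
One simple way to make "significant deviation" operational is a z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The rule, the threshold of 2.0, and the readings below are all illustrative assumptions rather than the lecture's method.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical hourly connection counts with one unusual spike.
connections = [100, 102, 98, 101, 99, 100, 250]

print(find_anomalies(connections))  # [250]
```

A single extreme value also inflates the mean and standard deviation used to judge it, so this rule becomes less sensitive as anomalies become more common in the data.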

Anomaly Detection: Applications
• Credit card fraud detection
• Network intrusion detection
  – Typical network traffic at the university level may reach over 100 million connections per day

Exercise
• For each data set below, give examples of data mining tasks that can be performed on the data. What are the objects and attributes needed to perform the tasks?
  – US census data, which contains the demographic information for each household (number of people in the household, number of males/females, race, median household income, etc.)
  – Climate data, which contains the average daily measurements of temperature, precipitation, sea-level pressure, solar radiation, etc., for locations in North America
  – Database of NFL/NBA/MLB/NHL players, teams, and game summaries/statistics