Classification and Prediction

  • Slides: 10

Classification, Regression, and Prediction

  • Classification:
      • Predict categorical class labels
      • Classify data: construct a model based on the training set and the values (class labels) of a classifying attribute, then use the model to classify new data
  • Regression:
      • Model continuous-valued functions, i.e., predict unknown or missing values
  • Prediction:
      • Classification + Regression
      • Sometimes refers only to regression (e.g., in the textbook)

Classification: A Two-Step Process

  • Step 1. Model construction: describing a set of predetermined classes
      • Set of tuples used for model construction: training set
      • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      • Model is represented as classification rules, decision trees, or mathematical formulae

Example rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
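The example rule can be written directly as a small function, which shows that a rule set is itself a classifier. This is a minimal sketch in Python; the function name `predict_tenured`, the case-insensitive comparison of `rank`, and the default answer of 'no' are assumptions made for illustration and do not come from the slides.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: compare case-insensitively so 'Professor' matches 'professor'.
    if rank.strip().lower() == "professor" or years > 6:
        return "yes"
    # Assumption: the slide states only the 'yes' branch; default to 'no'.
    return "no"


print(predict_tenured("Professor", 4))            # rank condition holds
print(predict_tenured("Assistant Professor", 7))  # years condition holds
print(predict_tenured("Assistant Professor", 3))  # neither condition holds
```

A learned decision tree or a mathematical formula would fill the same role: a function from attribute values to a class label.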

Classification: A Two-Step Process

  • Step 2. Model usage: for classifying future or unknown objects
  • Estimate predictive accuracy of model
      • Known label of each test sample is compared with the classified result from the model
      • Accuracy rate is the percentage of test set samples that are correctly classified by the model
      • Test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and overfitting goes undetected

Example rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Example unseen sample: (Jeff, Professor, 4)
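The accuracy estimate described here is a count of matches between predicted and known labels on a held-out test set. The sketch below assumes Python; the helper names `rule_model` and `accuracy` are hypothetical, and the four test tuples and their labels are invented for illustration only (the slides do not list the test data). Only the rule and the sample (Jeff, Professor, 4) come from the slides.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, int]  # (name, rank, years)


def rule_model(rank: str, years: int) -> str:
    # The slide's rule; returning 'no' otherwise is an assumption.
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(model: Callable[[str, int], str],
             test_set: List[Tuple[Sample, str]]) -> float:
    """Fraction of test samples whose predicted label equals the known label."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(
        1 for (_name, rank, years), label in test_set
        if model(rank, years) == label
    )
    return correct / len(test_set)


# Hypothetical held-out test set; names and labels are made up.
test_set = [
    (("Ana", "Assistant Professor", 2), "no"),
    (("Ben", "Associate Professor", 7), "no"),   # the rule predicts 'yes' here
    (("Cleo", "Professor", 5), "yes"),
    (("Dev", "Assistant Professor", 7), "yes"),
]

print(accuracy(rule_model, test_set))  # 3 of 4 correct -> 0.75
print(rule_model("Professor", 4))      # the unseen sample (Jeff, Professor, 4) -> yes
```

Because none of these tuples were used to build the rule, the 0.75 figure is an honest estimate for this toy data; scoring the rule on its own training tuples would tend to report a higher number.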

Classification Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use Model in Prediction

Test Data and Unseen Data → Classifier (Model)

Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Unseen Data: (Jeff, Professor, 4) → Tenured?

Supervised versus Unsupervised Learning

  • Supervised learning (classification)
      • Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
      • New data is classified based on the training set
  • Unsupervised learning (clustering)
      • Class labels of the training data are unknown
      • Given a set of measurements, observations, etc., the goal is to establish the existence of classes or clusters in the data
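The contrast can be made concrete on one-dimensional data. The sketch below, in Python, is an illustration under stated assumptions rather than a method from the slides: a nearest-class-mean classifier stands in for supervised learning, a two-cluster k-means stands in for unsupervised learning, and the data values, the labels 'junior' and 'senior', and all function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def fit_class_means(samples: List[Tuple[float, str]]) -> Dict[str, float]:
    """Supervised: use the given labels to compute one mean per class."""
    by_label: Dict[str, List[float]] = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def classify(x: float, class_means: Dict[str, float]) -> str:
    """Assign x to the class whose mean is closest."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


def two_means(xs: List[float], iterations: int = 10) -> List[int]:
    """Unsupervised: split xs into two clusters without using any labels."""
    c0, c1 = min(xs), max(xs)  # initial centers: the two extremes
    assignment = [0] * len(xs)
    for _ in range(iterations):
        assignment = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        group0 = [x for x, a in zip(xs, assignment) if a == 0]
        group1 = [x for x, a in zip(xs, assignment) if a == 1]
        if group0:
            c0 = mean(group0)
        if group1:
            c1 = mean(group1)
    return assignment


# Hypothetical data: years of service, with labels available only in the supervised case.
labeled = [(1.0, "junior"), (2.0, "junior"), (8.0, "senior"), (9.0, "senior")]
means = fit_class_means(labeled)
print(classify(7.0, means))                # senior
print(two_means([1.0, 2.0, 8.0, 9.0]))     # [0, 0, 1, 1]
```

The supervised model returns a named class because the names were supplied with the training data. The clustering step can only report that two groups exist; what the groups mean is left to interpretation.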

Classification and Prediction

  • What is classification? What is prediction?
  • Issues regarding classification and prediction
  • Classification by decision tree induction
  • Bayesian classification
  • Classification based on concepts from association rule mining
  • Other classification methods
  • Prediction
  • Classification accuracy
  • Summary

Issues (1): Data Preparation

  • Data cleaning
      • Preprocess data in order to reduce noise (e.g., by smoothing) and handle missing values (e.g., use the most commonly occurring value)
      • Helps to reduce confusion during learning
  • Relevance analysis (feature selection)
      • Remove irrelevant or redundant attributes
  • Data transformation
      • Generalize (to higher-level concepts) and/or normalize data (scaling values so that they fall within a specified range)
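Two of the steps named here, filling missing values with the most commonly occurring value and scaling values into a specified range, are short enough to sketch. The Python below is an illustration only: the function names, the use of `None` to mark a missing value, and the sample data are assumptions, and min-max scaling is one common way to normalize, not necessarily the one the course uses.

```python
from collections import Counter
from typing import List, Optional


def fill_missing_with_mode(values: List[Optional[str]]) -> List[str]:
    """Replace missing entries (None) with the most commonly occurring value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def min_max_normalize(values: List[float],
                      new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound (an assumption).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical attribute columns.
ranks = ["professor", None, "assistant", "professor", None]
years = [2.0, 7.0, 5.0, 10.0]

print(fill_missing_with_mode(ranks))
# ['professor', 'professor', 'assistant', 'professor', 'professor']
print(min_max_normalize(years))
# [0.0, 0.625, 0.375, 1.0]
```

Normalizing keeps an attribute with a large numeric range from dominating attributes with small ranges in distance-based methods.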

Issues (2): Evaluating Classification Methods

  • Predictive accuracy
      • Ability to correctly predict the class label
  • Speed
      • Time to construct model
      • Time to use model
  • Robustness
      • Make correct predictions given noise and missing values
  • Scalability
      • Construct model efficiently given data size
  • Interpretability
      • Understanding and insight provided by model
  • Goodness of rules
      • Decision tree size
      • Compactness of classification rules