An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/
Cornelia Caragea

Outline
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results

Machine Learning Software
• Suites (General Purpose)
  • WEKA (Source: Java)
  • MLC++ (Source: C++)
  • SAS
  • List from KDNuggets (Various)
• Specific
  • Classification: C4.5, SVMlight
  • Association Rule Mining
  • Bayesian Net
  • …
• Commercial vs. Free

What does WEKA do?
• Implementations of state-of-the-art learning algorithms
• Main strengths are in classification
• Regression, association rules, and clustering algorithms
• Extensible, so new learning schemes can be tried
• Large variety of handy tools (transforming datasets, filters, visualization, etc.)

WEKA resources
• API Documentation, Tutorials, Source code
• WEKA mailing list
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Weka-related Projects:
  • Weka-Parallel - parallel processing for Weka
  • RWeka - linking R and Weka
  • YALE - Yet Another Learning Environment
  • Many others…

Outline
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results

Preparing Data
ARFF Data Format
• Header – describes the attribute types
• Data – (instances, examples) comma-separated list
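
To make the header/data distinction concrete, here is a minimal Python sketch that embeds a small ARFF file and separates its two parts. The relation name, attributes, and values are made up for illustration, and the parser is not WEKA's own: it assumes the simplest case, with no quoting, sparse rows, or comment lines.

```python
# A tiny, made-up ARFF file: header lines first, then comma-separated instances.
ARFF_TEXT = """\
@relation weather_toy
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def parse_simple_arff(text):
    """Split simple ARFF text into (attributes, instances).

    Illustrative only: ignores quoting, sparse data, and % comments.
    """
    attributes = []   # (name, type) pairs taken from the header
    instances = []    # one list of string values per data row
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        lowered = line.lower()
        if in_data:
            instances.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": the type may itself contain spaces.
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, instances


attributes, instances = parse_simple_arff(ARFF_TEXT)
print(attributes)
print(instances)
```

The last attribute (`play` here) plays the role of the class attribute only by convention in this sketch.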

Launching WEKA
• java -jar weka.jar

Load Dataset into WEKA

Data Filters
• Useful support for data preprocessing
• Removing or adding attributes, resampling the dataset, removing examples, etc.
• Can create stratified cross-validation folds of the given dataset, so that class distributions are approximately retained within each fold
• Data is typically split 2/3 for training and 1/3 for testing
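
The idea of a stratified 2/3 to 1/3 split can be sketched in plain Python as follows. This is not a WEKA filter: the function name, the seed, and the toy labels are assumptions made for the example. Each class is split separately so that the class distribution is approximately retained in both parts.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)   # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = ["yes"] * 9 + ["no"] * 6   # made-up class labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these toy labels, both the training and the test part keep the 3:2 ratio of "yes" to "no".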

Data Filters

Outline
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results

Building Classifiers
• A classifier model is a mapping from dataset attributes to the class (target) attribute. How the model is created and what form it takes differ between classifiers.
• Decision Tree and Naïve Bayes Classifiers
• Which one is the best?
  • No Free Lunch!

Building Classifiers

(1) weka.classifiers.rules.ZeroR
• Class for building and using a 0-R classifier
• Majority classifier
• Predicts the mean (for a numeric class) or the mode (for a nominal class)
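
The behaviour described above is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not WEKA's ZeroR code; the function name and the toy class values are assumptions.

```python
from collections import Counter


def zero_r_fit(class_values):
    """Return the single value a 0-R style classifier predicts for every instance."""
    if all(isinstance(value, (int, float)) for value in class_values):
        # Numeric class: predict the mean of the training class values.
        return sum(class_values) / len(class_values)
    # Nominal class: predict the mode (most frequent class value).
    return Counter(class_values).most_common(1)[0][0]


nominal_prediction = zero_r_fit(["yes", "yes", "no", "yes", "no"])
numeric_prediction = zero_r_fit([2.0, 4.0, 9.0])
print(nominal_prediction, numeric_prediction)
```

Because it ignores every attribute except the class, such a predictor is mainly useful as a baseline that other classifiers should beat.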

Exercise 1
• http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex1.html

(2) weka.classifiers.bayes.NaiveBayes
• Class for building a Naive Bayes classifier
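
As a reminder of what a Naive Bayes classifier computes, here is a small Python sketch for nominal attributes. It is not WEKA's implementation: the function names, the add-one (Laplace) smoothing, and the toy data are assumptions chosen for this example. The predicted class maximizes the class prior multiplied by the per-attribute conditional probabilities, which are treated as independent given the class.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect the counts needed for a nominal Naive Bayes model."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    attr_values = defaultdict(set)        # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attr_values[i].add(value)
    return class_counts, value_counts, attr_values


def predict(model, row):
    """Return the class with the highest (log) posterior score."""
    class_counts, value_counts, attr_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)   # log prior
        for i, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities (an assumption here).
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(attr_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")), predict(model, ("rainy", "cool")))
```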

(3) weka.classifiers.trees.J48
• Class for generating a pruned or unpruned C4.5 decision tree

Test Options
• Percentage Split (2/3 Training; 1/3 Testing)
• Cross-validation
  • Estimates the generalization error by resampling when data is limited; the error estimate is averaged over the folds
  • stratified
  • 10-fold
  • leave-one-out (LOO)
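
The averaging behind cross-validation can be sketched in plain Python. The helper names and toy labels below are assumptions, a majority-class predictor stands in for a real learner, and the folds are neither shuffled nor stratified, to keep the example short. Leave-one-out is the special case where the number of folds equals the number of instances.

```python
from collections import Counter


def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_instances) if i not in held_out]
        yield train, test


def cross_validated_error(labels, k):
    """Average error rate of a majority-class predictor over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(labels), k):
        # "Train" on the training folds: remember the most frequent class.
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        # Test on the held-out fold.
        wrong = sum(1 for i in test if labels[i] != majority)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / len(fold_errors)


labels = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
five_fold_error = cross_validated_error(labels, k=5)
loo_error = cross_validated_error(labels, k=len(labels))   # leave-one-out
print(five_fold_error, loo_error)
```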

Outline
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results

Understanding Output

Decision Tree Output (1)

Decision Tree Output (2)

Exercise 2
• http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex2.html

Performance Measures
• Accuracy & Error rate
• Confusion matrix – contingency table
• True Positive rate & False Positive rate (Area under Receiver Operating Characteristic)
• Precision, Recall & F-Measure
• Sensitivity & Specificity
• For more information on these, see
  • uisp09-Evaluation.ppt
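
For a two-class problem, most of these measures follow directly from the four cells of the confusion matrix. The Python sketch below uses made-up labels and a hypothetical function name, and it assumes both classes occur among the actual and the predicted labels so that no denominator is zero.

```python
def binary_metrics(actual, predicted, positive="yes"):
    """Compute common measures from a 2x2 confusion matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # same as TP rate and sensitivity
    return {
        "confusion_matrix": [[tp, fn], [fp, tn]],   # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "error_rate": (fp + fn) / len(pairs),
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (fp + tn),              # equals 1 - FP rate
    }


actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
metrics = binary_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```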

Decision Tree Pruning
• Overcome over-fitting
• Pre-pruning and post-pruning
• Reduced error pruning
• Subtree raising with different confidence
• Comparing tree size and accuracy

Subtree replacement
• Bottom-up: a tree is considered for replacement once all its subtrees have been considered
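
The bottom-up order can be sketched in Python as below. This is a simplified illustration, not J48's procedure: it decides each replacement by counting errors on a held-out pruning set (the reduced error pruning idea), and the dictionary tree layout, the stored "majority" label, and the toy data are all assumptions of the sketch.

```python
def classify(node, instance):
    """Follow the tree down to a leaf and return its label."""
    while "label" not in node:
        node = node["children"][instance[node["attribute"]]]
    return node["label"]


def count_errors(node, instances):
    """Number of instances the (sub)tree misclassifies."""
    return sum(1 for inst in instances if classify(node, inst) != inst["class"])


def prune(node, instances):
    """Bottom-up subtree replacement using a held-out pruning set."""
    if "label" in node:
        return node
    # First consider every subtree, passing down only the instances that reach it.
    for value, child in list(node["children"].items()):
        reaching = [inst for inst in instances if inst[node["attribute"]] == value]
        node["children"][value] = prune(child, reaching)
    # Only then consider replacing this node by a leaf with its majority label.
    leaf = {"label": node["majority"]}
    if count_errors(leaf, instances) <= count_errors(node, instances):
        return leaf
    return node


# Made-up tree: the "windy" test under "sunny" does not help on the pruning set.
tree = {
    "attribute": "outlook", "majority": "yes",
    "children": {
        "sunny": {"attribute": "windy", "majority": "no",
                  "children": {"true": {"label": "no"}, "false": {"label": "yes"}}},
        "rainy": {"label": "yes"},
    },
}
pruning_set = [
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true", "class": "yes"},
]
pruned = prune(tree, pruning_set)
print(pruned)
```

In this toy case the inner "windy" subtree is replaced by a leaf, while the root is kept because replacing it would increase the error on the pruning set.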

Subtree Raising
• Deletes node and redistributes instances
• Slower than subtree replacement

Exercise 3
• http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex3.html