Human, Animal, and Machine Learning
Vasile Rus
http://www.cs.memphis.edu/~vrus/teaching/cogsci

Overview
• Announcements
• Introduction to Weka

Announcements
• Biomedical Term Classification project
• Overfitting vs. overtraining

Learning With Machines
• Learn patterns from data in order to explain that data and make predictions about future data
• Input: a set of examples/instances
• Output: predictions about new examples/instances

SP-ML: Standard Process for Machine Learning

SP-ML: Phases
• Domain Understanding: understanding of project objectives and requirements; machine learning problem definition
• Data Understanding: initial data collection and familiarization; identification of data quality problems
• Data Preparation: table, record, and attribute selection; data transformation and cleaning
• Modeling: selection and application of modeling techniques; parameter calibration
• Evaluation: evaluation of how well the business objectives and issues were addressed
• Deployment: deployment of the resulting model; implementation of a repeatable machine learning process

Phases and Tasks
• Domain Understanding: Determine Project Objectives, Assess Situation, Determine Machine Learning Goals, Produce Project Plan
• Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
• Data Preparation: Clean Data, Construct Data, Integrate Data, Format Data
• Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
• Evaluation: Evaluate Results, Review Process, Determine Next Steps
• Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project

WEKA
• Waikato Environment for Knowledge Analysis (WEKA)
• Weka: a flightless bird
• Image copyright: Martin Kramer ([email protected])

WEKA
• Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
  – Released under the GPL
• Supports the whole process of experimental machine learning (SP-ML):
  – preparation of input data
  – statistical evaluation of learning schemes
  – visualization of input data and of the results of learning
• Used for education, research, and applications

WEKA References
• Website: http://www.cs.waikato.ac.nz/~ml/weka
• Wiki: http://weka.sourceforge.net/wiki/
• Documentation Wiki: http://weka.sourceforge.net/wekadoc/

Main Features
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
• 3 algorithms for finding association rules
• Interfaces:
  – the command line
  – 3 graphical user interfaces:
    • the Explorer (exploratory data analysis)
    • the Experimenter (environment for comparing learning schemes)
    • the KnowledgeFlow (new process-model-inspired interface)

Typical Problems
• Classification: find the class a new instance belongs to
  – e.g. whether a cell is a normal cell or a cancerous cell
• Numeric prediction (regression): a variation of classification where the output is numeric
  – e.g. the frequency of cancerous cells found
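
The difference between the two tasks can be illustrated with a minimal nearest-neighbor sketch (in Python for brevity; the function name and toy data below are invented for illustration, not taken from WEKA or the slides):

```python
# Toy illustration of the two prediction tasks:
# classification returns a class label, regression returns a number.

def nearest_neighbor(train, x):
    """Return the target of the training instance closest to x (1-NN)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Classification: feature -> class label ("normal" vs "cancerous")
labeled = [(1.0, "normal"), (1.2, "normal"), (4.8, "cancerous"), (5.1, "cancerous")]
print(nearest_neighbor(labeled, 4.5))   # closest training instance is (4.8, "cancerous")

# Numeric prediction (regression): feature -> numeric target
numeric = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
print(nearest_neighbor(numeric, 2.2))   # closest training instance is (2.0, 5.0)
```

The same algorithm serves both tasks; only the type of the target value changes.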

Typical Problems
• Clustering: group the instances into classes when no class labels are known
  – e.g. classifying a new disease into possible types/groups
• Association: finding rules/relationships among attributes
  – e.g. a high-blood-pressure patient is likely to also have heart disease

Explorer: pre-processing the data
• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
  – discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
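
Two of the filter ideas listed above, normalization and discretization, are simple enough to sketch from scratch. The Python functions below are independent re-implementations of the underlying ideas, not WEKA code, and the sample values are the temperatures from the weather data:

```python
def normalize(values):
    """Min-max normalization of a numeric attribute to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins):
    """Equal-width discretization: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(normalize(temps))       # 64 maps to 0.0, 85 maps to 1.0
print(discretize(temps, 3))   # each temperature replaced by a bin index 0, 1, or 2
```

WEKA's actual filters offer many more options (e.g. supervised discretization), but the transformations have this shape.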

Attribute-Relation File Format (ARFF)
• WEKA’s file format “ARFF” was created by Andrew Donkin
  – ARFF was rumored to stand for “Andrew’s Ridiculous File Format”

Weather Problem

% ARFF file for the weather data with some
% numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

Weather Problem

@data
%
% 14 instances
%
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Comma-Separated Value (CSV)
• Most spreadsheet and database programs allow you to export data in CSV format
• All you then need to do is use a text editor to add:
  – the dataset’s name in a @relation line,
  – the attribute information in @attribute lines, and
  – a @data line before the rows
• According to the textbook, WEKA should also convert a CSV file into an ARFF file automatically
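
Assuming headerless CSV rows and known attribute declarations, the manual conversion described above amounts to prepending a few lines. A hypothetical Python sketch (the function name and sample rows are invented for illustration):

```python
import csv, io

def csv_to_arff(csv_text, relation, attribute_types):
    """Wrap headerless CSV rows in ARFF syntax: @relation, @attribute lines, @data.
    attribute_types: list of (name, type) where type is e.g. 'real' or '{yes, no}'."""
    lines = ["@relation " + relation, ""]
    for name, atype in attribute_types:
        lines.append(f"@attribute {name} {atype}")
    lines += ["", "@data"]
    for row in csv.reader(io.StringIO(csv_text)):
        lines.append(",".join(cell.strip() for cell in row))
    return "\n".join(lines)

arff = csv_to_arff(
    "sunny,85,85,FALSE,no\novercast,83,86,FALSE,yes",
    "weather",
    [("outlook", "{sunny, overcast, rainy}"), ("temperature", "real"),
     ("humidity", "real"), ("windy", "{TRUE, FALSE}"), ("play", "{yes, no}")],
)
print(arff)
```

Real CSV exports may carry a header row and quoted strings; those would need the small extra handling that WEKA's own converter provides.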

Attributes
• Nominal/categorical
• Numeric
• String
  – in an instance, use quotation marks to specify values
• Date
  – yyyy-MM-ddTHH:mm:ss (ISO-8601 format)

Irises Problem
• One of the most famous datasets used in machine learning
• Dates back to seminal work by the statistician R. A. Fisher
• Three types of plants: Iris setosa, Iris versicolor, and Iris virginica
• Four attributes: sepal length, sepal width, petal length, and petal width
  – all measured in cm

Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal quantities (classification) or numeric quantities (regression)
• Implemented learning schemes include:
  – decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, …
• “Meta”-classifiers include:
  – bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …

Decision Trees
• A decision tree is a predictive model: a mapping from observations about an item to conclusions about the item’s target value
  – each interior node corresponds to a variable
  – an arc to a child represents a possible value of that variable
  – a leaf represents the predicted value of the target variable given the values of the variables along the path from the root
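
This structure can be made concrete with a small sketch: a tree as nested Python dicts, where interior nodes test a variable, arcs are its values, and leaves are predicted class values. The tree below is hand-made for illustration (not one induced by WEKA), and the boolean attribute `humidity_high` is an invented stand-in for a humidity test:

```python
# Interior nodes: {variable: {value: subtree_or_leaf}}; leaves: class strings.
tree = {
    "outlook": {
        "sunny":    {"humidity_high": {True: "no", False: "yes"}},
        "overcast": "yes",
        "rainy":    {"windy": {True: "no", False: "yes"}},
    }
}

def predict(node, instance):
    """Follow arcs from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        variable = next(iter(node))                 # the variable tested at this node
        node = node[variable][instance[variable]]   # follow the matching arc
    return node

print(predict(tree, {"outlook": "overcast"}))                       # yes
print(predict(tree, {"outlook": "sunny", "humidity_high": True}))   # no
```

Each prediction is exactly the path-following described above: one attribute test per interior node, ending at a leaf.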

Decision Trees
• The machine learning technique for inducing a decision tree from data is called decision tree learning or, colloquially, decision trees

Explorer: clustering data
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes include:
  – k-means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
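
The core assignment/update loop behind a scheme like k-means can be sketched in a few lines. This is a from-scratch illustration on 1-D points with invented data, not WEKA's implementation:

```python
def k_means(points, centers, iterations=10):
    """Alternate two steps: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = k_means([1.0, 1.5, 2.0, 8.0, 8.5, 9.0], [0.0, 10.0])
print(centers)   # the two centers settle at the means of the two groups
```

The instances here carry no class labels; the grouping emerges purely from similarity, as the slide describes.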

Explorer: finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
  – works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  – milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
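
The support and confidence numbers attached to a rule X ⇒ Y can be computed directly. A Python sketch with invented toy transactions (the milk/butter rule itself comes from the slide; the baskets do not):

```python
def support(transactions, itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

baskets = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "eggs", "bread"},
    {"milk", "cheese"},
]
lhs, rhs = {"milk", "butter"}, {"bread", "eggs"}
print(support(baskets, lhs | rhs))    # transactions containing all four items
print(confidence(baskets, lhs, rhs))  # share of lhs-transactions that also have rhs
```

Apriori's contribution is not these formulas but the efficient search: it only extends itemsets whose subsets already meet the minimum support.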

Explorer: attribute selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive
• Attribute selection methods have two parts:
  – a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  – an evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of the two
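
One of the listed evaluation methods, information gain, is easy to sketch: it measures how much the entropy of the class drops after splitting on an attribute. A from-scratch Python illustration with a small invented sample (not WEKA's evaluator):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(instances, attribute, target):
    """Entropy of the target minus the weighted entropy remaining
    after splitting the instances on the attribute's values."""
    labels = [inst[target] for inst in instances]
    gain = entropy(labels)
    for value in set(inst[attribute] for inst in instances):
        subset = [inst[target] for inst in instances if inst[attribute] == value]
        gain -= (len(subset) / len(instances)) * entropy(subset)
    return gain

data = [{"outlook": "overcast", "play": "yes"},
        {"outlook": "overcast", "play": "yes"},
        {"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "yes"}]
print(information_gain(data, "outlook", "play"))  # positive: outlook is informative
```

A ranking search method would simply compute this score for every attribute and sort.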

Explorer: data visualization
• Visualization is very useful in practice: e.g. it helps to determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-D) and pairs of attributes (2-D)
  – to do: rotating 3-D visualizations (XGobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function

Performing experiments
• The Experimenter makes it easy to compare the performance of different learning schemes
• For classification and regression problems
• Results can be written to a file or a database
• Evaluation options: cross-validation, learning curve, hold-out
• Can also iterate over different parameter settings
• Significance testing built in!
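
The cross-validation option can be sketched as follows: shuffle the data, split it into k folds, and test on each fold once while training on the rest. A hypothetical Python illustration of the splitting logic (not the Experimenter's actual code):

```python
import random

def cross_validation_folds(instances, k=10, seed=1):
    """Shuffle and split into k folds; yield (train, test) pairs so that
    every instance appears in exactly one test set."""
    data = list(instances)
    random.Random(seed).shuffle(data)       # fixed seed for repeatability
    folds = [data[i::k] for i in range(k)]  # round-robin split into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
splits = list(cross_validation_folds(data, k=5))
print(len(splits))                            # 5 train/test pairs
print(len(splits[0][0]), len(splits[0][1]))   # 16 training, 4 test instances
```

Averaging a scheme's score over the k test sets gives the cross-validation estimate that the built-in significance tests then compare across schemes.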

Summary
• Intro to Weka

Next Time
• Concept Learning
