Classification Prediction Classification Supervised Learning Use a training

  • Slides: 20
Download presentation
Classification / Prediction

Classification / Prediction

Classification “Supervised Learning” Use a “training set” of examples to create a model that

Classification “Supervised Learning” Use a “training set” of examples to create a model that is able to predict, given an unknown sample, which of two or more classes that sample belongs to. ?

What we’ll cover • How to build a classifier. • How to evaluate a

What we’ll cover • How to build a classifier. • How to evaluate a classifier. • Using Gene. Pattern to classify expression data.

What Is a Classifier • A predictive rule that uses a set of inputs

What Is a Classifier • A predictive rule that uses a set of inputs (genes) to predict the values of the output (phenotype). • Known examples (train data) are used to build the predictive rule. • Goal: – Achieve high generalization (predictive) power. – Avoid over-fitting.

Classification Computational methodology Data Known Classes Assess Gene-Class Correlation Feature (Marker) Selection Build Classifier

Classification Computational methodology Data Known Classes Assess Gene-Class Correlation Feature (Marker) Selection Build Classifier Test Classifier by Cross. Validation Evaluate Classifier on Independent Test Set

Classification Computational methodology Expression Data Known Classes Assess Gene-Class Correlation Feature Selection Regression Trees

Classification Computational methodology Expression Data Known Classes Assess Gene-Class Correlation Feature Selection Regression Trees (CART) Weighted Voting Build Classifier k-Nearest Neighbors Support Vector Machines Test Classifier by Cross. Validation Evaluate Classifier on Independent Test Set

Classifiers Important issues: – Few cases, many variables (genes) – redundancy: many highly correlated

Classifiers Important issues: – Few cases, many variables (genes) – redundancy: many highly correlated genes. – noise: measurements are very imprecise. – feature selection: reducing p is a necessity. Avoid over-fitting.

K-nn Classifier Example: K=5, 2 genes, 2 classes project samples in gene space gene

K-nn Classifier Example: K=5, 2 genes, 2 classes project samples in gene space gene 2 class black class orange gene 1

K-nn Classifier Example: K=5, 2 genes, 2 classes project unknown sample gene 2 class

K-nn Classifier Example: K=5, 2 genes, 2 classes project unknown sample gene 2 class black class orange ? gene 1

K-nn Classifier Example: K=5, 2 genes, 2 classes class black "consult" 5 closest neighbors:

K-nn Classifier Example: K=5, 2 genes, 2 classes class black "consult" 5 closest neighbors: - 3 black - 2 orange Distance measures: • Euclidean distance gene 2 • 1 -Pearson correlation • … class orange ? gene 1

Support Vector Machine (SVM) Noble, Nat Biotech 2006

Support Vector Machine (SVM) Noble, Nat Biotech 2006

Weighted Voting new sample Mixture of Experts approach: – Each gene casts a vote

Weighted Voting new sample Mixture of Experts approach: – Each gene casts a vote for one of the possible classes. – The vote is weighted by a score assessing the reliability of the expert (in this case, the gene). – The class receiving the highest vote will be the predicted class. – The vote can be used as a proxy for the probability of class membership (prediction strength). Class 1 centroid g 1 g 2 … gi … gn-1 gn Class 2 centroid ? g 1 g 2 … gi … gn 1 gn Slonim et al. , RECOMB 2000

Class Prediction Computational methodology Expression Data Known Classes Assess Gene-Class Correlation Feature Selection Build

Class Prediction Computational methodology Expression Data Known Classes Assess Gene-Class Correlation Feature Selection Build Predictor Test Predictor by Cross. Validation Evaluate Predictor on Independent Test Set Regression Trees (CART) Weighted Voting k-Nearest Neighbors Support Vector Machines

Testing the Classifier • Evaluation on independent test set – Build the classifier on

Testing the Classifier • Evaluation on independent test set – Build the classifier on the train set – Assess prediction performance on test set. • Maximize generalization/Avoid overfitting. • Performance measure error rate = # of cases correctly classified total # of cases

Testing the Classifier • Evaluation on independent test set – What if we don’t

Testing the Classifier • Evaluation on independent test set – What if we don’t have an independent test set? • Cross Validation (XV): – Split the dataset into n folds (e. g. , 10 folds of 10 cases each). – For each fold (e. g. , for each group of 10 cases), • train (i. e. , build model) on n-1 folds (e. g. , on 90 cases), • test (i. e. , predict) on left-out fold (e. g. , on remaining 10 cases). – Combine test results. – Frequently, leave-one-out XV (when small sample size).

Testing the Classifier Learning curves – leave one out cross validation ALL vs. MLL

Testing the Classifier Learning curves – leave one out cross validation ALL vs. MLL vs. AML

Testing the Classifier • Error rate estimate: – Evaluate on independent test set: •

Testing the Classifier • Error rate estimate: – Evaluate on independent test set: • Best error estimate. – Cross Validation • Needed when small sample size and for model selection.

Classification Cookbook • Start by splitting data into train and test set (stratified). “Forget”

Classification Cookbook • Start by splitting data into train and test set (stratified). “Forget” about the test set until the very end. • Explore different feature selection methods and different classifiers on train set by XV. • Once the “best” classifier and best classifier parameters have been selected (based on XV) – Build a classifier with given parameters on entire train set. – Apply classifier to test set.

Classification Gene. Pattern methods • Split data into train and test set – Split.

Classification Gene. Pattern methods • Split data into train and test set – Split. Dataset. Train. Test • Explore different feature selection methods and different classifiers on train set by CV. – CARTXValidation – KNNXValidation – Weighted. Voting. XValidation • Once the “best” classifier and best classifier parameters have been selected (based on CV) – CART – KNN – Weighted. Voting • Examine results: – Prediction. Reults. Viewer – Feature. Summary. Viewer

Classification Example

Classification Example