CSCE 555 Bioinformatics, Lecture 15: Classification for Microarray Data
Lecture meeting: MW 4:00 PM-5:15 PM, SWGN 2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. University of South Carolina, Department of Computer Science and Engineering, 2008. www.cse.sc.edu
Outline
◦ Classification problem in microarray data
◦ Classification concepts and algorithms
◦ Evaluation of classification algorithms
◦ Summary
Learning set with predefined classes based on clinical outcome: bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs). Objects: arrays; feature vectors: gene expression. A classification rule learned from these arrays is applied to a new array to predict its class (good prognosis? metastasis beyond 5 yrs?). Reference: van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Learning set with predefined classes based on tumor type: B-ALL, T-ALL, AML. Objects: arrays; feature vectors: gene expression. The classification rule learned from these arrays predicts the tumor type (e.g. T-ALL?) of a new array. Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Classification/Discrimination
(Figure: a gene expression matrix X with genes as rows and samples 1-5 as columns, each sample labeled Y = normal or cancer, plus a new sample X_new whose label Y_new is unknown.)
Each object (e.g. array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG). Aim: predict Y_new from X_new.
Discrimination/Classification
Basic principles of discrimination
Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG). Aim: predict Y from X.
(Figure: objects with predefined classes 1, 2, …, K; an object with feature vector X = {red, square} over the features {colour, shape} has class label Y = 2; the classification rule must predict Y for a new X.)
KNN: Nearest neighbor classifier
Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation). The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
◦ find the k observations in the learning set closest to X
◦ predict the class of X by majority vote, i.e., choose the class that is most common among those k observations
The number of neighbors k can be chosen by cross-validation (more on this later).
3-Nearest Neighbors
(Figure: the 3 nearest neighbors of the query point qf are two of class x and one of class o, so the query is assigned class x.)
Limitation of KNN: what is K?
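As a concrete illustration, here is a minimal sketch of the k-NN rule with k chosen by cross-validation. It assumes scikit-learn and NumPy; the synthetic expression matrix, the class labels, and the candidate values of k are made-up choices, not from the original slides.

```python
# A minimal sketch: k-NN with k selected by cross-validated accuracy (scikit-learn).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))       # 60 arrays x 200 genes (synthetic)
y = rng.integers(0, 2, size=60)      # two classes, e.g. ALL vs AML (synthetic)

# Try several values of k and keep the one with the best cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])

# Classify a new array by majority vote among its k nearest neighbors.
x_new = rng.normal(size=(1, 200))
print("predicted class:", search.predict(x_new)[0])
```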
SVM: Support Vector Machines
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. To discriminate between two classes given a training dataset:
◦ Map the data to a higher-dimensional space (the feature space)
◦ Separate the two classes using an optimal linear separator
Key Ideas of SVM: Margins of Linear Separators
(Figure: the maximum margin linear classifier.)
Optimal hyperplane
Support vectors uniquely characterize the optimal hyperplane.
(Figure: the optimal hyperplane with margin ρ; the points lying on the margin are the support vectors.)
Finding the Support Vectors
Lagrange multiplier method for constrained optimization; the data enter the solution only through inner products of vectors.
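For reference, this is the standard (textbook) maximum-margin formulation behind the Lagrange-multiplier remark above; it is not reproduced from the original slide, but it makes explicit why only inner products of the data appear.

```latex
% Standard hard-margin SVM (textbook formulation; not from the original slide).
% Primal: maximize the margin 2/||w|| subject to correct classification:
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n.

% Introducing Lagrange multipliers \alpha_i \ge 0 and eliminating w and b
% gives the dual problem, in which the data appear only as inner products:
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
\quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_{i=1}^{n}\alpha_i y_i = 0.

% The training points with \alpha_i > 0 are the support vectors, and
% w = \sum_i \alpha_i y_i x_i.
```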
Key Ideas of SVM: Feature Space Mapping
Map the original data to some higher-dimensional feature space where the training set is linearly separable:
Φ: x → φ(x), e.g. (x1, x2) → (x1, x2, x1², x2², x1·x2, …)
The “Kernel Trick”
The linear classifier relies on the inner product between vectors: K(xi, xj) = xiᵀxj. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)ᵀφ(xj). A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: for 2-dimensional vectors x = [x1 x2], let K(xi, xj) = (1 + xiᵀxj)². We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2].
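A small numeric check of the identity above, assuming NumPy; the two vectors are arbitrary, and the feature map phi follows the expansion derived on this slide.

```python
# Verify that K(x, z) = (1 + x.z)^2 equals phi(x).phi(z) for 2-D vectors.
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.z)^2 in 2-D."""
    x1, x2 = v
    return np.array([1.0,
                     x1**2,
                     np.sqrt(2) * x1 * x2,
                     x2**2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

x = np.array([0.3, -1.2])
z = np.array([0.8, 0.5])

k_direct = (1.0 + x @ z) ** 2      # kernel evaluated in the original 2-D space
k_mapped = phi(x) @ phi(z)         # inner product in the expanded feature space
print(k_direct, k_mapped)          # the two numbers agree
```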
Examples of Kernel Functions
◦ Linear: K(xi, xj) = xiᵀxj
◦ Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p
◦ Gaussian (radial basis function): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
◦ Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
SVM
Advantages:
◦ maximizes the margin between the two classes in the feature space characterized by a kernel function
◦ robust with respect to high input dimension
Disadvantages:
◦ difficult to incorporate background knowledge
◦ sensitive to outliers
Variable/Feature Selection with SVMs
Recursive Feature Elimination:
◦ Train a linear SVM
◦ Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
◦ Retrain the SVM with the remaining variables and repeat until the desired number of variables remains
Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem, and similar algorithms exist for non-linear SVMs. These are among the best and most efficient variable selection methods.
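A minimal sketch of SVM-based recursive feature elimination, assuming scikit-learn's RFE wrapper around a linear SVM; the synthetic data, the 50% elimination step, and the choice of 20 retained genes are illustrative assumptions, not values from the original slides.

```python
# SVM-RFE sketch: repeatedly drop the genes with the smallest linear-SVM weights.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))        # 60 arrays x 500 genes (synthetic)
y = rng.integers(0, 2, size=60)       # two classes (synthetic)

# A linear SVM provides per-gene weights; RFE removes the lowest-weighted half each round.
selector = RFE(estimator=SVC(kernel="linear"),
               n_features_to_select=20,   # stop when 20 genes remain (arbitrary choice)
               step=0.5)                  # drop 50% of the remaining genes per iteration
selector.fit(X, y)

selected_genes = np.flatnonzero(selector.support_)
print("genes kept:", selected_genes)
```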
Software
A list of SVM implementations can be found at http://www.kernelmachines.org/software.html. Some implementations (such as LIBSVM) can handle multi-class classification. SVMlight and LIBSVM are among the earliest implementations of SVM. Several Matlab toolboxes for SVM are also available.
How to Use SVM to Classify Microarray Data
Prepare the data in the LIBSVM format: each line gives the class label followed by the indices and values of the nonzero features:
<label> <index1>:<value1> <index2>:<value2> ...
Training usage: svm-train [options] training_set_file [model_file]
Example options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
Prediction usage: svm-predict [options] test_file model_file output_file
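A hedged sketch of preparing expression data in LIBSVM's sparse text format from a NumPy matrix; the file names, labels, and expression values are made up for illustration, and the command lines in the comments simply echo the usage shown above.

```python
# Write a (samples x genes) expression matrix in LIBSVM's <label> <index>:<value> format.
import numpy as np

def write_libsvm(filename, X, y):
    """Write each sample (row of X) with its label in LIBSVM format; indices start at 1."""
    with open(filename, "w") as f:
        for label, row in zip(y, X):
            feats = " ".join(f"{j + 1}:{v:g}" for j, v in enumerate(row) if v != 0)
            f.write(f"{int(label)} {feats}\n")

X_train = np.array([[0.46, -0.10, 0.15],     # illustrative expression values
                    [0.30,  0.49, 0.74]])
y_train = [1, -1]                            # e.g. cancer vs. normal (illustrative)
write_libsvm("train.txt", X_train, y_train)

# Then, following the slide's usage:
#   svm-train -s 0 -c 10 -t 1 -g 1 -r 1 -d 3 train.txt model
#   svm-predict test.txt model output.txt
```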
Decision tree classifiers
(Figure: expression values for genes G1, G2, G3, … with class labels 0-2, and a decision tree that splits first on gene 1 (Mi1 < -0.67) and then on gene 2 (Mi2 > 0.18) to assign the class labels.)
Advantage: transparent rules, easy to interpret.
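A minimal sketch of a decision tree on synthetic expression data, assuming scikit-learn; export_text prints the learned thresholds as the kind of transparent, readable rules mentioned above. The gene names, thresholds, and labels are illustrative.

```python
# Fit a shallow decision tree and print its rules as plain text.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))                           # 80 arrays x 5 genes (synthetic)
y = (X[:, 0] < -0.67).astype(int) + (X[:, 1] > 0.18)   # labels 0-2 from two gene thresholds

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Gene1", "Gene2", "Gene3", "Gene4", "Gene5"]))
```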
Ensemble classifiers
(Figure: the training set X1, X2, …, X100 is resampled many times; a classifier is trained on each resample (resample 1 → classifier 1, …, resample 500 → classifier 500), and the classifiers are combined into an aggregate classifier.)
Examples: bagging, boosting, random forests.
Aggregating classifiers: Bagging
(Figure: each resample X*1, X*2, …, X*100 of the training set X1, X2, …, X100 grows a tree (tree 1, …, tree 500); a test sample is classified by every tree and the trees vote, e.g. 90% class 1 vs. 10% class 2.)
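A minimal sketch of bagging and a random forest, assuming scikit-learn; by default BaggingClassifier grows a decision tree on each bootstrap resample, and the trees vote on a new sample as in the figure. The data are synthetic, and the 500 trees mirror the figure's count.

```python
# Bagged trees and a random forest on synthetic expression data.
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 50))        # 100 arrays x 50 genes (synthetic)
y = rng.integers(0, 2, size=100)      # two classes (synthetic)

# Bagging: 500 trees, each grown on a bootstrap resample of the training set.
bag = BaggingClassifier(n_estimators=500).fit(X, y)
x_new = rng.normal(size=(1, 50))
print("vote fractions:", bag.predict_proba(x_new))   # e.g. 90% class 1, 10% class 2
print("aggregated class:", bag.predict(x_new)[0])

# Random forests additionally consider a random subset of genes at each split.
rf = RandomForestClassifier(n_estimators=500).fit(X, y)
print("random forest class:", rf.predict(x_new)[0])
```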
Weka Data Mining Toolbox
The Weka package (Java) includes:
◦ All of the previous classifiers
◦ Neural networks
◦ Projection pursuit
◦ Bayesian belief networks
◦ and more
Feature Selection in Classification
What: select a subset of features.
Why:
◦ Leads to better classification performance by removing variables that are noise with respect to the outcome
◦ May provide useful insights into the biology
◦ Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)
Classifier Performance Assessment
Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the time of the initial classifier-building phase. One needs to estimate future performance based on what is available: often the same set that was used to build the classifier. Performance can be assessed by:
◦ Cross-validation
◦ A held-out test set
◦ Independent testing on a future dataset
Diagram of performance assessment
(Figure: a classifier built on the training set and evaluated on that same training set gives the resubstitution estimate; evaluating it on an independent test set gives the test-set estimate.)
Diagram of performance assessment (with cross-validation)
(Figure: the learning set is repeatedly split into training and test portions; classifiers built on each training portion are evaluated on the corresponding held-out portion (cross-validation), and the final classifier is also evaluated on an independent test set.)
Performance assessment
V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Classifiers are built leaving one subset out; test-set error rates are computed on the left-out subset and averaged.
◦ Bias-variance tradeoff: smaller V can give larger bias but smaller variance
◦ Computationally intensive
Leave-one-out cross-validation (LOOCV) is the special case V = n. It works well for stable classifiers (kNN, LDA, SVM).
Which method to use depends mostly on sample size. If the sample is large enough, split it into test and training groups. If the sample is barely adequate for either testing or training, use leave-one-out. In between, consider V-fold CV: it can give more accurate estimates than leave-one-out, but it reduces the size of the training set.
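A minimal sketch of V-fold and leave-one-out error estimation, assuming scikit-learn; the synthetic data, the linear-SVM classifier, and V = 5 are illustrative choices, not prescriptions from the slides.

```python
# Compare 5-fold cross-validation and LOOCV estimates of classifier error.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 100))        # 40 arrays x 100 genes (synthetic)
y = rng.integers(0, 2, size=40)       # two classes (synthetic)

clf = SVC(kernel="linear")

# V-fold CV: split into V subsets, hold each out in turn, average the error rates.
acc_5fold = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold error estimate:", 1 - acc_5fold.mean())

# LOOCV (V = n): more expensive, but works well for stable classifiers (kNN, LDA, SVM).
acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV error estimate:", 1 - acc_loo.mean())
```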
Summary
◦ The microarray classification task
◦ Classifiers: KNN, SVM, decision trees; software: Weka, LIBSVM
◦ Classifier evaluation and cross-validation
Acknowledgements: Terry Speed, Jean Yee Hwa Yang, Jane Fridlyand