Classifying the Lymphoma Dataset Using Multi-class Support Vector Machines

INFS-795 Advanced Data Mining
Prof. Domeniconi
Presented by Hong Chai

Agenda
(1) Lymphoma Dataset Description
(2) Data Preprocessing
    - Formatting
    - Dealing with Missing Values
    - Gene Selection
(3) Multi-class SVM Classification
    - 1-against-all
    - 1-against-1
(4) Tools
(5) References

Lymphoma Dataset
• Alizadeh et al. (2000), Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling
• Publicly available at http://llmpp.nih.gov/lymphoma/
• In microarray data, gene expression profiles are measured in rows; samples are in columns

Lymphoma Dataset
• 96 samples of lymphocytes (instances)
• 4026 human genes (features)
• 9 classes of lymphoma: DLBCL, GCB, NIL, ABB, RAT, TCL, FL, RBB, CLL
• Glimpse of the data (genes in rows, samples in columns):

            DLCL-0042  DLCL-0031  DLCL-0036  CLL-60  CLL-68  FL-10  FL-11   GCB   NIL-IgM  CLL-65
GENE2406X      0.75       0.12       0.28     0.58    0.37   -1.2    0.37  -0.6     0.37    0.12
GENE3689X      0.01      -0.4       -0.6      0.37    0.09    0.78  -0.8    2.45   -0.45    0.37
GENE3133X      1.43      -0.8        0.37    -0.5     0.37    1.15   0.37  -1.29    1.26    0.67
GENE1008X      0.05       0.64       0.37     0.12    0.63    0.35   0.12   0.37   -2.29   -0.8

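To make the layout concrete, here is a minimal loading sketch in Python; the file name "lymphoma.txt" and the tab-delimited export are assumptions for illustration, not part of the original dataset distribution.

```python
# Minimal sketch: load the expression matrix, assuming a tab-delimited
# file (the name "lymphoma.txt" is hypothetical) with genes in rows
# and samples in columns, as in the glimpse above.
import pandas as pd

expr = pd.read_csv("lymphoma.txt", sep="\t", index_col=0)  # rows = genes
X = expr.T.values           # transpose so rows = samples, columns = genes
sample_ids = expr.columns   # e.g. "DLCL-0042", "CLL-60", "FL-10", ...
print(X.shape)              # expected: (96, 4026)
```
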
Goal
Task: classification
• Assign each patient sample to one of the 9 categories, e.g. Diffuse Large B-cell Lymphoma (DLBCL) or Chronic Lymphocytic Leukemia (CLL).
• Microarray data classification is an alternative to the current classification of malignancies, which relies on morphological or clinical variables.
Medical implications
• Precise categorization of cancers; more relevant diagnosis
• More accurate assignment of cases to high-risk or low-risk categories
• More targeted therapies
• Improved predictability of outcome

Data Preprocessing
Missing Values Imputation
• 3% of the gene expression values are missing
• 1980 of the 4026 genes have missing values, i.e. 49.1% of the genes (features) are affected
• Some of these genes may be highly informative for classification
• Missing values must be handled before the data can be fed to an SVM

Missing Value Approaches
• Instance or feature deletion - acceptable if the dataset is large enough and deletion does not distort the distribution
• Replace with a randomly drawn observed value - shown to work well (http://globain/cse/psu.edu/courses/spring2003/3-norm-val.pdf)
• EM algorithm
• Global mode or mean substitution - will distort the distribution
• Local mode or mean substitution with the KNN algorithm (Prof. Domeniconi)

Local Mean Imputation (KNN)
1. Partition the data set D into two sets:
   • Dm, the instances with missing value(s)
   • Dc, the instances with complete values
2. For each instance vector x ∈ Dm:
   • Divide the vector into observed and missing parts, x = [xo; xm]
   • Calculate the distance between xo and every instance y ∈ Dc, using only the features observed in x
   • From the K closest y's (instances in Dc), calculate the mean of each feature for which x has a missing value, and substitute this local mean
(Note: for nominal features use the mode instead; not applicable to microarray data.)

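A minimal sketch of this procedure using scikit-learn's KNNImputer, which computes distances on the mutually observed features and fills each missing entry with the mean over the K nearest neighbours; the toy matrix is made up for illustration.

```python
# Local-mean (KNN) imputation as sketched above; toy values only.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[0.75,  0.12, np.nan],
              [0.58,  0.37, -1.20],
              [0.37, -0.60,  0.37],
              [0.01, np.nan, 0.78]])

imputer = KNNImputer(n_neighbors=2)   # K = 2 nearest instances
X_filled = imputer.fit_transform(X)   # each NaN replaced by a local mean
```
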
Data Preprocessing
Feature Selection: Motivations
- The number of features is large while the number of instances is small
- Reduce dimensionality to overcome overfitting
- A small number of discriminant "marker" genes may characterize different cancer classes
Example: Guyon et al. identified 2 genes that yield zero leave-one-out error on the leukemia dataset, and 4 genes that give 98% accuracy on the colon cancer dataset (Guyon et al., Gene Selection for Cancer Classification using SVM, 2002).

Feature Selection
Discriminant Score Ranking
• Which gene is more informative in the 2-class case?
(figure: class-wise expression distributions of Gene 1 and Gene 2)

Separation Score
• Gene 1 is more discriminant. Criteria:
  - Large difference between the class means μ+ and μ−
  - Small variance within each class
• Score function: F(gj) = | (μj+ − μj−) / (σj+ + σj−) |

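A short sketch of this score computed for all genes at once; the matrix X (samples × genes) and the label vector y are assumed to exist, and a small epsilon guards against zero denominators.

```python
# Two-class separation score F(g) = |(mu+ - mu-) / (sigma+ + sigma-)|,
# vectorized over all genes; X is samples x genes, y holds the labels.
import numpy as np

def separation_scores(X, y, pos, neg):
    Xp, Xn = X[y == pos], X[y == neg]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    sd_p, sd_n = Xp.std(axis=0), Xn.std(axis=0)
    return np.abs((mu_p - mu_n) / (sd_p + sd_n + 1e-12))  # eps avoids 0/0

# Example ranking (labels assumed): genes sorted by decreasing score
# rank = np.argsort(separation_scores(X, y, "DLBCL", "CLL"))[::-1]
```
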
Separation Score
• In multi-class cases, rank genes that are discriminant among multiple classes
(figure: per-class expression levels for classes C1, C2, C3, separated by a gap Δ)
• A gene may be functionally related to several cancer classes, such as C1 and C2

Separation Score
• Proposed adapted score function. For each gene j:
  1. Calculate the mean μi of gene j within each class Ci
  2. Sort the μi in descending order
  3. Find the cutoff point with the largest gap diff(μi, μi+1)
  4. Let μ+ and σ+ be the mean and standard deviation of the expression values to the left of the cutoff, and μ− and σ− those to the right
  5. Score: F(gj) = | (μ+ − μ−) / (σ+ + σ−) |
• Rank genes by F(gj) and select the top genes via a threshold

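A hedged sketch of this adapted score for a single gene; the variables x (one gene's expression across samples) and y (class labels) are assumptions, and at least two classes are required.

```python
# Adapted multi-class score: sort class means, cut at the largest gap,
# then score the two resulting super-groups with the same F formula.
import numpy as np

def multiclass_score(x, y):
    """x: one gene's expression across samples; y: class label per sample."""
    classes = np.unique(y)                       # assumes >= 2 classes
    means = np.array([x[y == c].mean() for c in classes])
    order = np.argsort(means)[::-1]              # class means, descending
    gaps = means[order][:-1] - means[order][1:]
    cut = np.argmax(gaps) + 1                    # cutoff at the largest gap
    high = np.isin(y, classes[order[:cut]])      # high-expression super-class
    mu_p, sd_p = x[high].mean(), x[high].std()
    mu_n, sd_n = x[~high].mean(), x[~high].std()
    return abs((mu_p - mu_n) / (sd_p + sd_n + 1e-12))

# scores = [multiclass_score(X[:, j], y) for j in range(X.shape[1])]
```
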
Separation Score
Disadvantages:
• Does not yield a compact gene set; the selected genes remain abundant
• Does not consider mutual information between genes

Feature Selection
Recursive Feature Elimination (RFE) with SVM
1. Train a linear SVM on the full feature set:
   Sign(w · x + b)
   where w is a vector of weights, one per feature, x is an input instance, and b is a threshold. If wi = 0, feature Xi does not influence classification and can be eliminated from the feature set.

RFE/SVM
2. Once w is computed for the full feature set, sort the features in descending order of weight. Eliminate the lower half.
3. Build a new linear SVM on the reduced feature set. Repeat the process until the feature set can no longer be halved.
4. Choose the best feature subset.

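scikit-learn's RFE can reproduce this halving schedule directly (step=0.5 removes half of the remaining features each round); X, y, and the target of 16 genes are assumptions for illustration.

```python
# RFE with a linear SVM: drop the lower-weighted half of the features
# each round until the requested number of genes remains.
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0, max_iter=10000)
rfe = RFE(estimator=svm, n_features_to_select=16, step=0.5)
rfe.fit(X, y)                 # X: samples x genes, y: class labels
selected = rfe.support_      # boolean mask over the surviving genes
```
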
Feature Selection
• PCA: not commonly used for microarray data
• Disadvantage: none of the original inputs can be discarded
• We want to retain a minimal subset of informative genes that achieves the best classification performance

Multi-class SVM

Multi-class SVM Approaches
1-against-all
• Each of the SVMs separates a single class from all remaining classes (Cortes and Vapnik, 1995)
1-against-1
• Pairwise: k(k−1)/2 SVMs are trained for k classes, each separating a pair of classes (Friedman, 1996)
Performance is similar in some experiments (Nakajima, 2000)
Time complexity is similar: k evaluations in 1-against-all, k−1 in 1-against-1

1-against-all
• Also called "one-against-rest"; a tree algorithm
• Decomposes the problem into a collection of binary classifications
• k decision functions, one per class: (wk)T·x + bk, k ∈ Y
• The kth classifier constructs a hyperplane between class k and the k−1 other classes
• Class of x = argmaxi { (wi)T·x + bi }

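A minimal sketch of this decision rule with scikit-learn's OneVsRestClassifier; X, y, and X_test are assumed to exist, and the explicit argmax just spells out what predict does internally.

```python
# 1-against-all: k binary SVMs; assign x to the class whose hyperplane
# yields the largest decision value.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
scores = ova.decision_function(X_test)           # one (w_i . x + b_i) per class
pred = ova.classes_[np.argmax(scores, axis=1)]   # argmax_i over the k scores
```
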
1-against-1
• k(k−1)/2 classifiers, each trained on the data from two classes
• For training data from the ith and jth classes, run a binary classification
• Voting strategy: if Sign((wij)T·x + bij) says x is in class i, add one vote to class i; otherwise to class j
• Assign x to the class with the largest vote count ("max wins")

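The pairwise scheme with max-wins voting is likewise available off the shelf; again X, y, and X_test are assumptions.

```python
# 1-against-1: k(k-1)/2 pairwise SVMs; each test point is assigned to
# the class that wins the most pairwise votes.
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
pred = ovo.predict(X_test)   # tallies the pairwise votes, max wins
```
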
Kernels to Experiment With
• Polynomial kernel: K(Xi, Xj) = (Xi · Xj + 1)^d
• Gaussian kernel: K(Xi, Xj) = exp(−‖Xi − Xj‖² / σ²)

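Written out as plain functions (a sketch; the default d and σ values are arbitrary choices, not from the slides):

```python
# The two candidate kernels as plain functions.
import numpy as np

def poly_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

# scikit-learn equivalents: SVC(kernel="poly", degree=d, gamma=1.0, coef0=1.0)
# and SVC(kernel="rbf", gamma=1.0 / sigma**2).
```
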
SVM Tools - Weka
Data Preprocessing
• Convert the data to ARFF format
• Import the file

SVM Tools - Weka
Feature Selection using SVM
• Select Attributes
• SVMAttributeEval

SVM Tools - Weka
Multi-classifier
• Classify
• Meta
• MultiClassifier (handles multi-class datasets with 2-class classifiers)

SVM Tools - Weka
Multi-class SVM
• Classify
• Functions
• SMO (Weka's SVM)

SVM Tools - Weka
Multi-class SVM Options
• Method: 1-against-1 or 1-against-all
• Kernel options not found

Multi-class SVM Tools
Other tools include:
• SVMTorch (1-against-all)
• LibSVM (1-against-1)
• SVMlight

References
• Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, 2000
• Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines, 2000
• Ben-Dor et al., Scoring genes for relevance, 2000
• Franc and Hlavac, Multi-class Support Vector Machines
• Furey et al., Support vector machine classification and validation of cancer tissue samples using microarray expression data, 2000
• Guyon et al., Gene Selection for Cancer Classification using Support Vector Machines, 2002
• Selikoff, The SVM-Tree Algorithm: A New Method for Handling Multi-class SVM, 2003
• Shipp et al., Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning, 2002
• Weston and Watkins, Multi-class Support Vector Machines, Technical Report, 1998