

Supervised gene expression data analysis using SVMs and MLPs
Giorgio Valentini
e-mail: valenti@disi.unige.it

Outline
A real problem: lymphoma gene expression data analysis by machine learning methods:
• Diagnosis of tumors using a supervised approach
• Discovering groups of genes related to carcinogenic processes
• Discovering subgroups of diseases using gene expression data

DNA microarray
DNA hybridization microarrays supply information about gene expression through measurements of the mRNA levels of large numbers of genes in a cell.
They offer a snapshot of the overall functional status of a cell: virtually all differences in cell type or state are related to changes in the mRNA levels of many genes.
DNA microarrays have been used in mutational analyses, genetic mapping studies, genome-wide monitoring of gene expression, pharmacogenomics and metabolic pathway analysis.

A DNA microarray image (E. coli)
• Each spot corresponds to the expression level of a particular gene
• Red spots correspond to overexpressed genes
• Green spots correspond to underexpressed genes
• Yellow spots correspond to intermediate levels of gene expression

Analyzing microarray data by machine learning methods
The large amount of gene expression data requires machine learning methods to analyze DNA microarray data and extract significant knowledge from them.
Unsupervised approach: no or limited a priori knowledge. Clustering algorithms are used to group together similar expression patterns:
• Grouping sets of genes
• Grouping different cells or different functional statuses of the cell
• Examples: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps
Supervised approach: “a priori” biological and medical knowledge on the problem domain. Learning algorithms with labeled examples are used to associate gene expression data with classes:
• Separating normal from cancerous tissues
• Classifying different classes of cells on a functional basis
• Predicting the functional class of unknown genes
• Examples: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers

A real problem: a gene expression analysis of lymphoma
Biological problem 1: Separating cancerous and normal tissues using the overall information available.
Machine learning methods for problem 1:
- Support Vector Machines (SVM): linear, RBF and polynomial kernels
- Multi-Layer Perceptron (MLP)
- Linear Perceptron (LP)
Biological problem 2: Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures.
Machine learning methods for problem 2 (two-step method):
- A priori knowledge and unsupervised methods select “candidate” subgroups
- SVM or MLP identify the most correlated subgroups

The data
• Data of a specialized DNA microarray, named "Lymphochip", developed at the Stanford University School of Medicine
• 4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer: high dimensional data
• 96 tissue samples from normal and cancerous populations of human lymphocytes: small sample size
• A challenging machine learning problem

Types of lymphoma
Three main classes of lymphoma:
• Diffuse Large B-Cell Lymphoma (DLBCL)
• Follicular Lymphoma (FL)
• Chronic Lymphocytic Leukemia (CLL)
The samples also include Transformed Cell Lines (TCL) and normal lymphoid tissues.
Type of tissue          Number of samples
Normal lymphoid cells   24
DLBCL                   46
FL                       9
CLL                     11
TCL                      6

Visualizing data with Tree View

The first problem: Separating normal from cancerous tissues
Our first task consists in distinguishing cancerous from normal tissues using the overall information available, i.e. all the gene expression data.
From a machine learning standpoint it is a dichotomic (two-class) problem.
Data characteristics:
• Small sample size
• High dimension
• Missing values
• Noise
Main applicative goal: supporting functional-molecular diagnosis of tumors and polygenic diseases

Supervised approaches to molecular classification of diseases
Several supervised methods have been applied to the analysis of cDNA microarrays and high density oligonucleotide chips:
• Decision trees
• Fisher linear discriminant
• Linear discriminant analysis
• Multi-Layer Perceptrons
• Parzen windows
• Nearest-Neighbours classifiers
• Support Vector Machines
Proposed by different authors: Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001), Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001), Dudoit et al. (2002).

Why use Support Vector Machines?
“General” motivations:
• SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.
• They act as linear classifiers in a high dimensional feature space originated by a projection of the original input space.
• The resulting classifier is in general non-linear in the input space.
• SVMs achieve good generalization performance by maximizing the margin between the classes.
• The SVM learning algorithm has no local minima.
“Specific” motivations:
• Kernels are well-suited to working with high dimensional data.
• Small sample sizes require algorithms with good generalization capabilities.
• Automatic diagnosis of tumors requires high sensitivity and very effective classifiers.
• SVMs can identify mislabeled data (i.e. incorrect diagnoses).
• We could design specific kernels to incorporate “a priori” knowledge about the problem.
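The statement that an SVM acts as a linear classifier in a feature space reached through a kernel can be illustrated with a small sketch. The homogeneous 2nd-degree polynomial kernel and the explicit map `phi` used here are illustrative choices for 2-D inputs, not something taken from the slides; the point is only that the kernel value computed in the input space equals a plain dot product in the feature space.

```python
import math

def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def poly2_kernel(x, y):
    """Homogeneous 2nd-degree polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs with phi(x) . phi(y) = (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Two arbitrary 2-D points (hypothetical values, for illustration only).
x = (1.0, 2.0)
y = (3.0, -1.0)

k_implicit = poly2_kernel(x, y)    # computed in the 2-D input space
k_explicit = dot(phi(x), phi(y))   # computed in the 3-D feature space
```

Both quantities agree up to floating-point rounding, so a linear separator in the 3-D feature space corresponds to a quadratic decision surface in the original 2-D space.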

SVM to classify cancerous and normal cells
We consider 3 standard SVM kernels:
• Gaussian
• Polynomial
• Dot-product
Varying:
• Values of the kernel parameters
• The regularization factor C
Comparing them with:
• MLP
• LP
Varying:
• Number of hidden units
• Backpropagation parameters
Estimation of the generalization error through:
• 10-fold cross-validation
• Leave-one-out
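A minimal sketch of the ingredients listed on this slide, in plain Python. The parameter names (`degree`, `coef0`, `sigma`) and the exact kernel parameterizations are assumptions, since the slides do not give the formulas; the fold generator uses contiguous, unshuffled folds purely for simplicity.

```python
import math

def dot_kernel(x, y):
    """Dot-product (linear) kernel."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0)^degree; parameterization assumed."""
    return (dot_kernel(x, y) + coef0) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2)); parameterization assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 96 samples, as in the Lymphochip data set.
ten_fold = list(k_fold_splits(96, 10))
leave_one_out = list(k_fold_splits(96, 96))  # leave-one-out is k-fold with k = n
```

In practice the sample order would normally be shuffled (and possibly stratified by class) before splitting; that step is omitted here.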

Results
Learning machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear                1.04             3.16       98.63       100.0
SVM-poly                  4.17             5.46       94.74       100.0
SVM-RBF                  25.00             4.48       75.00       100.0
MLP                       2.08             4.45       98.61
LP                        9.38            10.24       95.65       91.66
• 10-fold cross-validation and leave-one-out give approximately the same estimate of the error.
• SVM-linear achieves the best results.
• Sensitivity is high no matter what type of kernel function is used.
• The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
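The columns of the results table can be related to confusion-matrix counts with a short sketch. The definitions used here (precision = TP / (TP + FP), sensitivity = TP / (TP + FN), cancerous as the positive class) are the standard ones and are assumed, since the slides do not spell them out; counting the TCL samples as cancerous is also an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Error, precision and sensitivity (all in percent) from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "error": 100.0 * (fp + fn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }

# Sample counts from the data set description.
n_cancerous = 46 + 9 + 11 + 6   # DLBCL + FL + CLL + TCL (TCL counted as cancerous: assumption)
n_normal = 24

# A degenerate classifier that labels every sample as cancerous.
always_positive = classification_metrics(tp=n_cancerous, fp=n_normal, tn=0, fn=0)
```

Under these assumptions the degenerate classifier yields 25% error, 75% precision and 100% sensitivity, which appear to be the figures reported for SVM-RBF: perfect sensitivity obtained only by never recognizing a normal sample.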

ROC analysis
• The ROC curve of the SVM-linear is ideal.
• The polynomial SVM also achieves a reasonably good ROC curve.
• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to correctly detect normal cells.
• The ROC curve of the MLP is also nearly optimal.
• The linear perceptron shows a worse ROC curve, but with reasonable values lying on the highest and leftmost part of the ROC plane.
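As a rough illustration of how an ideal and a diagonal ROC curve arise, the sketch below builds ROC points from classifier scores by sweeping a decision threshold. The scores and labels are toy values invented for the example, not data from the study, and the trapezoidal area computation is one common convention.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), threshold swept high to low.

    labels: 1 = positive (cancerous), 0 = negative (normal).
    Samples with tied scores are added in a single step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

labels = [1, 1, 1, 0, 0]                                   # toy labels
perfect = roc_points([0.9, 0.8, 0.7, 0.2, 0.1], labels)    # positives always score higher
constant = roc_points([0.5] * 5, labels)                   # same score for every sample
```

The perfectly ranked scores trace the ideal curve through the top-left corner, while a classifier that gives every sample the same score can only move along the diagonal from (0, 0) to (1, 1).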

Summary of the results on the first problem
• Using hierarchical clustering, 14.6% of the examples are misclassified (Alizadeh, 2000), against 1.04% for the linear SVM, 2.08% for the MLP and 9.38% for the LP.
• Supervised methods exploit a priori biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labeled data.
• Linear SVMs achieve the best results, but the MLP and the 2nd degree polynomial SVM also show a relatively low generalization error.
• Linear SVMs and MLPs can be used to build classifiers with high sensitivity and a low rate of false positives.
• These results must be considered with caution because the available data set is too small to infer general statements about the performance of the proposed learning machines.

The second problem: Identifying DLBCL subgroups
It starts from a hypothesis of Alizadeh et al. about the existence of two distinct functional types of lymphoma inside DLBCL.
• They identified two subgroups of molecularly distinct DLBCL: germinal centre B-like (GCB-like) and activated B-like cells (AB-like).
• These two classes correspond to patients with very different prognoses.
Actually, we consider two problems:
1. Validation of Alizadeh’s hypothesis
2. Finding groups of genes mostly related to this separation
Different subsets of genes could be responsible for the distinction of these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).

A feature selection approach based on “a priori” knowledge
Finding the most correlated genes involves an exponential number of gene combinations (2^n - 1), where n is usually of the order of thousands.
We need greedy algorithms and heuristic methods.
Can we exploit “a priori” biological knowledge about the problem?
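To give a sense of scale for the count of non-empty gene subsets, the following sketch evaluates it for the 4026 genes of the Lymphochip and checks the formula by brute-force enumeration on a small case. The helper name `count_nonempty_subsets` is hypothetical.

```python
from itertools import combinations

def count_nonempty_subsets(n):
    """Count non-empty subsets of n items by explicit enumeration (small n only)."""
    return sum(1 for r in range(1, n + 1) for _ in combinations(range(n), r))

n_genes = 4026
n_subsets = 2 ** n_genes - 1     # number of non-empty subsets of the genes
n_digits = len(str(n_subsets))   # how many decimal digits that number has

small_check = count_nonempty_subsets(10)  # brute-force value to compare with 2**10 - 1
```

The resulting number has more than a thousand decimal digits, so exhaustive search over gene subsets is out of the question and some restriction of the search space is unavoidable.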

A heuristic method (1)
A two-stage approach:
I. Select groups of coordinately expressed genes.
II. Identify among them the ones most correlated to the disease.
• We do not consider single genes.
• We consider only groups of coordinately expressed genes.

A heuristic method (2)
I. Selecting groups of coordinately expressed genes:
• Use “a priori” biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes
And/or
• Use unsupervised methods such as clustering algorithms to identify coordinately expressed sets of genes
II. Identifying subgroups of genes most related to the disease:
1. Train a set of classifiers using only the subgroups of genes selected in the first stage.
2. Evaluate and rank the performance of the trained classifiers.
3. Select the subgroups whose corresponding classifiers achieve the best ranking.
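The second stage of the heuristic (train one classifier per candidate group, rank the groups by estimated error) might be sketched as follows. This is not the authors' implementation: a nearest-centroid classifier stands in for the SVM, MLP and LP, the data are synthetic, and the group names `group_A`, `group_B`, `group_C` are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_error(samples, labels):
    """Leave-one-out error of a nearest-centroid classifier (stand-in for SVM/MLP/LP)."""
    errors = 0
    for i in range(len(samples)):
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        cents = {c: centroid([s for s, l in train if l == c]) for c in set(labels)}
        predicted = min(cents, key=lambda c: sq_dist(samples[i], cents[c]))
        errors += predicted != labels[i]
    return errors / len(samples)

def rank_gene_groups(data, labels, groups):
    """Score each candidate gene group by leave-one-out error, best group first."""
    scores = []
    for name, gene_idx in groups.items():
        subset = [[row[g] for g in gene_idx] for row in data]
        scores.append((loo_error(subset, labels), name))
    return sorted(scores)

# Synthetic data: 40 samples, 15 "genes"; only the first 5 genes depend on the class.
rng = random.Random(0)
labels = [0] * 20 + [1] * 20
data = []
for label in labels:
    informative = [rng.gauss(3.0 * label, 1.0) for _ in range(5)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(10)]
    data.append(informative + noise)

# Hypothetical candidate groups, as if produced by stage I.
groups = {"group_A": range(0, 5), "group_B": range(5, 10), "group_C": range(10, 15)}
ranking = rank_gene_groups(data, labels, groups)
```

On this synthetic data the group built from the class-dependent genes ends up first in `ranking`, which is the behaviour the heuristic relies on: groups whose classifiers generalize best are taken as the ones most related to the class separation.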

Applying the heuristic method
1. Selecting “candidate” subgroups of genes: we used biological knowledge and hierarchical clustering algorithms to select four subgroups:
• Proliferation: sets of genes involved in the biological process of proliferation
• T-cell: genes preferentially expressed in T-cells
• Lymphnode: sets of genes normally expressed in lymph nodes
• GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny
2. Identifying the subgroups of genes most related to the separation GCB-like / AB-like:
• Training of SVM, MLP and LP as classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks
• Leave-one-out used with Gaussian, polynomial and linear SVM
• 10-fold cross-validation with Gaussian, polynomial and linear SVM, MLP and LP

GCB signature
Learn. machine model Gen. error St. dev. Prec. Sens.
SVM-linear 10.50 11.16 90.00
SVM-poly SVM-RBF 8.70 4.50 14.54 9.55 96.67 88.33 100.0 90.00 8.70 10.50
All signatures
Learn. machine model Gen. error St. dev.
SVM-linear 15.00 11.16
SVM-poly 14.00 18.97
SVM-RBF 10.00 10.54 90.90
MLP LP 8.70 10.87 13.28 14.28
Prec. 85.00 93.33 100.00
Sens. 85.00 76.67 95.00 86.36 86.96 90.90

Results

The second problem: summary
• The results support the hypothesis of Alizadeh about the existence of two distinct subgroups in DLBCL.
• The heuristic method identifies the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.

Developments
I. Methods to discover subclasses of tumors on a molecular basis.
II. Methods to identify small subsets of genes correlated to tumors.
Integrating “a priori” biological knowledge, supervised machine learning methods and unsupervised clustering methods:
- Refinements of the proposed heuristic method using clustering algorithms with semi-automatic selection of the number of significant subgroups of genes.
- Greedy algorithms based on mutual information measures.
• Discovery of new subclasses of tumors
• Stratifying patients into molecularly relevant categories, enhancing the discrimination power and precision of clinical trials
• Enhancing biological knowledge about tumoral processes
• New perspectives on the development of new cancer therapeutics based on a molecular understanding of the cancer phenotype
• Automatic diagnosis of tumors using DNA microchips