Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring

Overview
• Motivation
• Microarray Background
• Our Test Case
• Class Prediction
• Class Discovery

Motivation
• Importance of cancer classification
• Cancer classification has historically relied on specific biological insights
• We will discuss a systematic and unbiased approach for recognizing tumor subtypes

Microarray Background
• Microarrays enable simultaneous measurement of the expression levels of thousands of genes in a sample
• Microarray:
  – Glass slide with a matrix of thousands of spots printed onto it
  – Each spot contains probes which bind to a specific gene

Microarray Background (cont.)
• The process:
  – DNA samples are taken from the test subjects
  – Samples are dyed with fluorescent colors and placed on the microarray
  – Hybridization of DNA and cDNA
• The result:
  – Spots in the array are dyed in shades of red to green

Microarray Background (cont.)

          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09

• Microarray data is translated into an n x p table (p – number of genes, n – number of samples)
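As a rough illustration of the table layout, the sketch below stores the four example genes as rows (the way the example table is drawn) and then transposes them into the n x p orientation, with one row per sample. The variable names are hypothetical; only the numbers come from the example table.

```python
# Expression values from the example table: one row per gene, one column per sample.
gene_names = ["Gene 1", "Gene 2", "Gene 3", "Gene 4"]
sample_names = ["Sample 1", "Sample 2"]
genes_by_samples = [
    [1.04, 2.08],
    [3.2, 10.5],
    [3.34, 1.05],
    [1.85, 0.09],
]

# Transpose so that each row is a sample and each column is a gene (n x p).
samples_by_genes = [list(column) for column in zip(*genes_by_samples)]

n = len(samples_by_genes)      # number of samples
p = len(samples_by_genes[0])   # number of genes

# Look up one measurement: Gene 2 in Sample 2.
value = samples_by_genes[sample_names.index("Sample 2")][gene_names.index("Gene 2")]
print(n, p, value)
```

In the test case described later, the same layout would have n = 38 rows and p = 6817 columns.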

Demonstration
• http://www.bio.davidson.edu/courses/genomics/chip.html

Our Test Case
• 38 bone marrow samples from acute leukemia patients (27 ALL, 11 AML)
• RNA from the samples was hybridized to microarrays containing probes for 6817 human genes
• For each gene, an expression level was obtained

Class Prediction
• Initial collection of samples belonging to known classes
• Goal: create a “class predictor” to classify new samples
  – Look for “informative genes”
  – Make a prediction based on these genes
  – Test the validity of the predictor

Informative genes
• Genes whose expression pattern is strongly correlated with the class distinction
[Figure: example genes labeled “strongly correlated” and “poorly correlated”]
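The slide does not say how "correlation with the class distinction" is measured. A minimal sketch, assuming the signal-to-noise statistic (difference of class means divided by the sum of class standard deviations) and using made-up toy values, might look like this. The function name, the use of the sample standard deviation, and the data are all illustrative assumptions.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels, pos="AML", neg="ALL"):
    """Assumed correlation measure: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    pos_values = [v for v, label in zip(values, labels) if label == pos]
    neg_values = [v for v, label in zip(values, labels) if label == neg]
    return (mean(pos_values) - mean(neg_values)) / (stdev(pos_values) + stdev(neg_values))

# Toy data (not from the study): six samples, two genes.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
strong_gene = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0]   # clearly higher in AML
weak_gene = [5.0, 1.0, 9.0, 4.0, 2.0, 8.0]       # no clear class pattern

strong_score = signal_to_noise(strong_gene, labels)
weak_score = signal_to_noise(weak_gene, labels)
print(strong_score, weak_score)
```

A large absolute score marks a gene as a candidate informative gene; a score near zero corresponds to the "poorly correlated" case.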

Neighborhood Analysis
• Are the observed correlations stronger than would be expected by chance?
• C represents the AML/ALL class distinction
• C* is a random permutation of C, representing a random class distinction
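The comparison between C and its random permutations C* can be sketched as a permutation test. The version below is a simplification under stated assumptions: it uses a signal-to-noise score, a single fixed threshold on the absolute score, and toy data, none of which are specified on the slide.

```python
import random
from statistics import mean, stdev

def score(values, labels):
    """Assumed per-gene score: (mean_AML - mean_ALL) / (sd_AML + sd_ALL)."""
    aml = [v for v, label in zip(values, labels) if label == "AML"]
    all_ = [v for v, label in zip(values, labels) if label == "ALL"]
    return (mean(aml) - mean(all_)) / (stdev(aml) + stdev(all_))

def count_informative(expression, labels, threshold):
    """Number of genes whose absolute score reaches the threshold."""
    return sum(1 for values in expression if abs(score(values, labels)) >= threshold)

def neighborhood_analysis(expression, labels, threshold, n_permutations=200, seed=0):
    """Compare the count for the real distinction C with counts for permutations C*."""
    rng = random.Random(seed)
    observed = count_informative(expression, labels, threshold)
    null_counts = []
    for _ in range(n_permutations):
        shuffled = labels[:]          # C* starts as a copy of C
        rng.shuffle(shuffled)         # and is randomly permuted
        null_counts.append(count_informative(expression, shuffled, threshold))
    return observed, null_counts

# Toy data (not from the study): one list of 8 values per gene.
labels = ["ALL"] * 4 + ["AML"] * 4
expression = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1],   # informative
    [6.0, 6.4, 5.9, 6.1, 2.0, 2.2, 1.9, 2.3],   # informative
    [3.0, 3.5, 2.8, 3.1, 7.0, 6.6, 7.4, 7.1],   # informative
    [2.0, 5.0, 3.0, 6.0, 2.5, 5.5, 3.5, 6.5],   # noise
    [4.0, 1.0, 4.5, 1.5, 1.2, 4.2, 1.4, 4.4],   # noise
    [7.0, 3.0, 5.0, 9.0, 8.0, 4.0, 6.0, 2.0],   # noise
]

observed, null_counts = neighborhood_analysis(expression, labels, threshold=2.0)
fraction_reaching_observed = sum(c >= observed for c in null_counts) / len(null_counts)
print(observed, fraction_reaching_observed)
```

If only a small fraction of the permuted distinctions yields as many high-scoring genes as the real one, the observed correlations are unlikely to be due to chance.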

Application to the Test Case
• Roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance

Make a Prediction
• Use a fixed subset of “informative genes” (most correlated with the class distinction)
• Make a prediction on the basis of the expression level of these genes in a new sample

Prediction Algorithm
• Each gene G_i votes, depending on whether its expression level X_i in the sample is closer to µ_AML or µ_ALL (the gene's mean expression level in the AML and ALL samples)
• The magnitude of the vote is W_i V_i
  – W_i reflects how well the gene is correlated with the class distinction
  – V_i reflects the deviation of X_i from the average of µ_AML and µ_ALL

Prediction Algorithm (cont.)
• The votes for each class are summed to obtain total votes V_AML and V_ALL

Prediction Algorithm (cont.)
• The prediction strength (PS) is calculated from the vote totals
• The sample is assigned to the winning class provided that the PS exceeds a predetermined threshold (0.3 in the test case)
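The voting scheme can be sketched as follows. The formulas are assumptions not shown on the slides: W_i is taken to be the signal-to-noise score, V_i the signed distance of X_i from the midpoint of the two class means, and PS = (V_win - V_lose) / (V_win + V_lose). The function names and the training values are illustrative only.

```python
from statistics import mean, stdev

def train_gene(values, labels):
    """Return (W_i, midpoint) for one gene from its training expression values."""
    aml = [v for v, label in zip(values, labels) if label == "AML"]
    all_ = [v for v, label in zip(values, labels) if label == "ALL"]
    mu_aml, mu_all = mean(aml), mean(all_)
    weight = (mu_aml - mu_all) / (stdev(aml) + stdev(all_))   # assumed form of W_i
    midpoint = (mu_aml + mu_all) / 2                          # average of the class means
    return weight, midpoint

def predict(model, sample, threshold=0.3):
    """Sum weighted votes per class and return (call, prediction strength)."""
    v_aml = v_all = 0.0
    for (weight, midpoint), x in zip(model, sample):
        vote = weight * (x - midpoint)    # W_i * V_i; positive means a vote for AML
        if vote > 0:
            v_aml += vote
        else:
            v_all += -vote
    total = v_aml + v_all
    ps = abs(v_aml - v_all) / total if total else 0.0
    winner = "AML" if v_aml > v_all else "ALL"
    return (winner if ps > threshold else "uncertain"), ps

# Toy training data (not from the study): one list of values per informative gene.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
training = [
    [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # higher in AML
    [8.0, 9.0, 10.0, 2.0, 3.0, 4.0],    # higher in ALL
]
model = [train_gene(values, labels) for values in training]

clear_call = predict(model, [10.0, 3.0])      # both genes vote AML
borderline_call = predict(model, [7.0, 7.2])  # genes disagree, votes nearly cancel
print(clear_call, borderline_call)
```

When the two genes disagree and their votes nearly cancel, the PS falls below the 0.3 threshold and no call is made, which corresponds to the "uncertainties" reported for the test case.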

Testing the Validity of Class Predictors
• Cross validation:
  – Withhold a sample
  – Build a predictor based on the remaining samples
  – Predict the class of the withheld sample
  – Repeat for each sample
• Assess accuracy on an independent set of samples
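The cross-validation loop above is leave-one-out cross validation. A minimal sketch, assuming a simplified weighted-voting predictor over all genes (no gene selection and no prediction-strength threshold) and toy data, could look like this; the helper names are hypothetical.

```python
from statistics import mean, stdev

def build_predictor(samples, labels):
    """Fit one (weight, midpoint) pair per gene; samples is a list of rows (one per sample)."""
    model = []
    for g in range(len(samples[0])):
        aml = [row[g] for row, label in zip(samples, labels) if label == "AML"]
        all_ = [row[g] for row, label in zip(samples, labels) if label == "ALL"]
        weight = (mean(aml) - mean(all_)) / (stdev(aml) + stdev(all_))
        model.append((weight, (mean(aml) + mean(all_)) / 2))
    return model

def classify(model, sample):
    """Positive total vote means AML, otherwise ALL."""
    total = sum(weight * (x - midpoint) for (weight, midpoint), x in zip(model, sample))
    return "AML" if total > 0 else "ALL"

def leave_one_out_accuracy(samples, labels):
    correct = 0
    for i in range(len(samples)):
        # Withhold sample i and build the predictor on the remaining samples.
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = build_predictor(train_samples, train_labels)
        # Predict the class of the withheld sample.
        if classify(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy data (not from the study): 8 samples x 2 genes.
samples = [
    [1.0, 6.0], [1.2, 6.4], [0.9, 5.9], [1.1, 6.1],
    [5.0, 2.0], [5.3, 2.2], [4.8, 1.9], [5.1, 2.3],
]
labels = ["ALL"] * 4 + ["AML"] * 4

accuracy = leave_one_out_accuracy(samples, labels)
print(accuracy)
```

In a full pipeline, any selection of informative genes would also need to be repeated inside the loop, so that the withheld sample never influences the predictor that classifies it.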

Application to the Test Case
• The 50 genes most highly correlated with the AML-ALL distinction were chosen
• A class predictor based on these genes was built

Application to the Test Case
• Performance in cross validation:
  – Out of 38 samples there were 36 predictions and 2 uncertainties (PS < 0.3)
  – 100% accuracy
  – PS median 0.77

Application to the Test Case (cont.)
• Performance on an independent set of samples:
  – Out of 34 samples there were 29 predictions and 5 uncertainties (PS < 0.3)
  – 100% accuracy
  – PS median 0.73

Comments
• Why 50 genes?
  – Large enough to be robust against noise
  – Small enough to be readily applied in a clinical setting
  – Predictors based on between 10 and 200 genes all performed well
• Genes useful for cancer class prediction may also provide insight into cancer pathogenesis and pharmacology

Comments (cont.)
• Creation of a new predictor involves expression analysis of thousands of genes
• Application of the predictor then requires monitoring the expression level of only a few informative genes

Class Discovery
• Cluster tumors by gene expression
  – Apply a clustering technique to produce presumed classes
• Evaluation of the classes:
  – Are the classes meaningful?
  – Do they reflect true structure?

Clustering Technique - SOMs
• SOMs – Self-Organizing Maps
• Well suited for identifying a small number of prominent classes
  – Find an optimal set of “centroids”
  – Partition the data set according to the centroids
  – Each centroid defines a cluster consisting of the data points nearest to it
• We won't go into details about the calculation of SOMs
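Although the calculation is not covered here, the centroid idea can be illustrated with a heavily simplified two-node SOM. Everything in this sketch is an assumption made for illustration: the deterministic initialization from the first and last data points, the linearly decaying learning rate, the small neighbor update, and the toy data.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(nodes, point):
    """Index of the node (centroid) closest to the point."""
    return min(range(len(nodes)), key=lambda k: squared_distance(nodes[k], point))

def train_two_node_som(points, epochs=30, start_rate=0.5, neighbor_factor=0.1):
    # Hypothetical deterministic initialization: first and last data points.
    nodes = [list(points[0]), list(points[-1])]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # learning rate shrinks over time
        for point in points:
            winner = nearest(nodes, point)
            for k in range(len(nodes)):
                rate = start_rate * decay
                if k != winner:
                    # The neighboring node is also pulled, but much more weakly.
                    rate *= neighbor_factor * decay
                nodes[k] = [c + rate * (x - c) for c, x in zip(nodes[k], point)]
    return nodes

# Toy data (not from the study): 8 samples described by 2 expression values each.
points = [
    (1.0, 6.0), (1.2, 6.4), (0.9, 5.9), (1.1, 6.1),
    (5.0, 2.0), (5.3, 2.2), (4.8, 1.9), (5.1, 2.3),
]

nodes = train_two_node_som(points)
clusters = [nearest(nodes, p) for p in points]   # partition by nearest centroid
print(clusters)
```

Each sample is assigned to the cluster of its nearest centroid, which is the partitioning step described in the bullets above.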

Application of a two-cluster SOM to the test case
• Class A1: 24 ALL, 1 AML
• Class A2: 10 AML, 3 ALL
• Quite effective at automatically discovering the two types of leukemia
• Not perfect

Evaluation of the Classes
• How can we evaluate such classes if the “right” answer is not already known?
• Hypothesis: class discovery can be tested by class prediction
  – If the classes reflect true structure, then a class predictor based on them should perform well
• Let’s test this hypothesis...

Validity of Predictors Based on A1 and A2
• Predictors based on different numbers of informative genes performed well
• For example: a 20-gene predictor

Validity of Predictors Based on A1 and A2 (cont.)
• Performance on independent samples:
  – PS median 0.61
  – Prediction made for 74% of samples

Validity of Predictors Based on A1 and A2 (cont.)
• Performance in cross validation:
  – 34 accurate predictions with high prediction strength
  – One error
  – Three uncertain predictions

[Figure labels: the one cross-validation error; 2 of the 3 uncertain cross-validation predictions]

Iterative Procedure
• Use a SOM to initially cluster the data
• Construct a predictor
• Remove samples that are not correctly predicted in cross-validation
• Use the remaining samples to generate an improved predictor
• Test on an independent data set
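The middle steps of this procedure (build a predictor, drop samples that fail cross-validation, rebuild) can be sketched as below. The cluster labels are assumed to come from an earlier clustering step that is not shown, the predictor is a simplified weighted-voting scheme, and the data and helper names are hypothetical.

```python
from statistics import mean, stdev

CLASSES = ("A1", "A2")

def build_predictor(samples, labels):
    """One (weight, midpoint) pair per gene; a positive weight favors class A2."""
    model = []
    for g in range(len(samples[0])):
        a1 = [row[g] for row, label in zip(samples, labels) if label == CLASSES[0]]
        a2 = [row[g] for row, label in zip(samples, labels) if label == CLASSES[1]]
        weight = (mean(a2) - mean(a1)) / (stdev(a1) + stdev(a2))
        model.append((weight, (mean(a1) + mean(a2)) / 2))
    return model

def classify(model, sample):
    total = sum(weight * (x - midpoint) for (weight, midpoint), x in zip(model, sample))
    return CLASSES[1] if total > 0 else CLASSES[0]

def refine(samples, labels):
    """Keep samples predicted correctly in leave-one-out CV, then rebuild the predictor."""
    kept = []
    for i in range(len(samples)):
        model = build_predictor(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        if classify(model, samples[i]) == labels[i]:
            kept.append(i)
    improved = build_predictor([samples[i] for i in kept], [labels[i] for i in kept])
    return kept, improved

# Toy data (not from the study): 9 samples x 2 genes; sample 4 looks like an A2
# sample but was placed in A1 by the (assumed) initial clustering.
samples = [
    [1.0, 6.0], [1.2, 6.4], [0.9, 5.9], [1.1, 6.1], [5.2, 2.1],
    [5.0, 2.0], [5.3, 2.2], [4.8, 1.9], [5.1, 2.3],
]
labels = ["A1", "A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2"]

kept, improved = refine(samples, labels)
print(kept, classify(improved, samples[4]))
```

The final step, testing the improved predictor on an independent data set, is omitted because it needs samples that were not used in clustering or training.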

Validity of Predictors Based on Random Clusters
• Performance:
  – Poor accuracy in cross validation
  – Low PS on independent samples

Conclusion
• The AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge

Application of a 4-cluster SOM to the Test Case

Evaluation of the Classes
• Complement approach:
  – Construct class predictors to distinguish each class from its complement
• Pair-wise approach:
  – Construct class predictors to distinguish between each pair of classes C_i, C_j
  – Perform cross validation only on samples in C_i and C_j
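The two approaches differ mainly in which samples and labels each predictor sees. The sketch below only builds those binary tasks; the cluster assignment is a made-up example using the class names B1 to B4, and the function names are hypothetical.

```python
from itertools import combinations

def complement_tasks(cluster_labels):
    """One task per class: every sample is relabeled as 'C' or 'not C'."""
    tasks = {}
    for c in sorted(set(cluster_labels)):
        tasks[c] = [label if label == c else "not " + c for label in cluster_labels]
    return tasks

def pairwise_tasks(cluster_labels):
    """One task per pair (C_i, C_j): indices of the samples belonging to either class."""
    tasks = {}
    for ci, cj in combinations(sorted(set(cluster_labels)), 2):
        tasks[(ci, cj)] = [i for i, label in enumerate(cluster_labels) if label in (ci, cj)]
    return tasks

# Hypothetical cluster assignment for 8 samples (not the study's data).
cluster_labels = ["B1", "B2", "B3", "B4", "B1", "B3", "B4", "B2"]

complement = complement_tasks(cluster_labels)
pairwise = pairwise_tasks(cluster_labels)
print(len(complement), len(pairwise), pairwise[("B3", "B4")])
```

With four classes this gives four complement tasks and six pair-wise tasks; cross validation for a pair such as (B3, B4) then runs only over the samples listed for that pair.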

Evaluation of the Classes
• Class predictors distinguished the classes from one another, with the exception of B3 versus B4

Conclusion
• The results suggest the merging of classes B3 and B4
• The distinction corresponding to AML, B-ALL and T-ALL was confirmed

Uses of Class Discovery
• Identify fundamental subtypes of any cancer
• Search for fundamental mechanisms that cut across distinct types of cancers

Questions?
• Thank you for listening