Integrating in Vitro Drug Sensitivity and Genomics Data

Integrating in Vitro Drug Sensitivity and Genomics Data for Identification of Novel Drug Pathway Associations
Cong Li and Ray Liu
Yale University and Takeda Pharmaceutical Inc.
May 19, 2015
Presented at MBSW, Muncie, IN, USA

Introduction
[Diagram: drug, cell line, gene; change in gene expression; IC50]
• Interests: indication selection; patient selection
• Experiments: drug response assays on cell lines; microarray assays
• Data: IC50 and microarray gene expression
• Current analysis practice: stepwise and test-based
• Our goal: develop a method that analyzes the available data jointly and incorporates biological information

Microarrays

Questions Often Asked
• Design issues
• Which genes are differentially expressed between the conditions?
• Which genes can be used to classify/predict? How?
• Can biological networks be inferred from these data?
• What are the biological stories in the data?

Drug Pathway Questions
• The current drug development framework typically considers the effect of a compound on a single target
• Pathway-based approaches to drug discovery consider the therapeutic effects of compounds in the global physiological environment
• For many compounds, the target pathways and mechanisms of action are still unknown
• How can we infer the target pathways of drugs?

Motivating Data Sets
http://www.broadinstitute.org/ccle/
• Gene expression data: Affymetrix U133+2 arrays, mapped to ~19,000 genes across over 1000 cancer cell lines; among them, 480 cell lines have available drug response data. Use genes included in two lists: (1) 766 cancer-related genes (Chen et al., 2008); (2) 8919 genes from the Integrated Druggable Genome Database (IDGD) Project (Hopkins and Groom, 2002; Russ and Lampel, 2005).
• Pathway association information: retrieved from the KEGG MEDICUS database (Kanehisa et al., 2010). 58 pathways that are either known to be related to cancer or have drug targets. Among the genes selected in step (1), 1863 genes are covered by these 58 pathways and constitute the final list of genes in our real data analysis.
• Drug response data: 24 drugs annotated in the CancerResource database (Ahmed et al., 2011). Response measure: log(Activity Area). 22 drugs have known targets covered by the 58 pathways.

Overview of the 22 drugs

Activity Area (shaded area)
Activity area is a combined measure of both drug potency and drug efficacy, whereas GI50 only measures drug potency.

Data Format
• Drug sensitivity values (e.g. Activity Area or GI50): one row per cell line (cell line 1, cell line 2, cell line 3, ...) and one column per drug (drug 1, drug 2, drug 3, ...)
• Basal gene expression levels (before drug treatment): one row per cell line and one column per gene (gene 1, gene 2, gene 3, gene 4, ...)

Model Description
Spike-and-slab mixture prior (West, 2003) for the factor loading matrices W1 and W2 to impose sparsity and utilize prior knowledge on gene-pathway and drug-pathway associations (matrices L1 and L2).

• Instead of adopting a full Bayesian treatment, we use the following integrative Penalized Matrix Decomposition (iPaD) framework
• Note the notation differences from iFad:
– Y(1) is the drug response profile matrix
– Y(2) is the gene expression profile matrix
– X is the pathway activity level matrix
– B(1) and B(2) are the pathway loading matrices for drug responses and gene expressions, respectively
• The indexes of the non-zero elements in B(2) are known and denoted by Γ
• The major interest is to find the non-zero elements in B(1)
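The objective function on this slide did not survive in the transcript. One plausible form, consistent with the later slides (a LASSO problem for each column of B(1), an OLS problem for each column of B(2), and a projection step for X), is sketched below. The 1/2 scaling, the index convention for Γ, and the constraint set 𝒞 on X are assumptions, not taken from the slides.

```latex
\min_{X,\;B^{(1)},\;B^{(2)}} \;
  \tfrac{1}{2}\,\bigl\|Y^{(1)} - X B^{(1)}\bigr\|_F^2
  + \tfrac{1}{2}\,\bigl\|Y^{(2)} - X B^{(2)}\bigr\|_F^2
  + \lambda\,\bigl\|B^{(1)}\bigr\|_1
\quad \text{s.t.} \quad
  B^{(2)}_{kj} = 0 \;\; \text{for } (k,j) \notin \Gamma,
  \qquad X \in \mathcal{C}
```

Under this reading, with n cell lines, d drugs, g genes and K pathways, Y(1) is n × d, Y(2) is n × g, X is n × K, B(1) is K × d and B(2) is K × g. For fixed X the problem is convex in (B(1), B(2)), and for fixed B(1), B(2) it is convex in X provided 𝒞 is a convex set.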

• The algorithm
• The optimization problem in iPaD is actually a bi-convex problem, motivating the following block-wise optimization strategy:
– Step 1. Optimize over B(1) and B(2) while keeping X fixed
– Step 2. Optimize over X while keeping B(1) and B(2) fixed
– Step 3. Iterate between Steps 1 and 2 until convergence
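The alternating pattern in Steps 1 to 3 can be illustrated on a stripped-down problem. The sketch below is not iPaD: it drops the sparsity penalty, the known support Γ and any constraint on X, and simply alternates two least-squares solves for Y ≈ X B. The function name and arguments are hypothetical.

```python
import numpy as np

def alternating_least_squares(Y, k, n_iter=50, seed=0):
    """Alternate between B (X fixed) and X (B fixed) for Y ~ X @ B.

    Simplified illustration of block-wise optimization on a bi-convex
    objective; no penalty or constraint is applied (assumption).
    """
    rng = np.random.default_rng(seed)
    n, _ = Y.shape
    X = rng.standard_normal((n, k))  # random starting point
    losses = []
    for _ in range(n_iter):
        # Step 1: optimize B with X fixed (ordinary least squares).
        B = np.linalg.lstsq(X, Y, rcond=None)[0]
        # Step 2: optimize X with B fixed (least squares on the transposed problem).
        X = np.linalg.lstsq(B.T, Y.T, rcond=None)[0].T
        # Track the squared error after each full sweep.
        losses.append(float(np.sum((Y - X @ B) ** 2)))
    return X, B, losses
```

Because each step solves its sub-problem exactly, the recorded loss cannot increase from one sweep to the next, which is the property that makes the block-wise strategy attractive for bi-convex problems.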

• The algorithm
• When X is fixed, optimizing each column of B(1) is a LASSO problem
• When X is fixed, optimizing each column of B(2) is an ordinary least squares (OLS) problem
• When B(1) and B(2) are fixed, X can be optimized using an iterative projected gradient descent algorithm
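To make the LASSO sub-problem concrete, here is a minimal coordinate-descent sketch for one column of B(1), i.e. one drug regressed on the pathway activities in X. The slides do not say which LASSO solver was used, so coordinate descent and the 1/2 scaling of the squared error are assumptions, and `lasso_column` is a hypothetical name.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_column(X, y, lam, n_iter=200):
    """Coordinate descent for min_b 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    b = np.zeros(k)
    col_sq = np.sum(X ** 2, axis=0)       # squared norm of each column of X
    r = np.asarray(y, dtype=float).copy() # residual y - X b (b starts at 0)
    for _ in range(n_iter):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue                  # an all-zero column carries no information
            r += X[:, j] * b[j]           # remove coordinate j's contribution
            rho = X[:, j] @ r             # correlation of column j with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]           # add the updated contribution back
    return b
```

Larger values of `lam` set more coefficients exactly to zero; with `lam = 0` the update reduces to coordinate-wise least squares.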

• Dealing with missing values
– A gene/drug or cell line that is completely missing can be excluded
– However, partially missing genes/drugs or cell lines should be kept in the analysis
• In our block-wise algorithm, B(1) and B(2) can be optimized column by column with the missing values excluded
• However, optimizing X is less straightforward because neither its rows nor its columns can be optimized separately
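The column-by-column treatment of missing values can be sketched for the OLS case (B(2)) as follows. This is an illustration under assumptions: missing entries are encoded as NaN, and `support` is a hypothetical K × p boolean mask standing in for the known index set Γ.

```python
import numpy as np

def ols_columns_with_missing(X, Y, support):
    """Fit each column of B by OLS on the observed rows only.

    X: n x K pathway activity matrix.
    Y: n x p data matrix with NaN marking missing entries (assumed encoding).
    support: K x p boolean mask of coefficients allowed to be non-zero.
    """
    K, p = support.shape
    B = np.zeros((K, p))
    for j in range(p):
        rows = ~np.isnan(Y[:, j])  # cell lines observed for column j
        cols = support[:, j]       # pathways allowed for column j
        if rows.sum() == 0 or cols.sum() == 0:
            continue               # nothing to fit for this column
        coef = np.linalg.lstsq(X[np.ix_(rows, cols)], Y[rows, j], rcond=None)[0]
        B[cols, j] = coef          # entries outside the support stay at zero
    return B
```

Each column uses its own set of observed rows, so a cell line with a few missing measurements still contributes to every column where it was measured.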

• We use the following soft-impute algorithm to optimize X in the presence of missing values
• Ω indexes the observed elements in a matrix and PΩ(·) is an operator that projects a matrix onto the space of its observed elements.
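The algorithm itself is not reproduced in this transcript. The sketch below only illustrates the fill-in idea behind the PΩ notation: missing entries are replaced by the current fit, and a gradient step is then taken on the completed matrix. It omits whatever constraint or shrinkage the original algorithm applies to X, and the NaN encoding, step size and function names are assumptions.

```python
import numpy as np

def project_observed(M, observed):
    """P_Omega: keep the observed entries of M and set the rest to zero."""
    return np.where(observed, M, 0.0)

def update_X_with_missing(X, B, Y, n_iter=500):
    """Gradient updates of X for Y ~ X @ B when Y has NaN entries."""
    observed = ~np.isnan(Y)
    Y_obs = project_observed(Y, observed)
    step = 1.0 / (np.linalg.norm(B, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        fit = X @ B
        # Complete the data: observed values where available, current fit elsewhere.
        Z = Y_obs + project_observed(fit, ~observed)
        # Gradient step on the completed data; the residual is zero at missing entries.
        X = X - step * (fit - Z) @ B.T
    return X
```

Because the residual vanishes on the filled-in entries, each step only moves X in response to the observed data, which is why the rows and columns of X do not need to be optimized separately.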

• Parameter tuning
– There is a parameter λ that controls the sparsity of B(1)
– One way to use the method is to apply a decreasing sequence of λ's to obtain a sequence of solutions for B(1)
– We can also perform cross-validation on the drug response profile matrix Y(1)
[Figure: cross-validation scheme. Green: training data; black: testing data]
• Significance test
– After finding an appropriate λ value, we can perform permutation tests to establish the significance of the identified drug-pathway associations
– Permute the cell lines (rows) in Y(1) while keeping Y(2) unchanged
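The row-permutation idea can be sketched as below. This is a simplified stand-in, not the procedure on the slides: instead of refitting the model on each permuted Y(1), it uses the absolute correlation between each pathway activity (column of X) and each drug response (column of Y(1)) as a placeholder statistic, and reports empirical p-values. The function name and the choice of statistic are assumptions.

```python
import numpy as np

def permutation_pvalues(Y1, X, n_perm=1000, seed=0):
    """Empirical p-values for every pathway-drug pair via row permutation."""
    rng = np.random.default_rng(seed)

    def stat(Y):
        # |Pearson correlation| between each column of X and each column of Y.
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
        return np.abs(Xc.T @ Yc) / Y.shape[0]

    observed = stat(Y1)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Shuffle the cell lines (rows) of Y(1) only; X is left untouched.
        Y_perm = Y1[rng.permutation(Y1.shape[0])]
        exceed += stat(Y_perm) >= observed
    # Add-one correction keeps the p-values strictly positive.
    return (1 + exceed) / (1 + n_perm)
```

Permuting whole rows keeps the correlation structure among drugs intact while breaking the link between drug responses and the expression-derived quantities, which is the null hypothesis of interest.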

• Simulations
– We performed the following four sets of simulations (the 58 pathways in the real data were used; the number of drugs d = 22)
Simulation sets: Sample Size; Sparsity of B(1); Signal-to-Noise Ratio; Unbalanced Signal-to-Noise Ratio
N: 120 240 360 480 480 480
η: 0.1 0.02 0.05 0.2 0.1 0.1
SNR1: 0.5 0.1 0.25 1 0.5
SNR2: 0.5 0.1 0.25 1 0.5 0.1
• The simulated data sets were analyzed by both iFad and iPaD.
• Their performances were evaluated by the area under the ROC curve (AUC)
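For reference, AUC for support recovery can be computed directly from its rank interpretation: the probability that a truly non-zero entry of B(1) receives a higher score than a truly zero entry, with ties counted as one half. The sketch below assumes the score is something like the absolute estimated coefficient; the slides do not state which score was used, and `auc_from_scores` is a hypothetical name.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC as P(score of a positive > score of a negative) + 0.5 * P(tie)."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive against every negative.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

In a simulation one might call `auc_from_scores(np.abs(B1_hat), B1_true != 0)`, where `B1_hat` and `B1_true` are placeholder names for the estimated and simulated loading matrices.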

 • The performances between the two methods are similar

• The performances between the two methods are similar (cont.)
• However, iPaD is much faster
– 1000 iterations of iFad take 4~5 days
– Solving a sequence of λ's with iPaD takes only ~6 minutes

• Real Data Analysis
– We analyzed the CCLE data set described earlier with both iFad and iPaD
– iFad: 2,000 MCMC iterations; iPaD: 10-fold CV followed by 2,000 permutations (the null distribution was approximated using a mixture of a normal distribution and a point mass at zero)
– We call a drug-pathway association validated if the pathway contains at least one protein targeted by the drug
– Among the 58 × 22 = 1276 drug-pathway pairs, 195 pairs are validated associations (195/1276 = 15.3%)
– Considering the randomness in the algorithms, we ran five repeats
– Among the top 50 drug-pathway association pairs identified by iFad, 7.0 pairs (averaged over five repeats) were validated; for iPaD, 16.6
– The top associations identified by iPaD were relatively consistent over the five repeats, but those identified by iFad were not (iFad probably did not converge)
– Running time: 2,000 MCMC iterations took ~230 hours on a standard laptop computer (2.4 GHz dual-core CPU with 8 GB memory running Mac OS X 10.9); 2,000 permutations took ~6 hours for iPaD
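A quick way to put the validated counts in context is to compare them with what a random ranking would give. The sketch below uses only the numbers quoted above (1276 pairs, 195 validated, top 50); treating the top 50 as a random draw without replacement is an assumption made here for the baseline, not something stated on the slides.

```python
from math import comb

def expected_validated(total, validated, top_n):
    """Expected number of validated pairs among top_n randomly chosen pairs."""
    return top_n * validated / total

def hypergeom_tail(k, total, validated, top_n):
    """P(at least k validated pairs) when top_n pairs are drawn at random."""
    denom = comb(total, top_n)
    upper = min(validated, top_n)
    numer = sum(comb(validated, i) * comb(total - validated, top_n - i)
                for i in range(k, upper + 1))
    return numer / denom

baseline = expected_validated(1276, 195, 50)  # about 7.6 pairs by chance
tail_17 = hypergeom_tail(17, 1276, 195, 50)   # chance of 17 or more validated pairs
```

Under this baseline roughly 7.6 of the top 50 pairs would be validated by chance, so the iFad average of 7.0 is close to random while the iPaD average of 16.6 is a little more than twice the chance level.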

 • The Chronic Myeloid Leukemia Pathway

• The ErbB Signaling Pathway

Limitations/Future Work
• Relatively simple additive models
• Limited and unreliable information on pathways
• Pathway network topology not considered
• Other sources of information
• Tradeoff between model simplicity, computational feasibility, and real biological complexity

 • Thank you!
