An Enriched Approach to Combining High-dimensional Genomic and Low-dimensional Phenotypic Data
By Birol Emir, Demissie Alemayehu, Javier Cabrera, Zhenya Cherkas
Outline
• Introduction/Background
• Conventional approaches
• Enriched approach
• Extensions
• Simulation results
• Concluding remarks
Introduction/Background
• Combine and analyze clinical and genetic information (Aramburu et al. 2015)
  – Low-dimensional phenotypic data and high-dimensional SNP/gene expression data
• Adjusting for ancestry information is important in association studies; failing to adjust leads to
  – Reduced power
    • Lower chance of detecting true effects
  – Confounding
    • Higher chance of finding spurious associations
Introduction/Background
• Big Fish Eats Small Fish: the importance of clinical variables tends to diminish in the presence of large volumes of SNP/gene expression data (RNA-seq, microarrays)
• Curse of Dimensionality: variable selection algorithms such as LASSO are mostly successful in discovering signals of dimension ≤ n/log(p) (Cai 2016)
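To give a feel for the n/log(p) bound, here is a rough worked value. It assumes the natural logarithm (the slide does not state the base) and borrows the sizes of the simulated example presented later in this deck (n = 100 subjects, p = 20,000 SNPs); it is only an order-of-magnitude illustration.

```latex
\[
\frac{n}{\log p} \;=\; \frac{100}{\ln 20000} \;\approx\; \frac{100}{9.90} \;\approx\; 10.1
\]
```

On this reading, a signal spread over many more than about ten variables would fall outside the range where LASSO-type selection is expected to work well.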
Standard Analytical Approach to Combine Clinical and Genomic Data Sources
[Flow diagram: Genomic Data → Select SNPs; selected SNPs and Clinical/Demographics Data → Model]
Conventional Approaches for Selecting SNPs
• Alternative approaches are available to reduce dimensionality and perform model selection
  – Univariate screening
  – Multivariable modeling
  – EIGENSTRAT (correcting SNPs for ancestry)
Univariate Screening
• Approach
  – Apply "correlation/prognostic filters" to remove genes/SNPs not related to the outcome of interest
    • Univariate: simple tests such as chi-squared or logistic regression applied to each SNP
    • Typically, hard thresholds from Bonferroni or FDR are used
  – Combine the selected SNPs with the clinical variables and model the combined data
    • Penalized or standard regression models
• Limitations
  – One SNP at a time is not ideal
    • Ignores the association among the SNPs and the clinical variables
  – Ignores ancestry information
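The hard-threshold step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the toy p-values are hypothetical, and "FDR" is assumed here to mean the Benjamini-Hochberg step-up procedure.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of tests with p <= alpha / m (family-wise error control)."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])


# Hypothetical per-SNP p-values from a univariate test.
pvals = [0.001, 0.012, 0.02, 0.03, 0.20, 0.74]
kept_bonferroni = bonferroni(pvals)
kept_bh = benjamini_hochberg(pvals)
print(kept_bonferroni, kept_bh)
```

On this toy input Bonferroni keeps fewer SNPs than the FDR threshold, which is the usual trade-off between the two.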
Multivariable Modeling
• Approach
  – Combine all SNPs and clinical data
  – Apply a penalized regression model to select relevant variables (e.g., group LASSO)
• Limitations
  – Can only detect signals of dimension ≤ n/log(p)
  – Ignores ancestry information
  – Clinical data not optimally utilized due to dimension disparity
EIGENSTRAT
Three steps
1. Apply PCA to genotype data
   • Reduce dimensions, describing as much variability as possible
   • Axes of variation may relate to ancestry differences
2. Continuously adjust genotypes and phenotypes by amounts attributable to ancestry along each axis, by computing residuals of linear regressions
   • Creates a virtual set of matched cases and controls
3. Compute association statistics using ancestry-adjusted genotypes and phenotypes
Ref: Price, A. L., Patterson, N. J., Plenge, R. M. et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904-909. http://www.nature.com/ng/journal/v38/n8/full/ng1847.html
EIGENSTRAT (cont.)
Original EIGENSTRAT procedure
• Code all SNP data as {0, 1, 2}, where 0 = homozygous, 1 = heterozygous, 2 = wild type
• Normalize by subtracting the mean and dividing by the s.d.
• Recode missing genotypes as 0
• Apply PCA to the matrix of coded SNP data
• Extract scores for the first 10 PC axes
• Calculate the modified Armitage trend statistic using the first 10 PC scores as covariates
Ref: Patterson et al. (2006, PLoS Genet 2: e190)
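The coding, normalization and missing-value steps above can be sketched like this. It is an illustrative sketch only: the function name and toy matrix are hypothetical, missing genotypes are represented by `None`, and the s.d. is assumed to be the sample standard deviation of the observed genotypes (the slide does not say which estimator is used).

```python
import math


def normalize_genotypes(G):
    """G: list of subject rows with entries 0/1/2, or None if missing.

    Each SNP column is centered and scaled using its observed values;
    missing entries are set to 0 after normalization (i.e. the column mean).
    """
    n, p = len(G), len(G[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        observed = [row[j] for row in G if row[j] is not None]
        mean = sum(observed) / len(observed)
        # Assumption: sample variance with n - 1 in the denominator.
        var = sum((x - mean) ** 2 for x in observed) / (len(observed) - 1)
        sd = math.sqrt(var)
        for i in range(n):
            g = G[i][j]
            if g is None or sd == 0.0:
                out[i][j] = 0.0  # missing genotype or constant SNP
            else:
                out[i][j] = (g - mean) / sd
    return out


# Toy data: 4 subjects, 2 SNPs, one missing genotype.
G = [[0, 2], [1, None], [2, 2], [1, 0]]
Z = normalize_genotypes(G)
```

PCA would then be applied to `Z`, keeping the scores on the first 10 axes.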
EIGENSTRAT (cont.)
To correct SNPs for ancestry
• One approach: apply PCA to the SNPs and estimate the PCs that represent ancestry (including gender, race, etc.)
• Apply univariate selection after subtracting the ancestry-related PCs from both the SNPs and the outcome of interest
  – $S$: $n \times 10^6$ matrix of normalized SNPs
  – $L_1, L_2, \ldots, L_k$: $1 \times 10^6$ loading vectors
  – $S'_{i\cdot} = S_{i\cdot} - \sum_{j=1}^{k} (S_{i\cdot} L_j^{\top})\, L_j$ for $i = 1, \ldots, n$
  – Ancestry removal is necessary to avoid spurious relationships arising from data imbalance (such as all white males in one group)
  – Similarly correct the response $Y$ for ancestry, giving $Y'$
  – Apply univariate association tests to the corrected $S'$ and $Y'$
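The row-wise correction above can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the loading vectors are assumed orthonormal, and the response is assumed to be adjusted by regressing it on mutually orthogonal PC score vectors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_axes(rows, loadings):
    """Subtract from each row S_i its projection (S_i . L_j) L_j on each loading."""
    adjusted = []
    for s in rows:
        r = list(s)
        for L in loadings:
            c = dot(s, L)  # score of this subject on axis L
            r = [ri - c * Li for ri, Li in zip(r, L)]
        adjusted.append(r)
    return adjusted


def adjust_response(y, scores):
    """Center y, then remove its regression on each PC score vector.

    Assumption: the score vectors are mutually orthogonal, so the
    regressions can be removed one axis at a time.
    """
    mean = sum(y) / len(y)
    r = [v - mean for v in y]
    for t in scores:
        beta = dot(r, t) / dot(t, t)
        r = [ri - beta * ti for ri, ti in zip(r, t)]
    return r


# Toy example: one ancestry axis aligned with the first SNP.
S = [[3.0, 1.0, 2.0], [-1.0, 4.0, 0.0]]
S_adj = remove_axes(S, [[1.0, 0.0, 0.0]])

# Toy response perfectly explained by one score vector.
y_adj = adjust_response([1, 0, 1, 0], [[1.0, -1.0, 1.0, -1.0]])
```

After the correction, every row of `S_adj` is orthogonal to the removed axis, and `y_adj` has no remaining linear association with the score vector.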
EIGENSTRAT (cont.)
• Select SNPs using an appropriate threshold: Bonferroni or FDR
• Combine the selected SNP data (S' and Y') with the clinical variables and model the combined data
  – Penalized or standard regression models
• Limitations
  – Ancestry may be present in more than a few PCs
    • This may make analysis of the signal in the remaining PCs very doubtful
  – One SNP at a time is not ideal
    • Ignores association among SNPs and the clinical variables
Proposed Approach to Combine Clinical and Genomic Data Sources
[Flow diagram. Boxes: Clinical/Demographics Data; Genomic Data; Calculate the weights for both data sets; Enriched EIGENSTRAT; Use Enriched Selection of SNPs; Enriched Penalized Modeling; Model]
Enriched Approach
Main idea:
• In most analytical approaches, weights are applied to observations rather than variables
• Here, apply weights to variables, correcting for the spurious information contained in each variable (Amaratunga, Cabrera et al. 2014)
1. Construct FDR corrected weights
2. Model the data using the weighted variables, either directly or using ensembles
FDR Corrected Weights
Assuming a uniform null distribution of p-values, $p_i \sim \mathrm{Uniform}[0, 1]$, use either
  – $q_{(i)} = p_{(i)}^{FDR} = p_{(i)} / (i/G)$; $W_i = 1/q_i$ or $W_i = -\log(q_i)$
or
  – $q_{(i)} = p_{(i)}^{FDR} = p_{(i)} / p_{(i),a}$; $W_i = 1/q_i$ or $W_i = -\log(q_i)$
• $G$: number of variables
• $i/G$: approximate expected value of the $i$th smallest p-value under the null
• $a$ is a tuning constant
• $p_{(i),a}$ is the $a$-percentile of the distribution of the $i$th order statistic (used instead of the mean $i/G$)
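A minimal sketch of the first variant, q(i) = p(i) / (i/G), is given below. The function name and toy p-values are hypothetical. Two details are assumptions not stated on the slide: q is capped at 1 so that the −log weights stay non-negative, and p-values are floored at a small positive number to avoid dividing by or taking the log of zero.

```python
import math


def fdr_weights(pvalues, use_log=True, floor=1e-300):
    """Weights W_i from q_(i) = p_(i) / (i / G), with i the rank of p_i."""
    G = len(pvalues)
    order = sorted(range(G), key=lambda i: pvalues[i])
    weights = [0.0] * G
    for rank, idx in enumerate(order, start=1):
        p = max(pvalues[idx], floor)   # assumption: guard against p == 0
        q = min(1.0, p / (rank / G))   # assumption: cap q at 1
        weights[idx] = -math.log(q) if use_log else 1.0 / q
    return weights


# Hypothetical p-values for G = 4 variables.
pvals = [0.0001, 0.5, 0.02, 0.9]
w = fdr_weights(pvals)
```

Variables with small p-values relative to their null expectation receive large weights; variables that look like noise receive weights near zero. The second variant would replace `rank / G` with a percentile of the distribution of the ith order statistic.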
Enriched EIGENSTRAT
• For the data containing SNPs, apply the weights to each SNP as follows:
  – $S$: normalized matrix of SNPs, coded as previously, of dimension $N \times g$
  – $N$: number of observations
  – $g$: number of SNPs
  – $W$: vector of weights, one per SNP
  – $S^* = S\,\mathrm{diag}(W)$, i.e. column $j$ of $S$ is multiplied by $W_j$
• Proceed with $S^*$ in place of $S$, applying PCA to the genotype data as in the original EIGENSTRAT
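The effect of weighting before PCA can be illustrated on a toy matrix. This is only a sketch: the names are hypothetical, the weights are made up rather than derived from p-values, and the leading principal axis is found by plain power iteration instead of a full PCA routine.

```python
import math
import random


def scale_columns(S, W):
    """S* = S diag(W): multiply column j of S by weight W[j]."""
    return [[s * w for s, w in zip(row, W)] for row in S]


def first_pc(X, iters=200, seed=0):
    """Leading principal axis of column-centered X via power iteration on X^T X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iters):
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in Xc]
        v = [sum(scores[i] * Xc[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v


# Toy data: SNP 0 has large but uninformative variance, SNP 2 a small signal.
S = [[2.0, 0.0, 0.5], [-2.0, 0.0, 0.5], [2.0, 0.0, -0.5], [-2.0, 0.0, -0.5]]
W = [0.1, 1.0, 5.0]  # hypothetical weights: down-weight SNP 0, up-weight SNP 2

v_plain = first_pc(S)
v_enriched = first_pc(scale_columns(S, W))
```

Without weights the first axis follows the high-variance SNP; after weighting it follows the SNP that carries the up-weighted signal.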
Enriched Penalized Modeling •
Enriched Penalized Modeling (cont. ) •
Simulated Example
• We generated two groups of subjects, each of size 50: one with flares (non-responders) and the other without flares (responders), mimicking an RCT in lupus
• For each subject we generated 20,000 SNPs, of which 5% have a mild signal above the noise level and 0.1% have a moderate/strong signal
• 8 clinical variables (age, gender, BMI, baseline disease activity variables)
• The outcome variable is a binary indicator of the presence of flares
• Age and gender have been shown to be associated with flares
• The correlation structure between and within groups was set to be similar to that observed in a clinical trial conducted in lupus patients
Results from Simulated Data
• Applying the EIGENSTRAT approach to the SNPs encoded numerically (0, 1 and 2)
• Three clusters can be seen, separating three different ancestries (three ovals)
[Figure: first two principal components. Green: responder; red: non-responder]
Results from Simulated Data (cont.)
• Plots of the first 10 EIGENSTRAT principal components show sub-optimal separation
Results from Simulated Data (cont.)
• The best separation of the groups appeared in PC 6, PC 8 and PC 10
• Significant overlap of the groups is still apparent
Enriched EIGENSTRAT Approach
• Applying the Enriched EIGENSTRAT approach to the SNPs encoded numerically (0, 1 and 2)
• Apparent separation of not only responders and non-responders, but also males and females, along the first two principal components
[Figure: first two principal components. Green: responder; red: non-responder; open circles: females; solid circles: males]
Alternative SNP Representation
• Alternative representation of the SNPs:
  – Use two variables to encode each SNP (in reference to wild type)
  – X1 codes for heterozygous
  – X2 codes for homozygous
  – "Dummy variables"
  – Preferred when homozygous is rare
• To construct two weights, one for each of X1 and X2:
  – Construct a 2 x 2 table and use the result of an association test (chi-squared)
  – Proceed similarly for the weights calculation using the q-value
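The dummy encoding and the 2 x 2 association test can be sketched as below. The function names and toy data are hypothetical. The sketch follows the deck's coding (0 = homozygous, 1 = heterozygous, 2 = wild type) and assumes a Pearson chi-squared test with 1 degree of freedom and no continuity correction.

```python
import math


def dummy_encode(genotypes):
    """Two indicators per SNP, with wild type (code 2) as the reference."""
    x1 = [1 if g == 1 else 0 for g in genotypes]  # heterozygous
    x2 = [1 if g == 0 else 0 for g in genotypes]  # homozygous
    return x1, x2


def chi2_2x2_pvalue(x, y):
    """P-value of the Pearson chi-squared test for two binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence of association
    stat = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


# Toy SNP for 8 subjects and a binary flare indicator.
genotypes = [1, 1, 1, 2, 2, 2, 0, 2]
flare = [1, 1, 1, 0, 0, 0, 1, 0]
x1, x2 = dummy_encode(genotypes)
p1 = chi2_2x2_pvalue(x1, flare)
p2 = chi2_2x2_pvalue(x2, flare)
```

The two p-values would then be ranked together with those of the other variables and converted to weights through the q-value, giving one weight for X1 and one for X2.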
Simulated Data Results for Alternative SNP Encoding
• Applying the Enriched EIGENSTRAT approach to the SNPs encoded as two "dummy variables"
• Even more apparent and interpretable separation of not only responders and non-responders, but also males and females, along the first two principal components of the Enriched EIGENSTRAT
[Figure: first two principal components. Green: responder; red: non-responder; open circles: females; solid circles: males]
• PC 1: higher (+) values are associated with females (open circles) and lower (-) values with males (solid circles)
• PC 2: higher (+) values are associated with non-responders (red) and lower (-) values with responders (green)
Results from Simulated Data (cont.)
• The original EIGENSTRAT approach applied to the SNPs encoded as two "dummy variables" did not yield similar results
Results from Simulated Data (cont.)
• Moreover, the original EIGENSTRAT approach gave similar results whether the SNPs were encoded as two "dummy variables" or coded numerically as {0, 1, 2}
[Figure panels: EIGENSTRAT with SNPs as continuous; EIGENSTRAT with SNPs as 2 dummies]
Numerical Results Comparison in the Simulated Dataset
• Comparison of the top SNPs identified indicated that Enriched EIGENSTRAT was able to
  – Correctly identify the top SNPs used for data simulation
  – Produce fewer spurious associations
  – Identify significant clinical variables along with SNPs
[Table: Regular EIGENSTRAT vs. Enriched EIGENSTRAT]
Concluding Remarks
• Standard modeling and variable selection approaches are in routine use, since they are intuitive and appealing
• However, in practice, their implementation is often infeasible or unreliable
  – Curse of dimensionality
  – High level of spurious information in data
• The proposed idea is a viable alternative: it applies weights to variables rather than observations, thereby dampening the spurious information contained in each variable
• It offers the attractive possibility of using genetic and clinical information simultaneously in the model
Future Work • An R package is underway and will be submitted to CRAN
References
• Amaratunga, D., & Cabrera, J. (2001). Analysis of data from viral DNA microchips. Journal of the American Statistical Association, 96(456), 1161-1170.
• Amaratunga, D., Cabrera, J., & Lee, Y. S. (2008). Enriched random forests. Bioinformatics, 24(18), 2010-2014.
• Amaratunga, D., Cabrera, J., Cherckas, Y., & Lee, Y. S. (2012). Ensemble classifiers. In Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman (pp. 235-246). Institute of Mathematical Statistics.
• Aramburu, A., Zudaire, I., Pajares, M. J., et al. (2015). Combined clinical and genomic signatures for the prognosis of early stage non-small cell lung cancer based on gene copy number alterations. BMC Genomics, 16, 752. doi: 10.1186/s12864-015-1935-0.
• Cabrera, J., & Yu, C. (2007). Estimating the proportion of differentially expressed genes in comparative DNA microarray experiments. IMS Lecture Notes-Monograph Series, no. 54 (Regina Liu, William Strawderman, Cun-Hui Zhang, Eds.).
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267-288.
BACK-UP
Abstract We consider an approach for combining and analyzing high dimensional genomic and low dimensional phenotypic data. The approach involves use of a scheme of weights attached to the variables instead of the observations and, hence, permits incorporation of the information provided by the low dimensional data source. This approach can be incorporated into commonly used techniques, including EIGENSTRAT, random forests and penalized regression.
General FDR Weights
• Given any procedure that produces an individual p-value for each predictor, we can assign weights as $\lambda = h(p^{FDR}) \approx -\log(p^{FDR})$
• Direct use of weights
  – Weighted Random Forest, Trees, LDA, SVM
  – As individual shrinkage parameters: Lasso, Elastic Net, Ridge
• Indirect use as part of the ensemble algorithm:
  1. Draw a bootstrap sample from the data. Call the observations which are not in the bootstrap sample the "out-of-bag" data.
  2. Generate m randomly selected features according to the weights {wi} and use them together with the bootstrap sample to construct a classifier.
  3. Use the classifier to predict the out-of-bag data to form majority votes.
  4. Repeat steps 1-3 N times and collect an ensemble of N rules.
  – Prediction of test data is done by majority vote over the predictions from the ensemble of rules.
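The four ensemble steps can be sketched as follows. This is an illustrative sketch, not the authors' algorithm as implemented: all names and the toy data are hypothetical, a nearest-centroid rule stands in for whatever base classifier is actually used, and features are drawn with replacement and de-duplicated, so a rule may use fewer than m distinct features.

```python
import random


def fit_centroids(X, y, rows, feats):
    """Per-class mean of the selected features over the given rows."""
    cents = {}
    for label in set(y[i] for i in rows):
        members = [i for i in rows if y[i] == label]
        cents[label] = [sum(X[i][f] for i in members) / len(members) for f in feats]
    return cents


def predict_one(x, feats, cents):
    """Label of the nearest class centroid on the selected features."""
    def dist(c):
        return sum((x[f] - cj) ** 2 for f, cj in zip(feats, c))
    return min(cents, key=lambda label: dist(cents[label]))


def enriched_ensemble(X, y, weights, m=2, n_rules=50, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    rules, oob_votes = [], [[] for _ in range(n)]
    while len(rules) < n_rules:
        boot = [rng.randrange(n) for _ in range(n)]      # step 1: bootstrap
        if len({y[i] for i in boot}) < 2:
            continue                                     # redraw if a class is missing
        oob = set(range(n)) - set(boot)
        # Step 2: draw m features with probability proportional to the weights.
        feats = sorted(set(rng.choices(range(p), weights=weights, k=m)))
        cents = fit_centroids(X, y, boot, feats)
        for i in oob:                                    # step 3: out-of-bag votes
            oob_votes[i].append(predict_one(X[i], feats, cents))
        rules.append((feats, cents))                     # step 4: collect the rule
    return rules, oob_votes


def majority_vote(rules, x):
    votes = [predict_one(x, feats, cents) for feats, cents in rules]
    return max(set(votes), key=votes.count)


# Toy data: feature 0 carries the signal, features 1 and 2 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[2.0 * lab + data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)]
     for lab in y]
rules, oob_votes = enriched_ensemble(X, y, weights=[5.0, 0.5, 0.5])
```

A new observation is then classified with `majority_vote(rules, x)`. Because the informative feature has the largest weight, it enters most rules, which is the intended effect of the enrichment.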
Results from Simulated Data
• Enriched EIGENSTRAT approach: summary of the first 2 PCs using 2 dummy variables per SNP and 1 weight per SNP
[Figure: first two principal components. Green: responder; red: non-responder; open circles: females; solid circles: males]
• PC 1: higher (+) values are associated with females (open circles) and lower (-) values with males (solid circles)
• PC 2: higher (+) values are associated with non-responders (red) and lower (-) values with responders (green)
• Similar to the previous figure with 2 weights per SNP, but perhaps not as good