A Shrinkage Regression Approach to Tackle the HLA Region
Charlotte Vignal
Variable Selection Workshop, Vienna, July 26th 2008

Outline
• Overview of the HLA system and the challenge of analysing data from the HLA region
• Multivariate association test using a Bayesian-inspired shrinkage regression approach
• Application to the rheumatoid arthritis case-control study
• Conclusion

The Human Leukocyte Antigen System
• A genomic region found in almost all vertebrates, the major histocompatibility complex (MHC); gene composition and arrangement vary between species
• In humans, the MHC is the HLA system
• A set of genes encoding proteins essential to immune response
• Major role in histocompatibility and protection against pathogens
[Figure: MHC gene organisation in mouse, rat, chimpanzee and human. Kelley et al., Immunogenetics (2005)]

The Challenge
• Susceptibility to many complex disorders maps to the HLA region
• The high degree of correlation within the region hampers the identification of causal variants
• Widely used approaches test the effect of one genetic variable at a time
• Methods are required that allow the detection of (possibly multiple) causal variants among highly correlated data

Multi-SNP Methods can be more Powerful than Single-SNP Analyses
• Multivariate logistic regression
– Problematic when nVars >> nObs
– Stepwise procedures can be unstable in the presence of many highly correlated terms
• Shrinkage method using Bayesian logistic regression
– A variable selection approach
– Based on the Least Absolute Shrinkage and Selection Operator (LASSO) approach (Tibshirani 1996)
– Fast implementation using the Bayesian Binary Regression (BBR) software for text-categorisation analysis (Genkin et al. 2004, http://www.stat.rutgers.edu/~madigan/BBR)

Bayesian Logistic Regression for variable selection
• Each coefficient βj has a Laplace prior distribution with mode 0 and prior variance ν = 2/λ², where λ is the penalty factor
– Mode 0 encodes a prior belief of no effect
– The prior variance determines the strength of this belief and hence the sparseness of the fitted model
• The maximum a posteriori (posterior mode) estimates are often zero or else shrunk towards zero
• Terms with non-zero estimates are included in the final model and treated as significant
• The value of the estimate gives a (shrunk) measure of effect size
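For reference, the prior and the resulting estimator can be written out as below. This is the standard correspondence between a Laplace prior and an L1 (LASSO) penalty; the symbols n, x_i and y_i (number of subjects, covariate vector and case/control status of subject i) are notation assumed here, not taken from the slides.

```latex
p(\beta_j \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |\beta_j|},
\qquad
\nu = \operatorname{Var}(\beta_j) = \frac{2}{\lambda^{2}}

\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta} \Big[ \sum_{i=1}^{n} \log p(y_i \mid x_i, \beta)
      \;-\; \lambda \sum_{j} |\beta_j| \Big]
```

A larger λ means a smaller prior variance, a stronger penalty and hence more coefficients estimated as exactly zero.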

The Density of the Laplace Distribution
[Figure: Laplace density p(x) plotted against x]
! Effect size estimates are biased towards zero; over-shrinking true effects can lead to non-causal correlated variables being retained
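The shape of this prior can be checked numerically. The sketch below is illustrative only: it codes the Laplace density with mode 0 and rate λ (the function names are hypothetical) and uses a simple midpoint sum to check that the density integrates to 1 and has variance 2/λ².

```python
import math


def laplace_pdf(x, lam):
    """Laplace (double-exponential) density with mode 0 and rate lam."""
    return 0.5 * lam * math.exp(-lam * abs(x))


def midpoint_moments(lam, half_width=20.0, n_bins=40000):
    """Approximate the total mass and the second moment by a midpoint sum."""
    h = 2.0 * half_width / n_bins
    mass = 0.0
    second = 0.0
    for k in range(n_bins):
        x = -half_width + (k + 0.5) * h
        d = laplace_pdf(x, lam) * h
        mass += d
        second += x * x * d
    return mass, second


lam = 1.5  # arbitrary illustrative value, not the lambda = 62 used in the talk
mass, variance = midpoint_moments(lam)
```

Because the mean is 0, the second moment equals the variance. A much larger λ concentrates the density near zero and would need a finer grid than the one assumed here.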

Application: The Rheumatoid Arthritis Dataset
• RA is an autoimmune disease and a complex disorder
– Estimated genetic contribution of ~30-50%
– The HLA region is strongly implicated in RA susceptibility
– Genetic associations reported with a biomarker called the shared epitope (SE), defined by a class of alleles at HLA-DRB1
– The mechanism by which RA is determined is still unknown
• Is the SE association the only HLA effect predisposing to RA?
• The subjects: 842 RA cases and 957 controls (774 cases and 945 controls with no missing data were analysed)
• The independent variables:
– 2,302 genetic markers, each a continuous variable coded as 0, 1 or 2 based on the number of allele copies
– The shared epitope, a continuous variable coded as 0, 1 or 2 based on the number of shared epitope positive (SE+) alleles
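The additive 0/1/2 coding and the restriction to complete records can be sketched as follows. This is a minimal illustration under assumed input formats (two-letter genotype strings, a pair of HLA-DRB1 allele labels, `None` for missing values); the function names and allele labels are placeholders and do not come from the study.

```python
def allele_copies(genotype, allele):
    """Additive coding: number of copies (0, 1 or 2) of `allele` in a genotype."""
    return sum(1 for a in genotype if a == allele)


def se_copies(drb1_alleles, se_positive):
    """Number of SE+ alleles (0, 1 or 2) among a subject's two HLA-DRB1 alleles."""
    return sum(1 for a in drb1_alleles if a in se_positive)


def complete_cases(rows):
    """Keep only subjects with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r)]


# Hypothetical toy inputs; the allele labels are placeholders, not real SE alleles.
se_positive = {"DRB1_x", "DRB1_y"}
codes = [allele_copies(g, "G") for g in ["AA", "AG", "GG"]]
se_code = se_copies(("DRB1_x", "DRB1_z"), se_positive)
kept = complete_cases([(0, 1, 2), (1, None, 0), (2, 2, 1)])
```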

The Effect of Shared Epitope on RA

Effect        Contrast               Wald P      OR [95% CI]
SE carriage   SE+ vs. SE−            < 0.0001    5.1 [4.1; 6.3]
SE+ copies    1 copy vs. 0 copy      < 0.0001    3.7 [2.9; 4.6]
              2 copies vs. 1 copy                3.2 [2.4; 4.3]
              2 copies vs. 0 copy                11.8 [8.6; 16.1]

Ø The presence of SE is strongly associated with RA
Ø Increasing risk for RA associated with the number of SE+ allele copies
Ø The objective: to investigate the presence of additional causal variants in the HLA region, possibly correlated with SE
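The reported odds ratios are multiplicatively consistent: 3.7 × 3.2 ≈ 11.8 for 2 copies vs. 0 copies. The slides do not give the underlying counts or state how the intervals were computed, so the sketch below only illustrates one common approach, an odds ratio with a Wald 95% confidence interval from a 2×2 table, using hypothetical counts.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a, b: exposed cases, exposed controls
    c, d: unexposed cases, unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
or_hat, lo, hi = odds_ratio_wald_ci(20, 10, 10, 20)
```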

Specification of the Penalty λ
• Cases and controls permuted 100 times for each λ within each SE group (i.e. SE effect retained)
• SE (additive term) included in each model
• λ selected if the number of false positives per model < 1
Ø λ = 62 was selected for further analyses
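Permuting case/control labels within each SE group keeps the number of cases per group fixed, so the SE effect survives while any other association is broken. A minimal sketch of that step is given below; the function names and the toy data are hypothetical, and the model fit that would produce the per-permutation selection counts is not shown.

```python
import random


def permute_within_strata(labels, strata, rng):
    """Shuffle case/control labels separately within each stratum."""
    permuted = list(labels)
    groups = {}
    for idx, s in enumerate(strata):
        groups.setdefault(s, []).append(idx)
    for idxs in groups.values():
        shuffled = [labels[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, lab in zip(idxs, shuffled):
            permuted[i] = lab
    return permuted


def mean_false_positives(selected_counts):
    """Average number of variables selected per permuted data set."""
    return sum(selected_counts) / len(selected_counts)


# Hypothetical toy data: 1 = case, 0 = control; strata = number of SE+ copies.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
strata = [0, 0, 0, 1, 1, 1, 2, 2]
permuted = permute_within_strata(labels, strata, random.Random(0))
```

For each candidate λ, the model would be refitted on each permuted data set and the number of selected markers recorded; `mean_false_positives` then gives the quantity compared against 1.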

The Effect of Shrinking a True Effect
[Figure: R² between each genetic variable and SE across the HLA region]
Ø In blue are the genetic variables selected by BLR in addition to SE
Ø Three of the selected variables are correlated with SE
Ø Shrinking a known effect may cause correlated SNPs to be selected
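The R² plotted here is taken to be the squared Pearson correlation between a marker's 0/1/2 code and the SE code; that reading is an assumption, since the slides do not define it. A minimal sketch with hypothetical toy data:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical genotype codes for six subjects.
se = [0, 1, 2, 0, 1, 2]
snp_linked = [0, 1, 2, 0, 1, 1]    # mostly tracks SE
snp_unlinked = [1, 0, 1, 1, 2, 1]  # unrelated to SE in this toy sample

r2_linked = r_squared(snp_linked, se)
r2_unlinked = r_squared(snp_unlinked, se)
```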

The Effect of Shrinkage on True Effects
• To investigate the effect of shrinkage, SE was included twice (SE & SEfake) in the model:
Ø When SE and SEfake are shrunk, both variables are retained
– Shrinking a known effect may cause correlated SNPs to be selected
Ø When SE is not shrunk, only SE is retained
– Correlated SNPs could be eliminated
• The shrinkage factor was not applied to SE in subsequent analyses (λ = 0)
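The SE/SEfake experiment can be mimicked on toy data. The sketch below is not the BBR software: it is a small proximal-gradient (soft-thresholding) fit of L1-penalised logistic regression with one penalty per coefficient, written for illustration. The data, step size, penalties and the absence of an intercept are all assumptions. Because the fit starts from zero and the two columns are identical, the shrunk pair shares the effect equally; a different optimiser could split it differently.

```python
import math


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_penalised_logistic(X, y, penalties, step=0.05, n_iter=3000):
    """Posterior-mode fit by proximal gradient descent (no intercept, for brevity).

    penalties[j] is the Laplace penalty on coefficient j; 0 means no shrinkage.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (mu - yi) * xi[j]
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g, lam in zip(beta, grad, penalties)]
    return beta


# Hypothetical toy data: SE copies (0/1/2) and case status for 8 subjects.
se = [0, 0, 1, 1, 1, 1, 2, 2]
status = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[s, s] for s in se]  # column 0 = SE, column 1 = SEfake (exact copy)

both_shrunk = fit_penalised_logistic(X, status, penalties=[0.5, 0.5])
se_unshrunk = fit_penalised_logistic(X, status, penalties=[0.0, 0.5])
```

With both copies penalised, both coefficients stay non-zero; with the penalty on SE set to 0, SE absorbs the whole effect and SEfake is set to exactly zero, which matches the behaviour described on the slide.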

BLR and Correlated Data
• Can the BLR approach distinguish positive effects from spurious associations in the presence of correlation?
Ø 4 variables correlated with SE were used to evaluate error rates and power
Ø Records of each variable re-distributed in cases and controls to achieve different sizes of OR while maintaining correlation with SE
Ø Error rate and power assessed by permuting cases and controls
– Error rate: frequency of the variables selected beyond SE and the simulated correlated variables over 100 permutations
– Power: frequency of selection of the simulated variables over 100 permutations
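Given the set of variables selected in each permuted analysis, the two summaries can be computed as below. This is one reading of the definitions on the slide, with hypothetical names and toy results: power is the fraction of runs in which a simulated variable is selected, and the error rate is expressed as the mean number of other selected variables per run.

```python
def summarise_selection(selected_per_run, simulated, always_in=("SE",)):
    """Return per-variable power and the mean number of false positives per run."""
    n_runs = len(selected_per_run)
    power = {
        v: sum(1 for s in selected_per_run if v in s) / n_runs
        for v in simulated
    }
    excluded = set(simulated) | set(always_in)
    false_pos = [len(set(s) - excluded) for s in selected_per_run]
    return power, sum(false_pos) / n_runs


# Hypothetical selection results from four permuted analyses.
runs = [
    {"SE", "sim1", "snp9"},
    {"SE", "sim1"},
    {"SE"},
    {"SE", "sim1", "snp3", "snp9"},
]
power, fp_per_run = summarise_selection(runs, simulated=["sim1"])
```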

Power
• Selection of simulated variables correlated with SE
Ø Variables moderately correlated with SE selected if OR > 2
Ø Variables highly correlated with SE selected if OR > 5

Error Rate
• Selection of simulated variables correlated with SE
Ø Under the null, expect 1 false positive per analysis (λ = 62)
Ø Analysis generates 1 to 2 false positives per analysis

ATT-BLR Results Comparison
• Data were analysed by Armitage Trend Test (ATT) and BLR

SNP (DE)    P_ATT      P_ATT-adj
SE          4.2e-61
snp292      1.9e-6     0.03
            2.6e-6     0.02
snp271      3.2e-5     0.02
snp645      9.6e-5     8.5e-6
snp068      2.2e-5     0.04
snp384      2.4e-5     0.002
snp465      9.7e-6     0.03
snp156      2.3e-5     0.001
snp225      3.1e-6     0.05
snp576

• With λ = 62, BLR identified 10 SNPs
• Single-point analysis using ATT identified 109 associated SNPs at α = 4.34e-04 = 1/2302
• Variables selected by BLR are not correlated with SE
[Figure: R² (SNP, SE)]
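For comparison, the single-point test can be sketched as a Cochran-Armitage trend test on a 2×3 genotype table with scores 0, 1, 2, using the textbook chi-square form with 1 degree of freedom. The counts in the test data are hypothetical, and the slides do not say which software or exact variant of the test was used.

```python
import math


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases[i] and controls[i] are counts of subjects carrying i allele copies.
    Returns the 1-df chi-square statistic and its P value.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    sum_rx = sum(r * x for r, x in zip(cases, scores))
    sum_nx = sum(t * x for t, x in zip(totals, scores))
    sum_nx2 = sum(t * x * x for t, x in zip(totals, scores))
    numerator = n * (n * sum_rx - n_cases * sum_nx) ** 2
    denominator = n_cases * (n - n_cases) * (n * sum_nx2 - sum_nx ** 2)
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


alpha = 1.0 / 2302  # threshold quoted on the slide (about 4.34e-04)

# Hypothetical counts by number of allele copies (0, 1, 2).
chi2_assoc, p_assoc = armitage_trend_test([10, 20, 70], [70, 20, 10])
chi2_null, p_null = armitage_trend_test([10, 20, 10], [10, 20, 10])
```

A marker would be declared associated when its P value falls below `alpha`, i.e. one expected false positive across the 2,302 markers if they were independent.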

Additional Analysis: The NEG Distribution
• Data re-analysed using the normal-exponential-gamma (NEG) prior with parameters set to expect 1 false positive per model (Hoggart et al., PLoS (2008))
! NEG has heavier tails to allow sparser solutions

Additional Analysis: The NEG Distribution
Ø NEG identified 4 variables, of which three (snp271, snp384, snp545) were also retained by DE
Ø Variables identified with the NEG prior are less correlated among themselves and with SE than those selected using DE
Ø Three of the selected variables are in genes/regions reported to contribute to RA susceptibility: BAT1 and HLA-DQA1/DQB1

Conclusions
• BLR appears to perform better than single-point association analysis (ATT) when data are correlated
Ø Computationally efficient
Ø Identifies fewer positive results (10 vs. 109)
Ø Correlation might be more effectively handled
• Simulation analyses confirm reasonable power and error rate
• Three variables identified by both DE and NEG priors lie in genes previously implicated in RA
• Results suggest the presence of independent RA-associated effects in the HLA region

Acknowledgements
• David Balding, Imperial College, UK
• Clive Hoggart, Imperial College, UK
• Aruna Bansal, GSK, UK
• The Genetics Division at GSK