Computational methods for the analysis of rare variants

Shamil Sunyaev
Harvard-M.I.T. Health Sciences & Technology Division

Combine all non-synonymous variants in a single test
Theory:
1) Most new missense mutations are functional (mutagenesis, population genetics, comparative genomics)
2) Most new missense mutations are only weakly deleterious (population genetics)
3) Most functional missense mutations are likely to influence phenotype in the same direction (mutagenesis, medical genetics)
Data: multiple candidate gene studies (HDL-C, LDL-C, triglycerides, BMI, blood pressure, colorectal adenomas)
Kryukov et al., PNAS 2009
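One simple way to combine variants in a single test is to collapse each individual's non-synonymous variants into a carrier flag and compare carrier counts between cases and controls. The sketch below is an illustration under stated assumptions, not the method of the cited study: the function names and input layout are hypothetical, and a one-sided Fisher exact test is used because point 3 above suggests that functional variants act in the same direction.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value, P(X >= a), for the table [[a, b], [c, d]].

    Rows are cases/controls, columns are carriers/non-carriers.
    """
    n_cases = a + b
    n_carriers = a + c
    n_total = a + b + c + d
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_carriers)


def count_carriers(genotypes):
    """Count individuals carrying at least one minor allele at any variant.

    genotypes: one list of per-variant minor-allele counts per individual
    (hypothetical input layout).
    """
    return sum(1 for person in genotypes if any(g > 0 for g in person))


def collapsed_burden_test(case_genotypes, control_genotypes):
    """Return (case carriers, control carriers, one-sided p-value)."""
    a = count_carriers(case_genotypes)
    c = count_carriers(control_genotypes)
    b = len(case_genotypes) - a
    d = len(control_genotypes) - c
    return a, c, fisher_one_sided(a, b, c, d)
```

For example, `collapsed_burden_test([[1, 0], [0, 1], [0, 2], [0, 0]], [[0, 0]] * 4)` tests three carriers among four cases against no carriers among four controls.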

Combining variants in a single test
[Figure: variants observed in Disease and Control samples, labeled as sequencing errors, functional variants, and neutral variants]

We should focus on functionally significant variation
Assign a genotypic score to each gene (or pathway) in each individual in the study. Genotypic scores take into account:
• The probability that variation is real
• The probability that variation is functionally significant
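A genotypic score of this kind could be sketched as follows. The slide does not give a formula, so the combination rule here is an assumption: each variant contributes its allele count weighted by the product of the two probabilities, and contributions are summed over the gene. The function name and arguments are hypothetical.

```python
def genotypic_score(allele_counts, p_real, p_functional):
    """Per-gene score for one individual.

    allele_counts: minor-allele count (0, 1 or 2) at each variant in the gene
    p_real:        probability that each variant call is real
    p_functional:  probability that each variant is functionally significant

    Assumption: the two probabilities multiply and variants add up.
    """
    if not (len(allele_counts) == len(p_real) == len(p_functional)):
        raise ValueError("all inputs must describe the same variants")
    return sum(
        count * real * functional
        for count, real, functional in zip(allele_counts, p_real, p_functional)
    )
```

With every probability set to 1 the score reduces to a plain count of minor alleles, so unweighted collapsing is a special case of this sketch.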

Most tests can be generalized to use genotypic scores
[Figure: Disease and Control groups with per-individual genotypic scores (Score 1, Score 2, Score 3)]
For quantitative traits we regress trait values on genotypic scores.
Software prototypes exist.
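The regression of trait values on genotypic scores can be illustrated with ordinary least squares on a single predictor. This is a minimal sketch, not the software prototype mentioned above; the function name is hypothetical and covariates are omitted.

```python
from math import sqrt


def regress_trait_on_score(scores, traits):
    """Simple linear regression of trait values on genotypic scores.

    Returns (slope, intercept, t_statistic). The t statistic is NaN when the
    fit is perfect, because the standard error is then zero.
    """
    n = len(scores)
    if n < 3 or n != len(traits):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(scores) / n
    mean_y = sum(traits) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("genotypic scores do not vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, traits))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(scores, traits))
    se = sqrt(rss / (n - 2) / sxx)
    t_stat = slope / se if se > 0 else float("nan")
    return slope, intercept, t_stat
```

A positive slope indicates that higher genotypic scores go with higher trait values; the significance of the t statistic can be assessed analytically or by permuting trait values.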

How do we know that the variant is real and functional?
• The probability that the variant is real is provided as part of the sequencing quality assessment pipeline.
• The probability that the variant is functional is estimated from population genetics and bioinformatics.

Most functional mutations are under selective pressure even if the trait is not

Probability that a variant is functionally significant given its allele frequency
However, this dependence is not robust with respect to s_0!

“Goldilocks” alleles
• Special case in terms of study design: alleles of large effect that are frequent enough to be followed up individually in a larger population sample.
• Such “goldilocks” alleles are observed in the simulations.
There is no optimal and robust weighting scheme or optimal threshold!

Variable threshold (VT) approach

Variable Threshold (VT) approach
[Figure: z-score plotted against allele frequency threshold, with the maximum marked for the data and for permutations]
z(T) is the z-score of a regression across samples of phenotypes vs. counts of alleles with frequency below threshold T. We maximize z(T) over T. Type I error is controlled by permutations.
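The procedure can be sketched in a few lines. This is an illustration under assumptions rather than the published implementation: the statistic is a correlation scaled by sqrt(n), which may differ from the exact z-score used in the talk; thresholds are taken from the observed allele frequencies and applied as "frequency at or below T"; the test is one-sided; and all function names are hypothetical.

```python
import random
from math import sqrt


def z_like_statistic(burden, phenotypes):
    """Correlation between allele counts and phenotype, scaled by sqrt(n)."""
    n = len(burden)
    mean_b = sum(burden) / n
    mean_y = sum(phenotypes) / n
    sbb = sum((b - mean_b) ** 2 for b in burden)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sbb == 0 or syy == 0:
        return 0.0  # no variation, so no evidence of association
    sby = sum((b - mean_b) * (y - mean_y) for b, y in zip(burden, phenotypes))
    return sqrt(n) * sby / sqrt(sbb * syy)


def vt_statistic(genotypes, freqs, phenotypes):
    """Maximum of z(T) over thresholds T drawn from the observed frequencies."""
    best = float("-inf")
    for threshold in sorted(set(freqs)):
        # Count alleles at variants whose frequency does not exceed T.
        burden = [
            sum(g for g, f in zip(person, freqs) if f <= threshold)
            for person in genotypes
        ]
        best = max(best, z_like_statistic(burden, phenotypes))
    return best


def vt_test(genotypes, freqs, phenotypes, n_perm=1000, seed=0):
    """Return (observed max z, permutation p-value)."""
    rng = random.Random(seed)
    observed = vt_statistic(genotypes, freqs, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        if vt_statistic(genotypes, freqs, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

Because the maximum over T is recomputed for every permutation, the p-value already accounts for having searched over thresholds.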

Allelic age is informative even conditionally on frequency

Intuition behind the effect Allelic age can be measured by LD decay

Bioinformatics predictions

Does the mutation fit the pattern of past evolution?
human VVSTADLCAPSSTKLDER
dog   FVSTSELCAGSTTRLEER
fish  FLSTSELCVPSTLKVNEK
[Figure labels: A, A, V marking a highlighted alignment position]
Statistical issues:
- sequences are related by phylogeny
- generally, we have too few sequences

Does the mutation fit the pattern of past evolution?
• We assume a constant fitness landscape: what is good for fish is good for human!
• We can estimate whether the mutation fits the pattern of amino acid changes.
• We can also estimate the rate of evolution at the amino acid site.
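As a toy illustration of asking whether a substitution fits the pattern of amino acid changes, the sketch below scores a variant by a pseudocount-smoothed log ratio of how often the variant and reference residues appear in one alignment column. It uses the three sequences from the alignment slide. It is deliberately naive: it ignores the phylogenetic relatedness of the sequences, the scoring rule and pseudocount are assumptions, and it is not the scoring used by any particular prediction tool.

```python
from math import log

# Sequences as shown on the alignment slide.
ALIGNMENT = {
    "human": "VVSTADLCAPSSTKLDER",
    "dog": "FVSTSELCAGSTTRLEER",
    "fish": "FLSTSELCVPSTLKVNEK",
}


def column_log_ratio(alignment, position, ref_aa, var_aa, pseudocount=0.5):
    """Log ratio of smoothed counts of variant vs. reference residue.

    position is a 0-based column index. Values near or above 0 mean the
    variant residue is about as common in the column as the reference;
    strongly negative values mean it is rarely or never seen there.
    """
    column = [sequence[position] for sequence in alignment.values()]
    ref_count = column.count(ref_aa)
    var_count = column.count(var_aa)
    return log((var_count + pseudocount) / (ref_count + pseudocount))
```

At column index 8 the human and dog sequences carry A and the fish sequence carries V, so `column_log_ratio(ALIGNMENT, 8, "A", "V")` is less negative than the score for a residue never observed there, such as `"W"`.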

Predictions based on protein structure
• Most pathogenic mutations affect protein stability (good news?).
• ΔΔG is difficult to estimate.
• The unfolded protein response pathway has to be taken into account.
• Heuristic structural parameters help, but less than comparative genomics.

PolyPhen-2
www.genetics.bwh.harvard.edu/pph2
Adzhubei et al., Nature Methods 2010

Compensatory mutations

Incorporation of PolyPhen-2 scores into VT-test
Kumar S et al., Genome Research 2009
We incorporated weights approximating these distributions into the test for alleles with frequency below 1%.
Price, Kryukov et al., AJHG 2010 (accepted)

This is a general approach
• Prediction scores can be easily incorporated into other tests such as WSS, CMC, RVE etc.
• Other available prediction methods include SIFT, Pmut, SNAP, SNPs3D etc.

We are likely to be underpowered to detect the effect of individual genes on traits
• Combining signal from multiple genes can dramatically increase power
• Although we do not know the right pathways, we can attempt constructing them automatically

SNIPE method
http://string.embl.de/

Acknowledgments
The lab: Gregory Kryukov, Alex Shpunt, Adam Kiezun, Ivan Adzhubei, Saurabh Asthana, Victor Spirin, Steffen Schmidt, David Nusinow, Daniel Jordan
HSPH, BWH, MGH: Lee-Jen Wei, Alkes Price, Paul de Bakker, Shaun Purcell