Statistical Power and Meta-analysis
Pak Sham
International Workshop on Statistical Genetic Methods for Human Complex Traits
March 8, 2017

Significance testing evaluates evidence for association
  • Question: Does the genotype at a particular locus have an effect on the phenotype?
  • Calculate a test statistic (e.g. chi-squared) from sample data on phenotype and genotype, then convert the statistic (e.g. by referring to a chi-squared distribution) to a p-value.
  • If the p-value is below a certain critical value (e.g. 0.05 when testing a single hypothesis, 5 × 10⁻⁸ in GWAS), then a “significant” result is reported, meaning that there is evidence for an effect.
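
The conversion from test statistic to p-value can be sketched as follows for the common 1-degree-of-freedom case. This is a minimal illustration, not the workshop's own code: the function name and the example statistic are hypothetical, and it relies on the fact that a 1-df chi-squared variable is the square of a standard normal variable.

```python
import math

def chi2_1df_pvalue(chi2_stat: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    For Z standard normal, P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))

stat = 10.83  # hypothetical chi-squared statistic from one SNP
p = chi2_1df_pvalue(stat)
print(f"chi-squared = {stat}, p = {p:.2e}")
print("significant at 0.05 (single test):", p < 0.05)
print("significant at 5e-8 (GWAS):", p < 5e-8)
```

The same statistic can pass the single-test threshold and still fall far short of the genome-wide one, which is why the critical value has to be stated alongside any claim of significance.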

How is effect size defined?
  • The effect size of a binary (0, 1) factor on a trait can be defined as
      - For a quantitative trait: the mean trait difference between the 2 groups
      - For a dichotomous trait (e.g. disease status): the log(odds ratio) between the 2 groups
  • The odds ratio can be estimated from both cohort and case-control data (whereas the risk ratio can be estimated only from cohort data)
  • A log(odds ratio) of 0 represents no effect
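
As a small worked example of the dichotomous case, the log(odds ratio) can be computed from a 2 × 2 table of counts. The counts below are made up for illustration, the function name is hypothetical, and the standard error uses Woolf's large-sample formula, which is an addition not stated on the slide.

```python
import math

def log_odds_ratio(exposed_cases, exposed_controls,
                   unexposed_cases, unexposed_controls):
    """Return (log odds ratio, standard error) from a 2 x 2 table of counts."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases)
    # Woolf's approximation: variance is the sum of reciprocal cell counts.
    se = math.sqrt(1 / exposed_cases + 1 / exposed_controls
                   + 1 / unexposed_cases + 1 / unexposed_controls)
    return math.log(odds_ratio), se

# Hypothetical case-control counts for carriers vs non-carriers of a factor.
log_or, se = log_odds_ratio(60, 40, 40, 60)
print(f"log(OR) = {log_or:.3f}, SE = {se:.3f}, OR = {math.exp(log_or):.2f}")
```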

Complication: dominance
  • Definition of effect size is complicated for a diploid locus because of the possibility of dominance interaction
      - For example, the effect of the allele inherited from the father may depend on what allele is inherited from the mother.
  • Additive effect is defined as the average effect of an allele, averaged over the possible values of the other allele in the genotype
      - $\text{Effect} = (m_{AA} - m_{Aa})\,P(A) + (m_{Aa} - m_{aa})\,P(a)$
  • Basic association analysis tests for additive effects
  • Further association analyses may test for dominance and epistatic interactions
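
The additive-effect formula can be evaluated directly. The sketch below assumes hypothetical genotype means (m_AA, m_Aa, m_aa) and allele frequency; it only illustrates that, under dominance, the average effect of an allele depends on allele frequency, whereas under a purely additive model it does not.

```python
def average_allele_effect(m_AA, m_Aa, m_aa, p_A):
    """Average effect of substituting allele A for a, averaged over the
    other allele in the genotype (which is A with probability p_A)."""
    p_a = 1.0 - p_A
    return (m_AA - m_Aa) * p_A + (m_Aa - m_aa) * p_a

# Purely additive genotype means: effect is 1 whatever the allele frequency.
print(average_allele_effect(2.0, 1.0, 0.0, p_A=0.3))
# Complete dominance of A (Aa has the same mean as AA): effect depends on p_A.
print(average_allele_effect(1.0, 1.0, 0.0, p_A=0.3))
print(average_allele_effect(1.0, 1.0, 0.0, p_A=0.9))
```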

Two types of errors in significance testing

                              True state
Test outcome                  H0              H1
Not significant               Correct         Type 2 error
Significant (H0 rejected)     Type 1 error    Correct

Statistical power = 1 − Type 2 error probability
  • Type 2 error probability: the probability of not rejecting H0 given that H1 is true
  • Statistical power: the probability of rejecting H0 given that H1 is true
  • But: how to define H1?
      - H1 can range from a tiny to a huge difference from H0 (effect size can range from very close to 0 to very far from 0)
  • The bigger the effect size, the higher the statistical power

Other determinants of statistical power
  • Type 1 error rate: the more stringent (smaller) we set the critical p-value for rejecting H0, the lower the statistical power
  • Sample size: the larger the sample, the higher the statistical power

Importance of adequate statistical power
  • To not miss a real effect
  • To reduce the problem of non-replication of significant findings
  • Two main reasons for non-replication
      - Under-powered replication study (Type 2 error)
      - Original result being a false positive (Type 1 error)
  • Inadequate statistical power contributes to a high false positive report rate (the proportion of significant results that are false positives)

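Both type 1 and type 2 error rates affect the false positive report rate

Of n tests, a proportion π0 have H0 true and 1 − π0 have H1 true; α is the type 1 error rate and β is the type 2 error rate.

True state   Number of tests   Not significant (NS)   Significant (S)
H0           nπ0               A = nπ0(1 − α)         B = nπ0α
H1           n(1 − π0)         C = n(1 − π0)β         D = n(1 − π0)(1 − β)

  • What is the false positive report rate? B/(B+D)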
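
The ratio B/(B+D) can be explored numerically. In this sketch the symbols are assumptions of the example: pi0 is the proportion of tested hypotheses for which H0 is true, alpha the type 1 error rate, and power = 1 − type 2 error rate. The number of tests n cancels out, so it does not appear. All input values are hypothetical.

```python
def false_positive_report_rate(pi0, alpha, power):
    """Proportion of significant results that are false positives: B / (B + D)."""
    false_pos = pi0 * alpha          # B / n: H0 true and test significant
    true_pos = (1.0 - pi0) * power   # D / n: H1 true and test significant
    return false_pos / (false_pos + true_pos)

# Low prior probability of association (1 in 1,000 variants truly associated).
for alpha, power in [(0.05, 0.8), (5e-8, 0.8), (5e-8, 0.05)]:
    fprr = false_positive_report_rate(pi0=0.999, alpha=alpha, power=power)
    print(f"alpha = {alpha:g}, power = {power}: FPRR = {fprr:.6f}")
```

Lowering alpha reduces the rate sharply, and for a fixed alpha lower power pushes it back up, which is the point of the slide.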

Reasons for high false positive report rate
  • Low prior probability of association
      - Appropriate prioritization of variants according to functional annotation may increase the prior probability of association
  • Inadequate control of type 1 error rate
      - Type 1 error rate should be sufficiently stringent to take account of multiple testing
  • Inadequate statistical power
      - Sample size should be large enough for complex disorders where genetic effects are likely to be small

Critical assumption: effect size
  • Problem: usually we do not know what the true effect size is, if an effect is present.
  • Can we make the statistical power higher simply by setting a larger effect size?
  • Unfortunately, setting a larger effect size in a power calculation doesn’t make the true effect size any larger.
  • Critical question: What is a realistic effect size?

How to set the effect size
  • Replication study
      - Effect size of the original study (with downward adjustment for winner’s curse if the original study involved multiple testing)
  • Original study
      - Typical effect sizes found by previous studies of similar phenotypes and similar genetic variants
  • Often desirable to consider a range of plausible effect sizes and present results in tables or graphs

Illustrative sample size plot
[Figure: sample size curves for OR = 1.2, 1.3, 1.5 and 2.0. Wang et al. (2005)]

What’s the winner’s curse?
  • Suppose 100 independent SNPs on a SNP chip have identical allele frequency and effect size, such that each has 1% power to reach genome-wide significance in a particular study
  • The probability that at least one SNP achieves genome-wide significance is $1 - 0.99^{100} \approx 0.63$.
  • The estimated effect size of the most significant SNP is expected to be much greater than its true effect size (i.e. biased upwards)
  • A replication study with identical design and sample size has only a 1% chance of replicating this SNP at the same genome-wide level of significance.
  • Power calculation based on the effect size estimate of the original study will be grossly over-optimistic
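
The numbers in this example can be reproduced with a short calculation. The first part is exactly the slide's arithmetic. The second part is an added illustration under stated assumptions: the test statistic is treated as a normal Z with unit variance, only the upper tail is considered, and the genome-wide threshold is taken as 5 × 10⁻⁸ (two-sided). Under those assumptions the mean of a truncated normal gives the expected Z score among results that reach significance.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Part 1: probability that at least one of 100 SNPs reaches significance.
n_snps = 100
power_per_snp = 0.01
p_at_least_one = 1.0 - (1.0 - power_per_snp) ** n_snps
print(f"P(at least one significant SNP) = {p_at_least_one:.3f}")

# Part 2 (assumption-laden sketch): upward bias among significant results.
alpha = 5e-8
z_crit = -std_normal.inv_cdf(alpha / 2.0)              # critical Z value
# True mean Z that gives 1% power in the upper tail: P(Z > z_crit) = 0.01.
true_mean_z = z_crit + std_normal.inv_cdf(power_per_snp)
gap = z_crit - true_mean_z
# Mean of a unit-variance normal truncated to values above z_crit.
expected_z_if_significant = true_mean_z + std_normal.pdf(gap) / (
    1.0 - std_normal.cdf(gap))

print(f"true mean Z             = {true_mean_z:.2f}")
print(f"E[Z | significant]      = {expected_z_if_significant:.2f}")
print(f"inflation of the effect = {expected_z_if_significant / true_mean_z:.2f}x")
```

Because the standard error is fixed by the design, the inflation of Z is also the inflation of the effect size estimate, which is why a power calculation using the original estimate is over-optimistic.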

Statistical power is related to effect size estimation
  • Statistical power is the probability that the test statistic exceeds a critical value.
  • The (chi-squared) test statistic is approximately
      - (Estimated effect size)² / (Variance of estimate)
  • The expected values of the test statistic under H1 and under H0 differ by
      - (True effect size)² / (Variance of estimate)
  • This quantity is known as the non-centrality parameter (NCP)
  • 1/(Variance of estimate) is known as Fisher’s statistical “information”

Power calculation via NCP
  • Sample size N, effect size e and allele frequency p determine the NCP; the NCP and α determine the power
  • The NCP is linearly related to N, e² and p(1 − p)
  • NCP is often a convenient intermediate step in calculating power
  • Monotonic but non-linear relationship between NCP and power
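
The step from NCP to power can be sketched for a 1-df test without special libraries. The approach used here is an assumption of the example rather than something stated on the slides: a 1-df non-central chi-squared variable with non-centrality NCP is the square of a normal variable with mean sqrt(NCP) and unit variance, so the power is a sum of two normal tail areas. The function name is hypothetical.

```python
from statistics import NormalDist

def power_from_ncp(ncp: float, alpha: float) -> float:
    """Power of a 1-df chi-squared test with given NCP and type 1 error rate."""
    std_normal = NormalDist()
    z_crit = -std_normal.inv_cdf(alpha / 2.0)  # two-sided critical value
    shift = ncp ** 0.5                         # mean of Z under H1
    # P(|Z + shift| > z_crit) for Z standard normal.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for alpha in (0.01, 1e-4, 1e-6, 1e-7):
    powers = [power_from_ncp(ncp, alpha) for ncp in (10, 20, 30, 40)]
    print(f"alpha = {alpha:g}:", ", ".join(f"{pw:.3f}" for pw in powers))
```

With NCP = 0 the function returns alpha, as it should under H0, and for any fixed alpha the power rises monotonically but non-linearly with NCP.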

Power and NCP (df=1)
[Figure: power plotted against NCP for α = 0.01, 0.0001, 0.000001 and 0.0000001]

Example: Quantitative phenotype
  • Linear regression model: Y = α + βX + ε
  • X is genotype, coded as 0, 1, 2
  • H0: β = 0, usually tested by a t-test or F-test
  • In large samples, t ≈ Normal, F ≈ Chi-squared
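
For this regression example the NCP can be written out explicitly under some simplifying assumptions that are not on the slide: genotypes are in Hardy-Weinberg equilibrium, so Var(X) = 2p(1 − p), and the large-sample variance of the estimate of β is (residual variance) / (N Var(X)). The NCP is then N β² 2p(1 − p) / (residual variance). Function name and inputs are hypothetical.

```python
def ncp_quantitative(n, beta, allele_freq, residual_var=1.0):
    """NCP of the 1-df test of beta = 0 in Y = alpha + beta*X + error.

    Assumes Hardy-Weinberg equilibrium, so Var(X) = 2p(1 - p) for genotype
    X coded 0, 1, 2, and Var(beta_hat) = residual_var / (n * Var(X)).
    """
    genotype_var = 2.0 * allele_freq * (1.0 - allele_freq)
    return n * beta ** 2 * genotype_var / residual_var

# Hypothetical study: 10,000 individuals, effect 0.05 SD per allele, p = 0.3.
print(ncp_quantitative(n=10_000, beta=0.05, allele_freq=0.3))
# Linear in N: doubling the sample size doubles the NCP.
print(ncp_quantitative(n=20_000, beta=0.05, allele_freq=0.3))
```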

Power loss from indirect association
  • NCP also simplifies power calculation when the test SNP does not have a direct effect on the phenotype but is in LD with one that does (indirect association)
  • If the LD between the test SNP and the causal SNP has magnitude r², then the NCP at the test SNP is equal to the NCP at the causal SNP attenuated by a factor of r²
  • In other words, the sample size to achieve equivalent power is increased by a factor of 1/r²
Sham et al, Am J Hum Genet 2000
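
The attenuation rule translates into two one-line calculations. The function names and the numbers are hypothetical and serve only to illustrate the scaling.

```python
import math

def ncp_at_test_snp(ncp_causal: float, r2: float) -> float:
    """NCP at a test SNP in LD (r squared = r2) with the causal SNP."""
    return ncp_causal * r2

def sample_size_for_equal_power(n_direct: int, r2: float) -> int:
    """Sample size needed at the test SNP to match the power obtained with
    n_direct individuals when the causal SNP itself is genotyped."""
    return math.ceil(n_direct / r2)

print(ncp_at_test_snp(ncp_causal=30.0, r2=0.8))
print(sample_size_for_equal_power(n_direct=5000, r2=0.8))
print(sample_size_for_equal_power(n_direct=5000, r2=0.5))
```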

Power gain from extreme phenotypic selection
  • Under a polygenic model, selecting individuals with extreme (very low or very high) phenotypic values for genotyping can improve study efficiency
  • $\mathrm{NCP}_S / \mathrm{NCP}_P = \mathrm{Var}_S / \mathrm{Var}_P$

A simple genetic power calculation tool
  • Genetic power calculator: http://pngu.mgh.harvard.edu/~purcell/gpc/
  • Purcell, Cherny and Sham, Bioinformatics, 2003

What is meta-analysis?
  • Literally means analysis of analyses, sometimes known as “quantitative literature review”
  • Multiple studies have tried to answer a question, but none is large enough to provide a definitive answer
  • The studies may collectively contain enough information to provide a definitive answer, if only their data can be combined.
  • However, it may be very laborious (or impossible) to obtain, combine and analyze the raw data from all the studies
  • Fortunately, most of the relevant information in the studies is captured in the usually reported summary statistics, e.g. test statistic, p-value, effect size estimate and its standard error
  • Meta-analysis combines such summary statistics to address the research question

Need for meta-analysis in complex disease genetics
  • Complex disorders are polygenic, with many variants each contributing a small effect
  • Most individual studies are under-powered
  • Meta-analysis of summary statistics from individual studies offers a way of enhancing power and producing robust (i.e. replicable) association results

Inverse-variance method
  • The most common method of meta-analysis is to combine the effect size estimates (e.g. ln OR), weighting each estimate by the inverse of its variance.
  • The overall estimate, b, is given by
      - $b = (b_1/v_1 + b_2/v_2 + b_3/v_3 + \dots) / (1/v_1 + 1/v_2 + 1/v_3 + \dots)$
  • The variance of this overall estimate is given by
      - $v = 1 / (1/v_1 + 1/v_2 + 1/v_3 + \dots)$
  • An overall chi-square test statistic is given by $b^2 / v$
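
The three formulas of the inverse-variance method map directly onto a few lines of code. This is a minimal sketch with a hypothetical function name and made-up study results; the p-value step assumes the overall statistic is referred to a 1-df chi-squared distribution.

```python
import math

def inverse_variance_meta(estimates, variances):
    """Fixed-effect inverse-variance meta-analysis.

    Returns (overall estimate b, its variance v, chi-square b^2/v, p-value).
    """
    weights = [1.0 / v_i for v_i in variances]
    total_weight = sum(weights)
    b = sum(w * b_i for w, b_i in zip(weights, estimates)) / total_weight
    v = 1.0 / total_weight
    chi2 = b * b / v
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # 1-df chi-squared upper tail
    return b, v, chi2, p_value

# Hypothetical ln(OR) estimates and variances from three studies.
b, v, chi2, p = inverse_variance_meta(
    estimates=[0.12, 0.20, 0.05],
    variances=[0.004, 0.010, 0.006],
)
print(f"b = {b:.4f}, SE = {math.sqrt(v):.4f}, chi2 = {chi2:.2f}, p = {p:.2e}")
```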

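Weighted β
  • $\hat{\beta} = \sum_i w_i b_i / \sum_i w_i$, where $w_i = 1/v_i$ is the inverse of the variance of the estimate $b_i$ from study i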

Meta-analysis based on p-values
  • Unfortunately, some studies may report p-values but not effect size estimates
  • One possible solution is to calculate estimated effect sizes (and standard errors) from p-values (this requires knowledge of allele frequencies) and then proceed with the inverse-variance method
  • Another solution is to perform meta-analysis based on the p-values directly

Known direction of effect
  • If, in addition to the p-value, the direction of each effect is available (i.e. whether the variant allele increases or decreases risk, relative to the reference allele), then an approach based on combination of signed normal test statistics is feasible
  • First convert each p-value to a chi-square statistic, using the inverse of the chi-square distribution function
  • Then take the positive square root of each chi-square statistic (Z), and flip the sign to negative if the variant allele decreases risk
  • An overall normal test statistic is obtained by summing the signed test statistics, weighting each statistic by the square root of its sample size, and dividing by the square root of the overall sample size
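
The signed-Z procedure can be sketched as below. The function name and inputs are hypothetical. The sketch converts each two-sided p-value directly to |Z| (equivalent to taking the square root of the 1-df chi-square), and it divides the weighted sum by the square root of the total sample size, which is the normalization that makes the overall statistic standard normal under H0.

```python
import math
from statistics import NormalDist

def weighted_z_meta(p_values, directions, sample_sizes):
    """Sample-size-weighted Z meta-analysis.

    p_values: two-sided p-values; directions: +1 if the variant allele
    increases risk, -1 if it decreases risk; sample_sizes: study sizes.
    Returns (overall Z, overall two-sided p-value).
    """
    std_normal = NormalDist()
    signed_z = []
    for p, direction in zip(p_values, directions):
        z = -std_normal.inv_cdf(p / 2.0)   # positive root of the chi-square
        signed_z.append(direction * z)
    numerator = sum(math.sqrt(n) * z for n, z in zip(sample_sizes, signed_z))
    z_overall = numerator / math.sqrt(sum(sample_sizes))
    p_overall = math.erfc(abs(z_overall) / math.sqrt(2.0))
    return z_overall, p_overall

# Three hypothetical studies; the third reports the opposite direction.
z, p = weighted_z_meta(
    p_values=[0.01, 0.20, 0.03],
    directions=[+1, +1, -1],
    sample_sizes=[2000, 1000, 500],
)
print(f"Z = {z:.3f}, p = {p:.4f}")
```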

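Weighted Z
  • $Z = \sum_i w_i Z_i / \sqrt{\sum_i w_i^2}$, where $w_i = \sqrt{N_i}$ and $N_i$ is the sample size of study i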

Unknown direction of effect
  • Fisher’s method: sum of χ²’s
  • Correlation of p-values from the two methods ~ 0.99
  • Chi-squared statistics can be weighted by sample size
  • Some expected power loss compared to the inverse-variance or weighted Z method
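
Fisher's method in its standard, unweighted form can be sketched with the standard library alone. Each term −2 ln(p) is a chi-squared variable with 2 degrees of freedom under H0, so the sum over k independent studies has 2k degrees of freedom, and for even degrees of freedom the upper tail has a closed form. The function name and p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: returns (statistic, combined p-value).

    statistic = -2 * sum(ln p_i), chi-squared with 2k df under H0.
    For even df = 2k: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, math.exp(-half) * series

stat, p = fisher_combined_p([0.01, 0.20, 0.03])
print(f"statistic = {stat:.2f} on 6 df, combined p = {p:.4f}")
```

Because direction is ignored, studies with opposite effects reinforce rather than cancel each other, which is one way to see why this approach behaves differently from the signed methods.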

Why not random effects?
  • The above methods represent “fixed effect” (FE) meta-analysis, which assumes a single effect size β underlying all the data
  • Random effects (RE) meta-analysis allows β to differ in different studies / populations
  • Since variation in β is likely to exist, why is FE meta-analysis generally preferred in genetics?
      - H0: β = 0 for all populations
      - H0: E(β) = 0 across populations
  • The first H0 is more appropriate, but the RE model is designed to test the second H0
Han & Eskin, 2011, AJHG

Some practical issues
  • Most of the work in a meta-analysis is the collection and preparation of the summary statistics in a form that can be combined
  • For the summary statistics to be comparable, it is necessary, as far as possible, to ensure:
      - Uniform phenotype definition
      - Common set of SNPs (by imputation if necessary)
      - Consistent calling of the alleles (no flipping)
      - Same coding scheme for genotypes (e.g. 0, 1, 2)
      - Uniform analysis method, e.g. logistic regression allowing for principal components

Unambiguous alleles
ATCTGGT[A/C]CTCCAT
TAGACCA[T/G]GAGGTA
  • A is equivalent to T
  • C is equivalent to G
  • No ambiguity across datasets, even if some studies label the two alleles as A/C and other studies label them as T/G

Ambiguous alleles
  • An annoying problem:
ATCTGGT[A/T]CTCCAT
TAGACCA[T/A]GAGGTA
  • Allele A in one study may be labeled as T in another
  • G/C SNPs have the same problem

Resolving ambiguous alleles
  • Two ambiguous SNP types
      - A/T and G/C
  • Flip alleles if the probe sequence is complementary to the reference sequence: http://www.well.ox.ac.uk/~wrayner/strand/
  • Allele flipping is suspected if there is heterogeneity in
      - allele frequency: e.g. the “same” allele having frequency 0.1 in some datasets but 0.9 in others
      - linkage disequilibrium: the “same” allele being in positive LD with a nearby allele in some datasets, but negative LD with the same nearby allele in others
      - effect size: the “same” allele having a positive effect in some datasets but a negative effect in others
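
The distinction between unambiguous and ambiguous SNPs can be expressed as a small allele-harmonization check. This is an illustrative sketch only: the function names and return labels are hypothetical, alleles are given as (effect allele, other allele) pairs, and ambiguous A/T and G/C SNPs are simply flagged for follow-up (for example by comparing allele frequencies) rather than resolved.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """True for A/T and G/C SNPs, whose alleles are each other's complement."""
    return COMPLEMENT[allele1] == allele2

def harmonize(reference: tuple, study: tuple) -> str:
    """Say how a study's (effect, other) alleles relate to the reference pair."""
    if is_strand_ambiguous(*reference):
        return "ambiguous: check allele frequency, LD or effect direction"
    if study == reference:
        return "match"
    if study == reference[::-1]:
        return "swap effect allele (reverse sign of effect)"
    flipped = tuple(COMPLEMENT[a] for a in study)  # read on the other strand
    if flipped == reference:
        return "flip strand"
    if flipped == reference[::-1]:
        return "flip strand and swap effect allele (reverse sign of effect)"
    return "mismatch: alleles are not compatible"

print(harmonize(("A", "C"), ("T", "G")))   # same SNP on the other strand
print(harmonize(("A", "C"), ("C", "A")))   # same strand, other allele coded
print(harmonize(("A", "T"), ("T", "A")))   # cannot tell flip from swap
```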

Formal heterogeneity test: Cochran’s Q
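
Cochran's Q is conventionally defined as the weighted sum of squared deviations of the study estimates from the inverse-variance pooled estimate, Q = Σ w_i (b_i − b)² with w_i = 1/v_i; under homogeneity it is approximately chi-squared with k − 1 degrees of freedom for k studies. The sketch below uses that standard definition with a hypothetical function name and made-up inputs. The I² summary it also returns, max(0, (Q − df)/Q), is an addition that is not mentioned on the slide.

```python
def cochran_q(estimates, variances):
    """Return (Q, degrees of freedom, I-squared) for a set of study estimates."""
    weights = [1.0 / v_i for v_i in variances]
    pooled = sum(w * b_i for w, b_i in zip(weights, estimates)) / sum(weights)
    q = sum(w * (b_i - pooled) ** 2 for w, b_i in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared

# Hypothetical ln(OR) estimates; the third study has the opposite sign,
# as might happen if its alleles were flipped.
q, df, i2 = cochran_q(
    estimates=[0.12, 0.20, -0.15],
    variances=[0.004, 0.010, 0.006],
)
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.2f}")
```

A large Q relative to its degrees of freedom flags the SNP for the checks listed under allele flipping before the pooled estimate is trusted.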