BIOS 501 Lecture 7: Hypothesis Testing
Roderick Little
UNIVERSITY OF MICHIGAN SCHOOL OF PUBLIC HEALTH

Hypothesis testing
o A hypothesis test is a common alternative to CIs for making inferences about population quantities.
  - Confidence interval: a set of values of a population parameter consistent with the data.
  - Hypothesis test: assesses the consistency of the data with a particular null value of the parameter.
o For example, for inference about a mean:
  - CI: the set of values of the mean consistent with the data.
  - Hypothesis test: are the data consistent with a particular value of the mean?
o Often the null value corresponds to “no difference” or “no association”.

Hypothesis testing
o For a hypothesis test, a scientific hypothesis is converted into a null hypothesis about the value or values of one or more parameters.
o The key output of a hypothesis test is a P-value between 0 and 1 that measures whether the observed data are consistent with the null hypothesis.
  - A small P-value (say less than 0.05) indicates evidence against the null: either the null hypothesis is false or an unlikely event has occurred. The null hypothesis is “rejected”.
  - A large P-value indicates lack of evidence against the null. The null hypothesis is “accepted”, or more precisely, “not rejected”.
o Important: “accepting” the null hypothesis does not imply that the null hypothesis is true, only that the data do not contradict it.

Elements of a hypothesis test
o A scientific hypothesis, e.g. “new treatment is better than old treatment”.
o An associated null hypothesis $H_0$, capable of being addressed by a statistical test. The null hypothesis is often counter to the scientific hypothesis, e.g. “the average difference in outcomes between treatments is zero”.
o An alternative hypothesis $H_a$: legitimate values of the parameter if $H_0$ is not true.
o A test statistic T computed from the data, which (a) has a known distribution if the null hypothesis is true and (b) provides information about the truth of the null hypothesis.
o The P-value for the test is the probability, computed assuming $H_0$ is true, of a value of T at least as extreme as the one observed.
o Small P-values are evidence against the null hypothesis.

More on P-Value

Strength of evidence against null
As measures of statistical evidence, we can informally divide P-values into classes, as follows:
  - P < 0.01: strong evidence against null
  - 0.01 < P < 0.05: moderate evidence against null
  - 0.05 < P < 0.1: at best marginal evidence against null
  - P > 0.1: data consistent with null; different values above 0.1 (e.g. 0.2, 0.7) have little impact on conclusions
Note that smaller deviations from the null can be detected with larger sample sizes, so the P-value is strongly dependent on sample size; it is not a good measure of the size of the effect.

Significance level
o A common convention is to set a cut-off value $\alpha$, and formally
  - “reject” the null hypothesis if $P \le \alpha$,
  - “accept” the null hypothesis if $P > \alpha$.
o The cut-off $\alpha$ is called the “significance level”, “size” or “type 1 error” of the test, and has the property that $\Pr(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha$.
o The choice of significance level is somewhat arbitrary; a typical value by convention is 0.05.

More on significance level
o The choice $\alpha$ = 0.05 is so pervasive that some journals require a significant result at the 5% level to demonstrate that an apparent effect is real and not due to chance, and hence merits publication.
o Any choice of cut-off is arbitrary, and P = 0.049 is not substantively different from P = 0.051. So in a sense it is more informative to just report the P-value.
o It is a bad idea to publish only statistically significant results, since this leads to publication bias:
  - for interpretation we need to know about negative studies too
  - journals should report results from methodologically sound studies that address important questions, whether or not significant

Examples from Creagan et al (1979)
o “There was no significant difference between the (Vit C and placebo) groups (log-rank test; P=0.61)”
o “The 27 patients who did not have treatment had significantly worse (log-rank test; P=0.017) survival than the 123 patients who did receive the medication”
o The log-rank test has the null hypothesis that the two groups compared have the same survival distributions.

Example: coin tossing
Scientific hypothesis: the coin is fair, that is, equally likely to come up heads or tails.
Data: toss the coin n times; record x = number of heads. (Medical analogy: cross-over design with two treatments A and B; head = A better than B, tail = B better than A, discard ties. Fair coin = A and B equally effective.)
Null hypothesis: Pr(Head) = 0.5
Alternative hypothesis: Pr(Head) < 0.5 or Pr(Head) > 0.5 (two-sided)
Test statistic: number of heads, x. Values of x “close” to n/2 are consistent with the null; values far from n/2 are not.
P-value: probability, under the null, of values equal to or more extreme (i.e. at least as far from n/2) than the observed value x. E.g. for n = 16, x = 5: P-value = Pr(x=0) + Pr(x=1) + ... + Pr(x=5) + Pr(x=11) + Pr(x=12) + ... + Pr(x=16) = 0.21. (Details omitted; uses the binomial distribution.)
Conclusion: since P-value > 0.05, the null hypothesis is “accepted” at the 0.05 significance level. No evidence that the coin is not fair.
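
The binomial details omitted from the coin-tossing example can be sketched in a few lines of Python. This is only an illustration: the function name `two_sided_binomial_p` is made up here, and the calculation assumes the null value Pr(Head) = 0.5, so every sequence of n tosses has probability 1/2^n.

```python
from math import comb


def two_sided_binomial_p(x, n):
    """Exact two-sided P-value for the null hypothesis Pr(Head) = 0.5.

    Adds up the null probabilities of every head count that is at least
    as far from n/2 as the observed count x.
    """
    observed_distance = abs(x - n / 2)
    # Count the toss sequences whose head count is as or more extreme.
    extreme_sequences = sum(
        comb(n, k) for k in range(n + 1)
        if abs(k - n / 2) >= observed_distance
    )
    return extreme_sequences / 2 ** n


p_value = two_sided_binomial_p(5, 16)
print(round(p_value, 2))  # 0.21
```

For n = 16 and x = 5 the extreme outcomes are 0 to 5 and 11 to 16 heads, matching the sum written out in the example.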

Clinical and statistical significance
The P-value concerns statistical significance, that is, whether an observed effect could plausibly be due to chance alone. It is potentially very misleading as a measure of clinical significance, that is, the clinical importance of an observed effect.
In the coin-tossing example, the result x = 5, n = 16 is statistically insignificant (P = 0.21) for testing the null hypothesis Pr(Head) = 0.5, although the observed proportion of heads (5/16 = 0.31) is substantially different from the null value of one half.
On the other hand, the result x = 750, n = 1600 is statistically significant (P ≈ 0.01), although the observed proportion of heads (750/1600 = 0.47) is quite close to the null value 0.5, and the difference may not be regarded as clinically significant.
Here is a real example of confusion between clinical and statistical significance …
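
The contrast between the two coin-tossing results can be checked directly. The sketch below is an illustration, not part of the lecture: the helper `exact_two_sided_p` is a hypothetical name, and it relies on the symmetry of the binomial distribution under Pr(Head) = 0.5, so the two-sided P-value is twice the smaller tail (capped at 1).

```python
from math import comb


def exact_two_sided_p(x, n):
    """Two-sided exact binomial P-value under Pr(Head) = 0.5.

    Under the null the distribution of the head count is symmetric
    about n/2, so both tails have the same probability.
    """
    tail_end = min(x, n - x)
    # Integer count of toss sequences in the lower tail, then one division.
    tail_sequences = sum(comb(n, k) for k in range(tail_end + 1))
    return min(1.0, 2 * tail_sequences / 2 ** n)


results = [(5, 16), (750, 1600)]
for x, n in results:
    print(f"x={x}, n={n}: proportion={x / n:.2f}, P={exact_two_sided_p(x, n):.3f}")
```

The first result sits far from one half but rests on few tosses; the second sits close to one half but rests on many, which is what drives its small P-value.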

Can warfarin be continued during dental extraction? Results of a randomized controlled trial
o I. L. Evans, M. S. Sayers, A. J. Gibbons, G. Price, H. Snooks, A. W. Sugar. Brit. J. Oral & Maxillofacial Surgery (2002) 40, 248–252
o SUMMARY. A randomized controlled trial was set up to investigate whether patients who were taking warfarin and had an International Normalised Ratio (INR) within the normal therapeutic range require cessation of their anticoagulation drugs before dental extractions. Of 109 patients who completed the trial, 52 were allocated to the control group (warfarin stopped 2 days before extraction) and 57 patients were allocated to the intervention group (warfarin continued). The incidence of bleeding complications in the intervention group was higher (15/57, 26%) than in the control group (7/52, 14%) but this difference was not significant… we found no evidence of an increase in clinically important bleeding. As there are risks associated with stopping warfarin, the practice of routinely discontinuing it before dental extractions should be reconsidered.

Clinical vs statistical significance
o “Incidence of bleeding complications in the intervention group was higher (15/57, 26%) than in the control group (7/52, 14%) but this difference was not significant... we found no evidence of an increase in clinically important bleeding.”
  - Is 26% vs 14% clinically significant?
  - 95% confidence interval for difference in proportions = (0, 0.28) (details on this CI later)
  - Study seems underpowered (sample size too small), a common problem in clinical trials
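
As a rough check on the interval quoted for the warfarin trial, the sketch below computes a large-sample (Wald) 95% interval for the difference in bleeding proportions. The choice of the Wald method is an assumption made here for illustration; the lecture defers the details, and its interval may come from a different method or rounding. The function name `wald_ci_diff` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_diff(x1, n1, x2, n2, level=0.95):
    """Large-sample (Wald) confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference of two independent proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Normal quantile for a two-sided interval, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Intervention (warfarin continued): 15/57; control (warfarin stopped): 7/52.
low, high = wald_ci_diff(15, 57, 7, 52)
print(f"{low:.2f} to {high:.2f}")
```

Under this approximation the lower limit falls just below zero and the upper limit is about 0.28, so the data are compatible both with no increase in bleeding and with a sizeable one.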

T-Test for a Mean of a Single Sample

Single sample t test: notes
o If the data are normally distributed and $H_0$ is true, the test statistic is t distributed with n - 1 degrees of freedom. As for CIs, the normal assumption is not critical if n is large.
o If n is small and the values are not normal, the P-value will be distorted. One possibility in such cases is to transform the data to look more normal (e.g. take logarithms). Another possibility is to use a nonparametric test that does not rely on normality (discussed later).
o Since only certain percentiles of the t distribution are tabulated, computation of the P-value from tables requires interpolation. An acceptable alternative is to report an interval for P, e.g. 0.01 < P < 0.05 or P > 0.1.
o As for CIs, normal tables can be used instead of t tables if n is large (say n > 30).
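
For concreteness, here is a minimal sketch of the single-sample test statistic, assuming the usual form t = (sample mean - null mean) / (s / sqrt(n)), which is not spelled out in the text above. The function name `one_sample_t` and the data values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(values)
    # stdev() is the sample standard deviation (divisor n - 1).
    standard_error = stdev(values) / sqrt(n)
    t = (mean(values) - mu0) / standard_error
    return t, n - 1


# Hypothetical measurements, tested against a null mean of 5.0.
sample = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.0, 5.5]
t_stat, df = one_sample_t(sample, mu0=5.0)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The resulting statistic would then be compared with percentiles of the t distribution with n - 1 degrees of freedom, or with normal percentiles when n is large.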

Example: Cameron and Pauling article, Stomach Cancer, Test Cases

One sided vs Two sided tests
o The alternative hypothesis describes the set of values of the parameter entertained when the null hypothesis is false.
o Null hypothesis: $H_0: \mu = \mu_0$
o Two sided alternative: $H_a: \mu \ne \mu_0$
  - Values on both sides of the null are included
o One sided alternative: $H_a: \mu > \mu_0$
  - Values on one side (here greater) are included
o For symmetric null distributions, a one sided test often reduces the P-value by a factor of 2, since the area in only one tail of the distribution is counted.
o Opinions differ on when to use 1-sided vs 2-sided tests. In general, I prefer 2-sided tests unless values on one side are really impossible.
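
The factor of 2 can be seen with a statistic that is standard normal under the null, used here only as a convenient symmetric null distribution. The function names and the value z = 1.8 are hypothetical choices for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def upper_one_sided_p(z):
    """P-value against a 'greater than' alternative: area above z."""
    return 1 - STANDARD_NORMAL.cdf(z)


def two_sided_p(z):
    """P-value against a two-sided alternative: area in both tails."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


z = 1.8  # hypothetical observed statistic, in the direction of the alternative
print(f"one-sided: {upper_one_sided_p(z):.3f}, two-sided: {two_sided_p(z):.3f}")

# The halving only applies when the data point in the hypothesized
# direction; for z = -1.8 the upper one-sided P-value is large.
print(f"one-sided for z = -1.8: {upper_one_sided_p(-1.8):.3f}")
```

This is why the text says the one sided test "often" halves the P-value rather than always.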

Example: Cameron and Pauling article, Stomach Cancer, Test Cases, 1-sided test

Relationship Between Tests and CIs
o CI = range of values of the parameter that are consistent with the data, or the range of null values of the parameter that would not be rejected in a significance test.
o One might conduct a significance test by seeing if the CI includes the null value. For the t test this yields the same answer as the two-sided test.
o Formally, it can be shown that the two-sided test with significance level $\alpha$ accepts (rejects) $H_0: \mu = \mu_0$ if and only if the confidence interval with confidence coefficient $(1 - \alpha)$ includes (excludes) $\mu_0$.
o A similar relationship exists between t tests and confidence intervals in t inference about other parameters. For other tests, this relationship is only approximate.
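
The equivalence for the single-sample t test can be written out in one chain. The notation is an assumption made for this sketch: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t_{n-1,\,1-\alpha/2}$ the upper $\alpha/2$ percentile of the t distribution with $n-1$ degrees of freedom.

```latex
\begin{align*}
\text{accept } H_0:\ \mu = \mu_0 \text{ at level } \alpha
  &\iff \left| \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \right| \le t_{n-1,\,1-\alpha/2} \\
  &\iff \bar{x} - t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}
        \;\le\; \mu_0 \;\le\;
        \bar{x} + t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}} \\
  &\iff \mu_0 \text{ lies in the } 100(1-\alpha)\% \text{ confidence interval for } \mu .
\end{align*}
```

Each step only rearranges the same inequality, which is why the test and the interval always agree in this case.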

CP Data. Stomach Cancer Test Cases
o Given that the t test can be derived from the CI by the relationship noted above, why bother with tests at all?
o Tests only require knowledge of the distribution of the test statistic when the null hypothesis is true. For some problems this enables tests to be constructed when CIs are much harder to construct.
o The P-value is more easily found by the test.

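Power and Type II Error of a Test
o A type II error is failing to reject the null hypothesis when the alternative is true; power is the probability of rejecting the null hypothesis when the alternative is true, that is, 1 minus the probability of a type II error.
o Power is a measure of information for a test, analogous to the width of a CI.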

Power calculations
o Often studies are designed so that a test of the null hypothesis yields a pre-specified power (e.g. 80%) for detecting a particular size of effect (the value under the alternative hypothesis) for a test with pre-specified size (e.g. 5%).
o That is, find a sample size n such that, for a test of size $\alpha$, the probability of rejecting $H_0$ when the effect equals the value under the alternative is the pre-specified power.
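
One common version of this calculation, shown here as a sketch under stated assumptions rather than as the method used in the lecture, is the normal-approximation sample size for a two-sided test of a single mean with known standard deviation: n = ((z for size + z for power) * sigma / delta)^2, rounded up. The names `sample_size_one_mean`, `delta` (effect size to detect) and `sigma` (standard deviation) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_mean(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-sided z test of one mean.

    Assumes a known standard deviation sigma and ignores the tiny
    chance of rejecting in the tail opposite to the true effect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 5%
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80%
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)


# Detect a shift of half a standard deviation with 80% power at the 5% level.
print(sample_size_one_mean(delta=0.5, sigma=1.0))  # 32
```

Halving the effect size to be detected roughly quadruples the required sample size, since n scales with 1/delta^2.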

Conclusion
o Hypothesis test of the null value of a parameter:
  - P-value small: evidence against the null
  - P-value large: no evidence against the null
o Note: a large P-value does not mean the null hypothesis is true, just that the data do not contradict it!