2021 09 19 Biostatistics for the biomedical profession

Biostatistics for the biomedical profession, Lecture III, BIMM 18. Karin Källen & Linda Hartman, September 2015

Today
• Repetition
  • Lecture 1: summary measures and graphical methods
  • Lecture 2: Normal distribution, generalisation, confidence interval, reference interval, t-test, ANOVA
• Paired samples t-test
• Non-parametric tests
  • Mann-Whitney’s test
  • Kruskal-Wallis’ test
  • Wilcoxon signed rank test

Repetition – the normal distribution
The (perfect) normal distribution:
• The mean, median, and mode all have the same value
• The curve is symmetric around the mean; the skewness and the excess kurtosis are 0
• The curve approaches the X-axis asymptotically
• Mean ± 1 SD covers 2∙34.1% = 68.2% of data
• Mean ± 2 SD covers approximately 2∙47.5% = 95% of data
• Mean ± 3 SD covers 99.7% of data
• Exercise: What proportion of babies will have a head circumference between -1 SD and +1 SD (= z-score -1 to +1)?
• 68%
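The coverage figures above can be checked numerically. The following is a minimal sketch, assuming a perfect normal distribution; the helper name `coverage_within` is made up for this illustration and is not part of the lecture material.

```python
import math


def coverage_within(k: float) -> float:
    """Proportion of a normal distribution lying within mean ± k SD."""
    # For a standard normal Z, P(-k < Z < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2, 3):
    print(f"mean ± {k} SD covers {coverage_within(k):.1%} of the data")
```

The run with k = 1.96 shows why 1.96 rather than 2 is used when exactly 95% is wanted.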

Output from SPSS: Histogram for birth weight (in the ’births’ data set)
Exercise:
1. Are the data normally distributed?
2. Decide the limits between which 95% of all birth weights will be found.
3. Compute a 95% confidence interval for the mean

Birth weights, cont.

Birth weights, cont.
Tests of Normality (SPSS output for Birth_weight)

Test                     Statistic   df    Sig.
Kolmogorov-Smirnov (a)   0.040       262   0.200*
Shapiro-Wilk             0.992       262   0.147

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Two different methods to test for normal distribution. The p-value is the probability of seeing a deviation from normality at least as large as the observed one if the data really were normally distributed; a high p-value therefore gives no reason to doubt normality.

Exercise:
1. Are the data normally distributed? Yes, no reason to doubt
2. Decide the limits between which 95% of all birth weights will be found. Clue: 95% of all data will lie between -1.96 SD and +1.96 SD.
Mean: 3539 g, SD: 542 g, N: 262
95% Reference interval:
Lower limit: 3539 - 1.96*542 ≈ 2477 g
Upper limit: 3539 + 1.96*542 ≈ 4601 g
3. Compute a 95% confidence interval for the mean. Clue: 95% CI: Mean ± 1.96*SEM, where SEM = s/√n
SEM = 542/√262 ≈ 33.5
95% Confidence interval:
Lower limit: 3539 - 1.96*33.5 ≈ 3473
Upper limit: 3539 + 1.96*33.5 ≈ 3605
Mean with 95% CI: 3539 (3473-3605)
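The arithmetic in this exercise can be reproduced in a few lines. This is only a sketch using the summary values quoted on the slide (mean 3539 g, SD 542 g, n = 262) and the large-sample constant 1.96; the variable names are chosen for this illustration.

```python
import math

# Summary statistics quoted on the slide (birth weight, grams).
mean, sd, n = 3539.0, 542.0, 262
z = 1.96  # large-sample constant for 95%

# Reference interval: where 95% of individual birth weights are expected.
ref_low, ref_high = mean - z * sd, mean + z * sd

# Confidence interval: where the true mean is expected to lie.
sem = sd / math.sqrt(n)
ci_low, ci_high = mean - z * sem, mean + z * sem

print(f"95% reference interval: {ref_low:.0f} to {ref_high:.0f} g")
print(f"SEM: {sem:.1f} g")
print(f"95% CI for the mean: {ci_low:.0f} to {ci_high:.0f} g")
```

The only difference between the two intervals is whether the SD or the SEM is multiplied by 1.96.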

But… What to do if data do not follow a normal distribution?
1. How to perform descriptive statistics
2. How to compare the results between two samples?

Output from SPSS: Maternal BMI (kg/m²). Exercise: Could a normal distribution be assumed?

Maternal BMI, cont.

Maternal BMI, cont.
1. Exercise: Could a normal distribution be assumed? No!!!
2. Which measurements should be used to produce descriptive statistics? Median, inter-quartile range, histogram, box plot etc.
3. Under which circumstances could it be possible to nevertheless compare the means with a t-test? We will repeat and learn more about the SEM (standard error of the mean)

Repetition – Central Limit Theorem
• The mean has an approximately normal distribution if
  • the no. of observations is large (faster if the distribution is symmetric)
  • the observations are independent
  • the observations come from the same distribution
We can often use the normal distribution to test a difference in means, even if the observations are not normal
[Figure: 10000 samples of mean values from dice-rolls, based on 10 rolls and on 100 rolls]
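The dice-roll illustration can be imitated with a small simulation. The sketch below assumes a fair six-sided die and a fixed random seed; the function name `sample_means` and its parameters are invented for this example and do not come from the slides.

```python
import math
import random
import statistics


def sample_means(n_rolls: int, n_samples: int = 10_000, seed: int = 1) -> list[float]:
    """Mean of n_rolls fair dice rolls, repeated n_samples times."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.randint(1, 6) for _ in range(n_rolls)])
        for _ in range(n_samples)
    ]


die_sd = math.sqrt(35 / 12)  # SD of a single fair die roll

for n_rolls in (10, 100):
    means = sample_means(n_rolls)
    print(
        f"{n_rolls} rolls: mean of means = {statistics.fmean(means):.2f}, "
        f"SD of means = {statistics.stdev(means):.3f}, "
        f"s/sqrt(n) = {die_sd / math.sqrt(n_rolls):.3f}"
    )
```

Although a single roll is uniformly distributed, the spread of the means shrinks roughly as s/√n, which is the SEM.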

Repetition: Normal distribution
Mean 420, Std 400, Median 280, QL-QU 130-550, 2.5-97.5 percentile 70-1600
Exercise: Symmetric or asymmetric? Mean or median to describe data? Use the median!
Example from Björk, Praktisk statistik för medicin och hälsa

Normal distribution: Each mean based on samples with n=10
10 observations of CB-153 were sampled 1000 times and the mean (of the 10 observations) was calculated. Histogram of the 1000 means.
[Figure panels: Original dataset, N=10, N=20, N=50, N=100]
Original dataset: Mean 420, Std 400, Median 280, QL-QU 130-550, 2.5-97.5 percentile 70-1600

Exercise - CLT
[Figure panels: Original dataset, N=10, N=20, N=50, N=100]
Original dataset: Mean 420, Std 400, Median 280, QL-QU 130-550, 2.5-97.5 percentile 70-1600
SEM = s/√n

Normal distribution: SEM = s/√n
Original dataset: Mean 420, Std 400, Median 280, QL-QU 130-550, 2.5-97.5 percentile 70-1600
Mean = 420 in each panel; SEM = 126 (N=10), 89 (N=20), 57 (N=50), 40 (N=100)
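The SEM values for the four sample sizes follow directly from SEM = s/√n with s = 400. A short sketch (the function name `sem` is just for this illustration):

```python
import math


def sem(s: float, n: int) -> float:
    """Standard error of the mean for standard deviation s and sample size n."""
    return s / math.sqrt(n)


s = 400  # standard deviation of the original dataset, from the slide
for n in (10, 20, 50, 100):
    print(f"N={n}: SEM = {sem(s, n):.0f}")
```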

Maternal BMI, cont.
1. Exercise: Could a normal distribution be assumed? No!!!
2. Under which circumstances could it be possible to nevertheless compare the means with a t-test? If the samples are large enough (at least approx. 60-100), we could nevertheless compare the means.

Repetition:
• Normal distribution
• Estimate of mean: Confidence interval
• Hypothesis testing
• Comparison of means:
  • Conf int for difference in means (2 groups)
  • T-test (2 groups)
  • ANOVA (> 2 groups)

Confidence interval
• A confidence interval tells us within which interval the ’true’ value of a parameter probably lies
• E.g., a 95% confidence interval tells us between which limits the ’true’ value (with 95% confidence) lies.
• Repetition: 95% of the data will lie within mean ± 2 SD (1.96 exactly).
• A 95% CI could be constructed (large samples): (mean - 1.96*SEM to mean + 1.96*SEM)

Confidence interval
• Confidence level 95% = 100% - 5%
• i.e. 5% = 1/20 of the intervals (produced in the same way) will not cover the true value!
[Figure: repeated confidence intervals plotted against the true value]
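The statement that about 1 in 20 intervals misses the true value can be illustrated by simulation. The sketch below assumes normally distributed data with made-up population values (mean 3500, SD 550) and uses the large-sample constant 1.96; the function name `coverage` and all parameter values are assumptions for this illustration, not from the lecture.

```python
import math
import random
import statistics


def coverage(n: int = 100, n_studies: int = 2000, mu: float = 3500.0,
             sigma: float = 550.0, seed: int = 1) -> float:
    """Fraction of simulated 95% CIs (mean ± 1.96*SEM) that cover the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = statistics.fmean(sample)
        sem = statistics.stdev(sample) / math.sqrt(n)
        if m - 1.96 * sem <= mu <= m + 1.96 * sem:
            hits += 1
    return hits / n_studies


print(f"Share of intervals covering the true mean: {coverage():.1%}")
```

With samples of 100 the share lands close to 95%; it would be somewhat lower for very small samples unless the t-constant replaces 1.96.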

Confidence intervals and reference intervals
• A 95% confidence interval tells us between which limits the ’true’ value of the mean (with 95% confidence) lies: (mean - 1.96*SEM to mean + 1.96*SEM)
• A reference interval reflects the interval within which 95% of the population (or values) lies. Example: Lower limit: mean - 1.96*s, Upper limit: mean + 1.96*s

Confidence interval vs Reference interval. The graphs are based on approx. 1 000 births.

The distribution of birth weight in two samples (n=100 and n=1000, respectively).
n=100: m=3477 g, s=555 g
n=1000: m=3507 g, s=580 g
Exercise: The variance seems to be larger in the larger sample. Is that remarkable?

Exercise: Which of the following statements are true?
1. The larger the investigation, the narrower is the reference interval
2. The sample mean is always within the limits of the confidence interval
3. The population (true) mean is always within the limits of the confidence interval
4. The larger the investigation, the wider is the confidence interval
5. A confidence interval with 99% confidence level is always wider than the corresponding confidence interval with 95% confidence level
Answer: 2 and 5 are correct

T-distribution
n - 1 = ”degrees of freedom”

Degrees of freedom   t-constant for 95% CI
5                    2.57
9                    2.26
19                   2.09
29                   2.05
49                   2.01
99                   1.98
∞                    1.96

The quantiles for T could be looked up in tables… but are rather produced by computer programs such as SPSS (built into t-tests and CIs)

T-test for two independent samples (groups) - Example
• Birth weight
• Two groups: A: Smokers, B: Non-smokers

Repetition:
• Normal distribution
• Estimate of mean: Confidence interval
• Hypothesis testing
• Comparison of means:
  • T-test/Conf int for difference in means (2 groups)
  • ANOVA (> 2 groups)

T-test: Assumptions
1. The mean is a relevant summary measure
2. Independent observations (e.g. no patient contributes more than one observation)
3. Observations are normally distributed OR both groups are large

T-test Example: Birth weight, Descriptives

Test procedure for t-test
• Test variable: D = Mean in group B - Mean in group A
• H0: D = 0, Mean in group A = Mean in group B
• H1: D ≠ 0, Mean in group A ≠ Mean in group B
• Construct a confidence interval for D and/or calculate the p-value

Confidence interval - General formula: estimate ± c * SE, where c is the constant for the chosen confidence level

T-test for two independent groups (cont.)
• D = 3586 - 3432 = 154 g
• n_A = 183, n_B = 66
• For now, we assume that the standard deviation is the same in both group A and B
• Base the analysis on a weighted (”pooled”) standard deviation s_Pooled
• 95% CI for D: D ± c * SE

T-test for two independent groups (cont.)
• s_A = 554, s_B = 478, s_Pooled = 535
• Use constant c for 95% confidence level (5% risk level) with n_A - 1 + n_B - 1 = 247 degrees of freedom: c ≈ 1.97 (obtained from statistical table for the t-distribution)

T-test for two independent groups (cont.) The computer makes the calculations for us… But we have to interpret the results!

• D ± c * SE = 154 ± 1.97 * 77 = 154 ± 151
• 95% CI for the mean difference in birth weight is 3 - 305 g
Discuss:
1. How do you interpret the confidence interval?
2. Is there a significant difference in birth weight?
3. What can you say about the corresponding p-value?
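The pooled standard deviation, the standard error and the confidence interval can be recomputed from the numbers quoted on the slides. This is a sketch under the equal-variance assumption stated above; the constant c = 1.97 is taken from the slide rather than computed, and the variable names are chosen for this example.

```python
import math

# Summary statistics quoted on the slides (birth weight in grams).
mean_a, s_a, n_a = 3432.0, 554.0, 183
mean_b, s_b, n_b = 3586.0, 478.0, 66
c = 1.97  # t-constant for 247 degrees of freedom, as read from a table

d = mean_b - mean_a

# Pooled SD: the two variances weighted by their degrees of freedom.
df = (n_a - 1) + (n_b - 1)
s_pooled = math.sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / df)

# Standard error of the difference between two independent means.
se = s_pooled * math.sqrt(1 / n_a + 1 / n_b)

lower, upper = d - c * se, d + c * se
print(f"D = {d:.0f} g, s_pooled = {s_pooled:.0f}, SE = {se:.0f}")
print(f"95% CI: {lower:.1f} to {upper:.1f} g")
```

Carrying the unrounded SE through the calculation gives limits that differ from the slide's 3 - 305 g only by rounding.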

T-test for two independent groups: Example of SPSS output
• Two different test versions depending on whether equal standard deviation (variance) can be assumed or not
• Levene’s test: p-value (”Sig.”) testing H0: Variance in A = Variance in B. If not low (e.g. if p > 0.1), read from the upper row.
• The output also shows the p-values for the t-tests and the difference with 95% CI

Presenting t-test results
• Average in each group
  • Mean (possibly median as well for comparison)
  • ± 95% CI sometimes relevant
  • ± SE (i.e. 68% CI) not relevant
• Variability in each group
  • Standard deviation
  • Percentiles if report space permits
• Mean difference between the groups
  • ± 95% CI usually relevant
  • P-value. Value of t-variable usually not relevant.

Elements of statistical inference
• The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (H0) is true.
• A type I error is rejecting H0 when in fact H0 is true; its probability is often referred to as alpha.
• A type II error is failing to reject H0 when in fact H0 is false; its probability is often referred to as beta.

Important notes…
• The p-value does not tell us anything about the size of the effect, only how probable it is to obtain an effect at least as large as the one in our sample if the null hypothesis is true
• The p-value is a function of both sample size and true effect size
• With large samples, statistically significant results could be found even if the absolute effects are so small that they are of no clinical interest.
• The fact that no significant results were found does not mean that no difference exists. Perhaps the study had too low power to detect a true difference/effect/association.

Statistical significance vs. clinical relevance
• Statistical significance: ”There is a difference”
  • Low p-value
• Clinical relevance: ”Is the difference of importance?”
  • How large is the difference?
Effect estimation (CI) is needed!

Exercise – statistical inference
[Figure: 95% confidence intervals around the effect measure, and p-values for the null hypothesis of ”no effect”, in 5 investigations (A-E); reference lines mark ”No effect” and ”Clinically relevant effect”. From Jonas Björk: Praktisk statistik för medicin och hälsa]
Make pairs of the statements and study results (A-E) in the figure:
1. Treatment effect cannot be detected, but cannot be ruled out
2. A clinically relevant effect is indicated, but is statistically uncertain
3. Treatment effect is statistically significant, uncertain if the effect is of clinical relevance
4. Clinically relevant effect that is statistically significant
5. Treatment effect is statistically significant, but a clinically relevant difference can be ruled out
Answer: 1. E (or B) 2. B 3. C 4. A 5. D

Exercise: Combine the statistical terms with the correct common phenomena (in common language) • Type I error • Type II error • Confounding • Non-causal association • Mass-significance • Lack of power

More than two groups, one-way ANOVA (analysis of variance)
• Multiple t-tests could result in mass-significance! Do ANOVA instead of repeated t-tests.
• ANOVA:
  • H0: Mean 1 = Mean 2 = Mean 3
  • H1: At least two of the means are different
• In short: In an ANOVA, the total variance is divided into the within-groups and the between-groups variance.

ANOVA
• Compare variances
  • Between groups (VB)
  • Within groups (VW)
• If VB is large and VW is small, the ratio VB/VW is large
The quotient (F = VB/VW) is expected to be close to 1 if the group means are equal and tends to be larger than 1 if they are not. The corresponding test is called an F-test and is based on the F-distribution
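How the F quotient is formed from the between-groups and within-groups variances can be sketched as follows. The data are made-up numbers chosen only to keep the arithmetic easy to follow, and the function name `one_way_f` is an assumption for this illustration; the p-value would still have to be read from the F-distribution (e.g. by SPSS).

```python
import statistics


def one_way_f(groups: list[list[float]]) -> float:
    """F = VB / VW for a one-way ANOVA on the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-groups sum of squares: observations around their own group mean.
    ss_within = sum(
        sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups
    )

    vb = ss_between / (k - 1)        # between-groups variance, df = k - 1
    vw = ss_within / (n_total - k)   # within-groups variance, df = N - k
    return vb / vw


# Hypothetical data: three groups of three observations each.
example = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f"F = {one_way_f(example):.2f}")
```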

Example: ANOVA – to compare the birth weight between 4 parity groups

Example: ANOVA – to compare the birth weight between 4 parity groups (cont.)

Example: ANOVA birth weight and parity, continued
Significance of the test: Are all the means the same? (m1 = m2 = m3 = m4)
To check for pair-wise differences, post-hoc tests could be performed

Post hoc tests: Different methods to adjust for multiple comparisons

Presenting ANOVA results
• Average in each group
  • Mean (possibly median as well for comparison)
  • ± 95% CI sometimes relevant (± SE (i.e. 68% CI) not relevant)
• Variability in each group
  • Standard deviation
  • Percentiles if report space permits
• P-value together with ANOVA-table if report space permits, otherwise ”F(df1, df2)=…”

PAIRED SAMPLES & NON-PARAMETRIC METHODS

T-test for paired data

Two ways of comparing means:
1. Calculate the means of the groups, and estimate the difference
2. Estimate the difference for each row. Then calculate the mean of the differences

Preparation    Controls, Day 2   Sal., Day 2   Difference between values
1              915600            357800        557800
2              953300            502200        451100
3              650000            470000        180000
4              700000            560000        140000
5              1050000           736000        314000
6              984000            556000        428000
7              772000            418000        354000
8              920000            600000        320000
9              1080000           680000        400000
10             920000            520000        400000
11             840000            560000        280000
12             533000            620000        -87000
13             510000            704000        -194000
14             722000            696000        26000
Means          824992.8571       570000        254992.9
s              181454.0097       111808        216636.9
s (combined)   150709.3394
SEM            56962.77603 (from the combined s)       57898.64 (from the differences)

Difference between means = mean of the differences.

Thus, the mean is not influenced by whether the data are paired or not, but the estimate of the standard deviation is likely to differ with method. Use analyses for paired data when adequate!
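The two ways of comparing the means can be traced step by step with the 14 preparations from the table. The sketch below uses only the standard library; the variable names, and the reading of "s (combined)" as the square root of the average of the two group variances, are assumptions made for this illustration.

```python
import math
import statistics

# Values for the 14 preparations, as listed on the slides.
controls = [915600, 953300, 650000, 700000, 1050000, 984000, 772000,
            920000, 1080000, 920000, 840000, 533000, 510000, 722000]
sal = [357800, 502200, 470000, 560000, 736000, 556000, 418000,
       600000, 680000, 520000, 560000, 620000, 704000, 696000]
n = len(controls)

# Way 2: one difference per row, then the mean and SEM of the differences.
diffs = [c - s for c, s in zip(controls, sal)]
mean_diff = statistics.fmean(diffs)
sem_paired = statistics.stdev(diffs) / math.sqrt(n)

# Way 1: treat the columns as independent groups with a combined SD.
s_combined = math.sqrt(
    (statistics.variance(controls) + statistics.variance(sal)) / 2
)
sem_unpaired = s_combined * math.sqrt(2 / n)

print(f"Mean difference: {mean_diff:.1f}")
print(f"SEM, paired analysis:   {sem_paired:.2f}")
print(f"SEM, unpaired analysis: {sem_unpaired:.2f}")
```

The mean difference is identical either way; only the standard error, and hence the CI and p-value, depends on whether the pairing is used.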

Paired samples t-test
The previous t-test was made to find differences between independent groups of observations. Sometimes it is more powerful to test for differences within the same patient (or another paired measurement).
In a study of weight loss from spicy food, 12 subjects were weighed before and after a month on a spicy food diet, see the table.
Discuss: How would you test if the diet gave weight loss?

Paired samples t-test
Calculate the differences d_i for each subject’s weights. Do a paired samples t-test! Test if mean(d) = 0
Discuss:
• How do you interpret the CI?
• Was the treatment effective?

Paired samples t-test
Paired test: 95% CI for weight loss (-4.0, -0.17), P = 0.036
If the researchers had not recognized the paired design, but did an independent groups’ t-test: CI = (-18.6; 22.76), P = 0.84
Why so wide? Large variability BETWEEN subjects inflates the variability of the difference in an independent groups’ design!

Exercise: Creatinine was measured in 11 men and 12 women: Men (n_A = 11), Women (n_B = 12)
• What test would you use to test if there is a difference?
• T-test? Are the assumptions of the test met?

Parametric methods for group comparisons
Normally distributed outcomes/’large’ studies. Focus on mean comparisons
• Two independent groups: t-test
• Paired groups (paired measurements): Paired t-test
• > 2 groups: Analysis of variance (ANOVA), Regression analysis
What if assumptions are not met? Non-parametric tests!

COMPARISON OF MEDIANS – NON-PARAMETRIC TESTS

Non-parametric methods
• Original measurements are converted to ranks in the analysis
• H0: Distributions are equal in all groups. The median is a useful marker for differences in distribution
• Insensitive to skewed distributions and extreme values
• Can be used for ordinal data, e.g. 0 = No response, 1 = Mild response, 2 = Strong response

Difference between two independent groups: Mann-Whitney’s test
• Rank the observations from the lowest to the highest
• Calculate the rank sum in group A (W_A) and in group B (W_B)
• The larger the difference in mean ranks W_A/n_A and W_B/n_B, the lower the p-value will be
• Another name for the same test is ”Wilcoxon rank sum test”, which utilizes W_A
• Straightforward generalization to more than two groups (Kruskal-Wallis test)
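The ranking step of the Mann-Whitney test can be sketched as below. The two groups contain hypothetical values invented for this example (they are not the creatinine data), tied observations are given the average of their ranks, and the function names are assumptions; the p-value itself would come from a table or from software.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from lowest (rank 1) to highest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def rank_sums(group_a: list[float], group_b: list[float]) -> tuple[float, float]:
    """Rank sums W_A and W_B after ranking both groups together."""
    ranks = average_ranks(list(group_a) + list(group_b))
    return sum(ranks[:len(group_a)]), sum(ranks[len(group_a):])


# Hypothetical measurements in two independent groups.
group_a = [70, 82, 85, 91]
group_b = [60, 65, 70, 75, 78]
w_a, w_b = rank_sums(group_a, group_b)
print(f"W_A = {w_a}, mean rank = {w_a / len(group_a):.2f}")
print(f"W_B = {w_b}, mean rank = {w_b / len(group_b):.2f}")
```

A useful check is that W_A + W_B always equals N(N + 1)/2, where N is the total number of observations.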

Small Group Discussion: Mann-Whitney
• Calculate the rank sum and the mean rank for males (and females if you have time)
• For the group sizes n_A = 11 (males) and n_B = 12 (females), p < 0.05 if the rank sum for the smaller group (males) is below 100 or above 175
• Conclusion?
• How would you summarize the test?

Presenting Mann-Whitney results
• Average in each group
  • Median (possibly mean as well for comparison)
• Variability in each group
  • Percentiles or quartiles (in smaller groups) or min-max (in even smaller groups)
  • Standard deviation not relevant
• Difference between the groups
  • P-value for M-W test
  • U-statistic sometimes relevant
  • Ideally: Median difference ± 95% CI (could be calculated in e.g. SPSS)

Mann-Whitney: Creatinine, cont.

Extension to more than two groups: Kruskal-Wallis test
• Mann-Whitney U-test (k = 2 groups)
  H0: Distribution A = Distribution B
  H1: Distribution A ≠ Distribution B
• Kruskal-Wallis test (k > 2 groups), e.g. k = 3:
  H0: Distribution A = Distribution B = Distribution C
  H1: Not all distributions are equal (e.g. Distribution A ≠ Distribution B or C)
• Independent groups, independent observations within each group
• The median is a useful marker for differences in distribution
• The more the mean ranks differ, the lower the p-value will be

NON-PARAMETRIC METHODS: PAIRED SAMPLES

Non-parametric test for paired samples: Wilcoxon signed rank test
Spicy diet continued:

Subject   Pre   Post   Diff   Sign   Rank   Signed rank
1         65    62     -3     -      5.5    -5.5
2         88    86     -2     -      3      -3
3         125   118    -7     -      11     -11
4         103   105    2      +      3      3
5         90    91     1      +      1      1
6         76    72     -4     -      8      -8
7         85    81     -4     -      8      -8
8         126   122    -4     -      8      -8
9         97    95     -2     -      3      -3
10        142   145    3      +      5.5    5.5
11        132   132    0      (zero difference, not ranked)
12        110   105    -5     -      10     -10

Sum of the positive ranks: 3 + 1 + 5.5 = 9.5
(Sum of the negative ranks = 56.5 (= 11*12/2 - 9.5))
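The rank sums in the spicy diet table can be reproduced with a short script. This is a sketch of the ranking step only: zero differences are dropped, tied absolute differences share the average rank, and the helper name `average_rank` is invented for this illustration. The p-value would be obtained from a table or from software such as SPSS.

```python
# Weights before and after the diet for the 12 subjects in the table.
pre = [65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110]
post = [62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105]

# Differences (post - pre); pairs with zero difference are not ranked.
diffs = [b - a for a, b in zip(pre, post) if b != a]

# Sorted absolute differences define the ranks.
abs_sorted = sorted(abs(d) for d in diffs)


def average_rank(value: int) -> float:
    """Rank of an absolute difference; ties share the average of their ranks."""
    first = abs_sorted.index(value) + 1   # rank of the first tied value
    count = abs_sorted.count(value)       # number of tied values
    return first + (count - 1) / 2


w_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)

print(f"Sum of positive ranks: {w_plus}")
print(f"Sum of negative ranks: {w_minus}")
```

With 11 non-zero differences the two sums must add up to 11*12/2 = 66, which is a quick sanity check on the ranking.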

Wilcoxon signed rank test - Some remarks
• Effect can be summarized by the median of the paired differences together with 95% CI
• Works for all types of distributions and for all study sizes
• Almost as powerful as the (paired) t-test if the differences are normally distributed
• Power is decreased if many differences are zero

Comparison of different tests

Test situation                      Parametric test   Non-parametric test
Independent samples, 2 groups       T-test            Mann-Whitney
Independent samples, ≥ 2 groups     ANOVA             Kruskal-Wallis
Paired samples, 2 groups            Paired t-test     Wilcoxon signed rank test

Two broad categories of statistical methods

Parametric methods (e.g. t-test)
+ Results in both an effect measure (with CI) and a p-value
+ More effective to detect differences if data is (close to) normal

Non-parametric methods (e.g. Mann-Whitney)
+ No assumptions about the distribution of data
+ Useful also for data measured on an ordinal scale
+ Suitable for small studies
- Less powerful than parametric methods (if normal distribution applies)
- Typically results only in a p-value (but sometimes an effect measure with CI could be computed)

Summary:
• Repetition: Normal distribution, Reference interval, SEM, T-test, ANOVA
• Paired t-test
• Non-parametric methods
  • Mann-Whitney
  • Kruskal-Wallis
  • Wilcoxon signed rank test
Next lecture:
• 2 x 2 table, Chi2-test, Fisher exact test
• Probability, proportions
• Linear regression
• Correlation, R2