Common Statistical Analyses: Theory behind them
Bandit Thinkhamrop, Ph.D. (Statistics)
Department of Biostatistics and Demography, Faculty of Public Health, Khon Kaen University, THAILAND
Statistical inference revisited
Statistical inference uses data from samples to make inferences about a population:
1. Estimate the population parameter, characterized by a confidence interval for the magnitude of the effect of interest
2. Test a hypothesis formulated before looking at the data, characterized by a p-value
Sample: n = 25, X = 52, SD = 5
Population: parameter estimation [95% CI]; hypothesis testing [P-value]
Parameter estimation [95% CI]
Critical values: Z = 2.58 (99%), Z = 1.96 (95%), Z = 1.64 (90%)
Sample: n = 25, X = 52, SD = 5, SE = SD/sqrt(n) = 1
95% CI: 52 - 1.96(1) to 52 + 1.96(1), i.e., 50.04 to 53.96
We are 95% confident that the population mean lies between 50.04 and 53.96.
Hypothesis testing
Sample: n = 25, X = 52, SD = 5, SE = 1
H0: mu = 55
HA: mu != 55
Z = (55 - 52) / 1 = 3
Hypothesis testing (cont.)
The sample mean of 52 lies 3 SE below the hypothesized 55.
H0: mu = 55; HA: mu != 55
Z = (55 - 52) / 1 = 3
P-value = 1 - 0.9973 = 0.0027
If the true mean in the population is 55, the chance of obtaining a sample mean of 52 or more extreme is 0.0027.
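The arithmetic on the two slides above can be sketched in Python with scipy (scipy is assumed here; the original slides use Stata):

```python
from scipy.stats import norm

n, mean, sd = 25, 52, 5
se = sd / n ** 0.5                           # SE = 5 / sqrt(25) = 1

# 95% CI from the normal critical value 1.96
lo, hi = mean - 1.96 * se, mean + 1.96 * se  # 50.04 to 53.96

# two-sided p-value for H0: mu = 55; z = (55 - 52) / 1 = 3
z = abs(55 - mean) / se
p = 2 * norm.sf(z)                           # 1 - 0.9973 = 0.0027
```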
Calculation of the previous example based on the t-distribution
Stata command to find the t value for the 95% CL:
. di (invttail(24, 0.025))
2.0638986
Stata command to find the probability:
. di (ttail(24, 3))*2
.00620574
Web-based statistical tables: http://vassarstats.net/tabs.html or www.stattrek.com
Revisit the example based on the t-distribution (Stata output)
1. Estimate the population parameter

 Variable |  Obs   Mean   Std. Err.   [95% Conf. Interval]
 ---------+-----------------------------------------------
        x |   25     52           1     49.9361    54.0639

2. Test the hypothesis formulated before looking at the data

One-sample t test
 Variable |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+-----------------------------------------------------------
        x |   25     52           1           5     49.9361    54.0639
 mean = mean(x)                                   t = -3.0000
 Ho: mean = 55                    degrees of freedom = 24
 Ha: mean < 55          Ha: mean != 55            Ha: mean > 55
 Pr(T < t) = 0.0031     Pr(|T| > |t|) = 0.0062    Pr(T > t) = 0.9969
Mean, one group: t-test
1. Hypothesis: H0: mu = 0; Ha: mu != 0
2. Data: 1, 2, 2, 5, 5
3. Calculate the t-statistic
4. Obtain the p-value based on the t-distribution
5. Make a decision
P-value = 0.023
Stata command:
. di (ttail(4, 3.59))*2
.02296182
Reject the null hypothesis at a significance level of 0.05. The mean of y is statistically significantly different from zero.
Mean, one group: t-test (cont.)
One-sample t test
 Variable |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+-----------------------------------------------------------
        y |    5      3      .83666    1.870829    .6770594   5.322941
 mean = mean(y)                                   t = 3.5857
 Ho: mean = 0                     degrees of freedom = 4
 Ha: mean < 0           Ha: mean != 0             Ha: mean > 0
 Pr(T < t) = 0.9885     Pr(|T| > |t|) = 0.0231    Pr(T > t) = 0.0115
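A Python/scipy equivalent of this one-sample t-test (not part of the original Stata-based slides):

```python
from scipy import stats

y = [1, 2, 2, 5, 5]
t, p = stats.ttest_1samp(y, popmean=0)  # two-sided test of H0: mu = 0
# t is about 3.59 and p about 0.023, as on the slide
```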
Comparing 2 means: t-test
1. Hypothesis: H0: muA = muB; Ha: muA != muB
2. Data: a = 1, 2, 2, 5, 5; b = 5, 9, 9, 8, 9
3. Calculate the t-statistic
4. Obtain the p-value based on the t-distribution
5. Make a decision
P-value = 0.002 (http://vassarstats.net/tabs.html)
Reject the null hypothesis at a significance level of 0.05. The mean of group A is statistically significantly different from that of group B.
t-test
Two-sample t test with equal variances
    Group |  Obs   Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
 ---------+-----------------------------------------------------------
        a |    5      3      .83666    1.870829    .6770594   5.322941
        b |    5      8    .7745967    1.732051    5.849375   10.15063
 ---------+-----------------------------------------------------------
 combined |   10    5.5    .9916317    3.135815    3.256773   7.743227
 ---------+-----------------------------------------------------------
     diff |      -5     1.140175                  -7.629249  -2.370751
 diff = mean(1) - mean(2)                         t = -4.3853
 Ho: diff = 0                     degrees of freedom = 8
 Ha: diff < 0           Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 0.0012     Pr(|T| > |t|) = 0.0023    Pr(T > t) = 0.9988
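The same pooled-variance two-sample t-test, sketched in Python with scipy (a translation of the Stata run, not the original code):

```python
from scipy import stats

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
t, p = stats.ttest_ind(a, b, equal_var=True)  # pooled-variance (equal variances) test
# t is about -4.39 and the two-sided p about 0.0023
```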
Mann-Whitney U test (Wilcoxon rank-sum test)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
    group |  obs   rank sum   expected
 ---------+---------------------------
        1 |    5         16       27.5
        2 |    5         39       27.5
 ---------+---------------------------
 combined |   10         55         55
 unadjusted variance    22.92
 adjustment for ties    -1.25
 adjusted variance      21.67
 Ho: y(group==1) = y(group==2)
          z = -2.471
 Prob > |z| = 0.0135
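A scipy sketch of the same rank-sum test. To match the Stata output, the normal approximation is requested without a continuity correction (scipy's tie adjustment then mirrors Stata's adjusted variance):

```python
from scipy import stats

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
u, p = stats.mannwhitneyu(a, b, use_continuity=False,
                          alternative='two-sided', method='asymptotic')
# two-sided p about 0.0135, as in the Stata output
```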
Comparing 2 means: ANOVA
Mathematical model of ANOVA: X = mu + tau + epsilon
X = Grand mean + Treatment effect + Error, i.e., X = M + T + E
Treatment effects: [3 - 5.5] for group a and [8 - 5.5] for group b (group means 3 and 8; grand mean 5.5)
Degrees of freedom: 1 (between groups), 8 (within groups)
3. Calculate the F-statistic: between groups (SST) vs. within groups (SSE)
4. Obtain the p-value based on the F-distribution
5. Make a decision
P-value = 0.002 (http://vassarstats.net/tabs.html)
Reject the null hypothesis at a significance level of 0.05. The mean of group A is statistically significantly different from that of group B.
ANOVA, 2 groups
Analysis of Variance
    Source           SS    df      MS           F     Prob > F
 --------------------------------------------------------------
 Between groups    62.5     1    62.5       19.23       0.0023
 Within groups       26     8    3.25
 --------------------------------------------------------------
 Total             88.5     9    9.83333333
 Bartlett's test for equal variances: chi2(1) = 0.0211  Prob>chi2 = 0.885
Comparing 3 means: ANOVA
1. Hypothesis: H0: muA = muB = muC; Ha: at least one mean is different
2. Data: a = 1, 2, 2, 5, 5; b = 5, 9, 9, 8, 9; c = 4, 4, 6, 8, 4
ANOVA, 3 groups (cont.)
Mathematical model of ANOVA: X = mu + tau + epsilon
X = Grand mean + Treatment effect + Error, i.e., X = M + T + E
Treatment effects: [3 - 5.4], [8 - 5.4], [5.2 - 5.4] (group means 3, 8, 5.2; grand mean 5.4)
Degrees of freedom: n = 15; 2 (between groups), 12 (within groups)
3. Calculate the F-statistic: between groups (SST) vs. within groups (SSE)
4. Obtain the p-value based on the F-distribution
5. Make a decision
P-value = 0.003 (http://vassarstats.net/tabs.html)
Reject the null hypothesis at a significance level of 0.05. At least one mean of the three groups is statistically significantly different from the others.
ANOVA, 3 groups
Analysis of Variance
    Source           SS    df      MS           F     Prob > F
 --------------------------------------------------------------
 Between groups    62.8     2    31.4        9.71       0.0031
 Within groups     38.8    12    3.23333333
 --------------------------------------------------------------
 Total            101.6    14    7.25714286
 Bartlett's test for equal variances: chi2(2) = 0.0217  Prob>chi2 = 0.989
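The three-group one-way ANOVA can be reproduced in Python with scipy (scipy assumed; the slides use Stata):

```python
from scipy import stats

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
c = [4, 4, 6, 8, 4]
F, p = stats.f_oneway(a, b, c)  # one-way ANOVA across the three groups
# F is about 9.71 and p about 0.0031, matching the table above
```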
Kruskal-Wallis test
Kruskal-Wallis equality-of-populations rank test
 | group | Obs | Rank Sum |
 |-------+-----+----------|
 |     1 |   5 |    22.00 |
 |     2 |   5 |    61.50 |
 |     3 |   5 |    36.50 |
 chi-squared           = 7.985 with 2 d.f.   probability = 0.0185
 chi-squared with ties = 8.190 with 2 d.f.   probability = 0.0167
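A scipy sketch of the same test. Note that scipy's Kruskal-Wallis applies the tie correction, so it reproduces the "with ties" line of the Stata output:

```python
from scipy import stats

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
c = [4, 4, 6, 8, 4]
H, p = stats.kruskal(a, b, c)  # tie-corrected H, about 8.19 with p about 0.0167
```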
Comparing 2 means: Regression
1. Data (x = 1 for group a, x = 2 for group b):

   x    y    (x-x̄)²   (y-ȳ)   (x-x̄)(y-ȳ)
   1    1     0.25     -4.5       2.25
   1    2     0.25     -3.5       1.75
   1    2     0.25     -3.5       1.75
   1    5     0.25     -0.5       0.25
   1    5     0.25     -0.5       0.25
   2    5     0.25     -0.5      -0.25
   2    9     0.25      3.5       1.75
   2    9     0.25      3.5       1.75
   2    8     0.25      2.5       1.25
   2    9     0.25      3.5       1.75
 Sum          2.5                12.5
 Mean: x̄ = 1.5, ȳ = 5.5

ŷ = a + bx, where b = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)² = 12.5/2.5 = 5
Then ȳ = a + b x̄, so a = 5.5 - 5(1.5) = 5.5 - 7.5 = -2
Comparing 2 means: Regression
[Scatter plot: y (-2 to 10) by group a vs. b; data a = 1, 2, 2, 5, 5; b = 5, 9, 9, 8, 9]
Comparing 2 means: Regression (cont.)
[Scatter plot: y (-2 to 10) against x, with x coded 1 for group a and 2 for group b]
Comparing 2 means: Regression (cont.)
y = a + bx = -2 + 5x
Predicted y = 3 if x = 1; predicted y = 8 if x = 2
The fitted line passes through ȳ = 5.5 at x̄ = 1.5
b = difference in y between x = 1 and x = 2
a = value of y at x = 0, i.e., y = -2
Regression model (2 means)
     Source |      SS   df       MS          Number of obs =     10
 -----------+------------------------       F(1, 8)       =  19.23
      Model |    62.5    1     62.5         Prob > F      = 0.0023
   Residual |      26    8     3.25         R-squared     = 0.7062
 -----------+------------------------       Adj R-squared = 0.6695
      Total |    88.5    9  9.83333333      Root MSE      = 1.8028

          y |    Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
 -----------+----------------------------------------------------------
      group |        5   1.140175   4.39   0.002    2.370751   7.629249
      _cons |       -2   1.802776  -1.11   0.299   -6.157208   2.157208
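The same two-group regression, sketched with scipy's linear regression (a translation, not the slide's Stata run). The slope equals the difference between group means, and its p-value equals the two-sample t-test's:

```python
from scipy import stats

x = [1] * 5 + [2] * 5                  # group coded 1 (a) and 2 (b)
y = [1, 2, 2, 5, 5, 5, 9, 9, 8, 9]
res = stats.linregress(x, y)
# slope b = 5, intercept a = -2; res.pvalue is about 0.0023
```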
Regression model (3 means)
i.group   _Igroup_1-3   (naturally coded; _Igroup_1 omitted)
     Source |      SS   df       MS          Number of obs =     15
 -----------+------------------------       F(2, 12)      =   9.71
      Model |    62.8    2     31.4         Prob > F      = 0.0031
   Residual |    38.8   12  3.23333333      R-squared     = 0.6181
 -----------+------------------------       Adj R-squared = 0.5545
      Total |   101.6   14  7.25714286      Root MSE      = 1.7981

          y |    Coef.  Std. Err.      t   P>|t|   [95% Conf. Interval]
 -----------+----------------------------------------------------------
  _Igroup_2 |        5   1.137248   4.40   0.001    2.522149   7.477851
  _Igroup_3 |      2.2   1.137248   1.93   0.077   -.2778508   4.677851
      _cons |        3   .8041559   3.73   0.003    1.247895   4.752105
Correlation coefficient
• Pearson product-moment correlation
  – Denoted by r (for the sample) or ρ (for the population)
  – Requires the bivariate normal distribution assumption
  – Requires a linear relationship
• Spearman rank correlation
  – For small samples; does not require the bivariate normal distribution assumption
Pearson product-moment correlation
r = Σ(x-x̄)(y-ȳ) / [(n-1) SDx SDy]
Indeed it is the mean of the products of the standard scores: r = Σ(zx zy) / (n-1)
Scatter plot
[Scatter plot of b (y-axis, 0 to 10) against a (x-axis, 1 to 5); data a = 1, 2, 2, 5, 5; b = 5, 9, 9, 8, 9]
Calculation of the correlation coefficient (r)
  [1] x   [2] y   [3] (x-x̄)/SDx   [4] (y-ȳ)/SDy   [3] × [4]
      1       5          -1.07           -1.73        1.85
      2       9          -0.53            0.58       -0.31
      2       9          -0.53            0.58       -0.31
      5       8           1.07            0.00        0.00
      5       9           1.07            0.58        0.62
 Sum                                                  1.85
 Mean     3       8
 SD       1.87    1.73
r = 1.85 / (n - 1) = 1.85 / 4 ≈ 0.46
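The hand calculation above can be checked with scipy (an addition to the slides; scipy assumed), which also gives the Spearman coefficient used later:

```python
from scipy import stats

x = [1, 2, 2, 5, 5]
y = [5, 9, 9, 8, 9]
r, p_r = stats.pearsonr(x, y)     # r about 0.46, the table's 1.85/4 up to rounding
rho, p_s = stats.spearmanr(x, y)  # Spearman rho about 0.354
```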
Interpretation of the correlation coefficient
 Correlation   Negative          Positive
 None          -0.09 to 0.00     0.00 to 0.09
 Small         -0.30 to -0.10    0.10 to 0.30
 Medium        -0.50 to -0.30    0.30 to 0.50
 Strong        -1.00 to -0.50    0.50 to 1.00
These serve as a guide, not a strict rule. In fact, the interpretation of a correlation coefficient depends on the context and purposes. (From Wikipedia, the free encyclopedia.)
The correlation coefficient reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row). The figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero. (This is a file from the Wikimedia Commons.)
Inference on the correlation coefficient
• Stata commands:
. di tanh(-0.885)
-.70891534
. di tanh(1.887)
.95511058
Stata command
• ci2 x y, corr spearman
Confidence interval for Spearman's rank correlation of x and y, based on Fisher's transformation.
Correlation = 0.354 on 5 observations (95% CI: -0.768 to 0.942)
Warning: This method may not give valid results with small samples (n <= 10) for rank correlations.
Inference on the correlation coefficient
• Or use the Stata command:
. di (ttail(3, 0.9))*2
.43445103
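Both inference routes for the correlation, the Fisher transformation CI (the tanh lines above) and the t-test of H0: ρ = 0, can be sketched in Python. This assumes the Pearson r of about 0.46 from the earlier slide:

```python
import math
from scipy import stats

x = [1, 2, 2, 5, 5]
y = [5, 9, 9, 8, 9]
n = len(x)
r, _ = stats.pearsonr(x, y)                    # about 0.46

# Fisher's z transformation: atanh(r) +/- 1.96/sqrt(n-3), back via tanh
z_r = math.atanh(r)
half = 1.96 / math.sqrt(n - 3)                 # z limits about -0.885 and 1.887
lo, hi = math.tanh(z_r - half), math.tanh(z_r + half)  # about -0.709 to 0.955

# t-test of H0: rho = 0 with n-2 df: t = r*sqrt(n-2)/sqrt(1-r^2), about 0.9
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
p = 2 * stats.t.sf(t, n - 2)                   # about 0.43
```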
Inference on proportion • One proportion • Two proportions • Three or more proportions
One proportion: Z-test
1. Hypothesis: H0: π = π0; Ha: π != π0
2. Data: y = 1, 0, 1, ..., 0; n = 50, p = 0.1
3. Calculate the z-statistic
4. Obtain the p-value based on the Z-distribution
P-value = 0.018 (http://vassarstats.net/tabs.html)
Stata command to get the p-value:
. di (1-normal(2.357))*2
.01842325
5. Make a decision
Reject the null hypothesis at a significance level of 0.05. The proportion of Y is statistically significantly different from the hypothesized value.
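The Stata one-liner above maps directly onto scipy's normal survival function (a translation, not original slide code):

```python
from scipy.stats import norm

# reproduce: di (1-normal(2.357))*2
p = 2 * norm.sf(2.357)  # about 0.0184
```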
Comparing 2 proportions: Z-test
1. Hypothesis: H0: π1 = π0; Ha: π1 != π0
2. Data: n0 = 50, p0 = 0.1; n1 = 50, p1 = 0.4

   x       y = 0   y = 1   Total
   0          45       5      50
   1          30      20      50
   Total      75      25     100

3. Calculate the z-statistic
4. Obtain the p-value based on the Z-distribution
5. Make a decision
P-value = 0.0005 (http://vassarstats.net/tabs.html)
Reject the null hypothesis at a significance level of 0.05. The proportion of Y differs statistically significantly between the groups of x.
Z-test for two proportions
Two-sample test of proportions           0: Number of obs = 50
                                         1: Number of obs = 50
   Variable |   Mean   Std. Err.      z   P>|z|   [95% Conf. Interval]
 -----------+----------------------------------------------------------
          0 |     .1    .0424264                   .0168458    .1831542
          1 |     .4     .069282                   .2642097    .5357903
 -----------+----------------------------------------------------------
       diff |    -.3    .0812404                  -.4592282   -.1407718
            |  under Ho:   .0866025  -3.46  0.001
 diff = prop(0) - prop(1)                         z = -3.4641
 Ho: diff = 0
 Ha: diff < 0           Ha: diff != 0             Ha: diff > 0
 Pr(Z < z) = 0.0003     Pr(|Z| < |z|) = 0.0005    Pr(Z > z) = 0.9997
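The pooled two-proportion z-statistic can be computed from the counts directly, a sketch of what the Stata command does under Ho:

```python
from math import sqrt
from scipy.stats import norm

n0, x0 = 50, 5    # group 0: 5/50,  p0 = 0.1
n1, x1 = 50, 20   # group 1: 20/50, p1 = 0.4
p0, p1 = x0 / n0, x1 / n1
pooled = (x0 + x1) / (n0 + n1)                    # 0.25
se = sqrt(pooled * (1 - pooled) * (1/n0 + 1/n1))  # about .0866, "under Ho" SE
z = (p0 - p1) / se                                # about -3.4641
p = 2 * norm.sf(abs(z))                           # about 0.0005
```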
Comparing 2 proportions: Chi-square test
1. Hypothesis: H0: πij = πi+ π+j, where i = 0, 1; j = 0, 1; Ha: πij != πi+ π+j
2. Data:

   x       y = 0   y = 1   Total
   0          45       5      50
   1          30      20      50
   Total      75      25     100

3. Calculate the χ²-statistic:

    O     E                      O-E     (O-E)²   (O-E)²/E
   45     (75/100)(50) = 37.50    7.50    56.25     1.50
    5     (25/100)(50) = 12.50   -7.50    56.25     4.50
   30     (75/100)(50) = 37.50   -7.50    56.25     1.50
   20     (25/100)(50) = 12.50    7.50    56.25     4.50
   Chi-square (df = 1)                             12.00

4. Obtain the p-value based on the chi-square distribution
5. Make a decision
P-value = 0.001 (http://vassarstats.net/tabs.html)
Reject the null hypothesis at a significance level of 0.05. There is a statistically significant association between x and y.
Comparing 2 proportions: Chi-square test
           |        y
         x |      0      1 |  Total
 ----------+---------------+-------
         0 |     45      5 |     50
         1 |     30     20 |     50
 ----------+---------------+-------
     Total |     75     25 |    100
 Pearson chi2(1) = 12.0000   Pr = 0.001
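A scipy version of the Pearson chi-square test (continuity correction turned off to match the hand calculation, which gives chi-square = 12 with expected counts 37.5 and 12.5):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[45, 5],
                  [30, 20]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
# chi2 = 12.0 with dof = 1
```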
. csi 20 5 30 45, or exact
                 |  Exposed  Unexposed |    Total
 ----------------+---------------------+---------
           Cases |       20          5 |       25
        Noncases |       30         45 |       75
 ----------------+---------------------+---------
           Total |       50         50 |      100
            Risk |       .4         .1 |      .25

                 |  Point estimate     | [95% Conf. Interval]
 ----------------+---------------------+----------------------
 Risk difference |        .3           |  .1407718    .4592282
      Risk ratio |         4           |   1.62926    9.820408
 Attr. frac. ex. |       .75           |  .3862245    .8981712
 Attr. frac. pop |        .6           |
      Odds ratio |         6           |  2.086602    17.09265  (Cornfield)

 1-sided Fisher's exact P = 0.0005
 2-sided Fisher's exact P = 0.0010
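The point estimates and the exact test can be checked in Python; the risk ratio and odds ratio follow directly from the 2x2 counts, and scipy provides Fisher's exact test (its two-sided p may differ slightly from Stata's in edge cases):

```python
from scipy.stats import fisher_exact

a, b = 20, 5    # cases:    exposed, unexposed
c, d = 30, 45   # noncases: exposed, unexposed

risk_ratio = (a / (a + c)) / (b / (b + d))  # 0.4 / 0.1 = 4.0
odds_ratio = (a * d) / (b * c)              # (20*45)/(5*30) = 6.0
oddsr, p = fisher_exact([[a, b], [c, d]])   # two-sided exact p about 0.001
```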
Binomial regression
. binreg y x, rr
Generalized linear models                 No. of obs      =        100
Optimization : MQL Fisher scoring         Residual df     =         98
               (IRLS EIM)                 Scale parameter =          1
Deviance = 99.80946404                    (1/df) Deviance =   1.018464
Pearson  = 99.99966753                    (1/df) Pearson  =   1.020405
Variance function: V(u) = u*(1-u)         [Bernoulli]
Link function    : g(u) = ln(u)           [Log]
                                          BIC             =  -351.4972

            |       EIM
          y | Risk Ratio  Std. Err.      z   P>|z|   [95% Conf. Interval]
 -----------+------------------------------------------------------------
          x |          4   1.833024   3.03   0.002    1.629265   9.820377
      _cons |         .1   .0424262  -5.43   0.000    .0435379   .2296851
Logistic regression
. logistic y x
Logistic regression                       Number of obs =       100
                                          LR chi2(1)    =     12.66
                                          Prob > chi2   =    0.0004
Log likelihood = -49.904732               Pseudo R2     =    0.1125

          y | Odds Ratio  Std. Err.      z   P>|z|   [95% Conf. Interval]
 -----------+------------------------------------------------------------
          x |          6   3.316625   3.24   0.001    2.030635   17.72844
      _cons |   .1111111   .0523783  -4.66   0.000     .044106   .2799096
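With a single binary predictor, the logistic-regression odds ratio and its Wald statistic can be recovered by hand from the 2x2 counts, a cross-check sketch rather than a refit of the model:

```python
import math

a, b, c, d = 20, 5, 30, 45                       # the 2x2 table above
or_hat = (a * d) / (b * c)                       # 6.0, the fitted odds ratio
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of ln(OR), about 0.553
z = math.log(or_hat) / se_log                    # Wald z, about 3.24
lo = math.exp(math.log(or_hat) - 1.96 * se_log)  # about 2.03
hi = math.exp(math.log(or_hat) + 1.96 * se_log)  # about 17.7
```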
Q & A