Applications of Statistics in Research
Bandit Thinkhamrop, Ph.D. (Statistics)
Department of Biostatistics and Demography
Faculty of Public Health, Khon Kaen University

Begin at the conclusion

Type of the study outcome: Key for selecting appropriate statistical methods
Study outcome
– Dependent variable or response variable
– Focus on the primary study outcome if there is more than one
Type of the study outcome
– Continuous
– Categorical (dichotomous, polytomous, ordinal)
– Numerical (Poisson) count
– Event-free duration

Continuous outcome
Primary target of estimation:
– Mean (SD)
– Median (Min: Max)
– Correlation coefficient: r and ICC
Modeling:
– Linear regression: the model coefficient = Mean difference
– Quantile regression: the model coefficient = Median difference
Example:
– Outcome = Weight, BP, score of ?, level of ?, etc.
– RQ: Factors affecting birth weight
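To make "the model coefficient = mean difference" concrete, here is a minimal sketch with NumPy. The birth weights and the binary exposure `x` are hypothetical values invented for illustration; they are not data from the lecture.

```python
import numpy as np

# Hypothetical birth weights (grams); x = 1 marks an exposed group, x = 0 an unexposed group.
weight = np.array([3100.0, 3300.0, 2900.0, 3500.0, 2700.0, 2800.0, 2600.0, 3000.0])
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least squares fit of: weight = b0 + b1 * x
design = np.column_stack([np.ones_like(weight), x])
(b0, b1), *_ = np.linalg.lstsq(design, weight, rcond=None)

# Direct calculation of the difference between the two group means.
mean_difference = weight[x == 1].mean() - weight[x == 0].mean()

print(f"Regression coefficient b1: {b1:.1f}")
print(f"Mean difference:           {mean_difference:.1f}")
```

With a single binary predictor, the fitted slope and the difference between group means coincide, which is why the coefficient can be read as a mean difference.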

Categorical outcome
Primary target of estimation:
– Proportion or Risk
Modeling:
– Logistic regression: the exponentiated model coefficient = Odds ratio (OR)
Example:
– Outcome = Disease (y/n), Dead (y/n), Cured (y/n), etc.
– RQ: Factors affecting low birth weight
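As a small illustration of the odds ratio, the sketch below computes it from a 2x2 table together with a Wald 95% CI on the log scale. The counts are hypothetical. For a logistic regression with one binary predictor, the fitted coefficient equals the log of this odds ratio.

```python
import math

# Hypothetical 2x2 table: exposure (rows) by low birth weight (columns).
a, b = 30, 70   # exposed:   LBW yes, LBW no
c, d = 15, 85   # unexposed: LBW yes, LBW no

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)

# Odds ratio and its log; the log OR is the scale of a logistic regression coefficient.
odds_ratio = (a / b) / (c / d)
log_or = math.log(odds_ratio)

# Wald 95% CI computed on the log scale, then back-transformed.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"Risk: {risk_exposed:.2f} vs {risk_unexposed:.2f}")
print(f"OR = {odds_ratio:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```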

Numerical (Poisson) count outcome
Primary target of estimation:
– Incidence rate (e.g., rate per person-time)
Modeling:
– Poisson regression: the exponentiated model coefficient = Incidence rate ratio (IRR)
Example:
– Outcome = Total number of falls / Total time at risk of falling
– RQ: Factors affecting falls in the elderly
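The incidence rate and the incidence rate ratio can be computed by hand from event counts and person-time. The sketch below uses hypothetical counts of falls and person-years for two groups, and a Wald CI on the log scale as one common approximation.

```python
import math

# Hypothetical data: number of falls and total time at risk (person-years) in two groups.
falls_a, person_years_a = 12, 400.0
falls_b, person_years_b = 6, 500.0

rate_a = falls_a / person_years_a   # incidence rate in group A
rate_b = falls_b / person_years_b   # incidence rate in group B
irr = rate_a / rate_b               # incidence rate ratio

# Wald 95% CI on the log scale (standard large-sample approximation).
se_log_irr = math.sqrt(1 / falls_a + 1 / falls_b)
ci_low = math.exp(math.log(irr) - 1.96 * se_log_irr)
ci_high = math.exp(math.log(irr) + 1.96 * se_log_irr)

print(f"Rates per 100 person-years: {100 * rate_a:.1f} vs {100 * rate_b:.1f}")
print(f"IRR = {irr:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```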

Event-free duration outcome
Primary target of estimation:
– Median survival time
Modeling:
– Cox regression: the exponentiated model coefficient = Hazard ratio (HR)
Example:
– Outcome = Overall survival, disease-free survival, progression-free survival, etc.
– RQ: Factors affecting survival
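Median survival time is usually read from a Kaplan-Meier curve. The sketch below is a minimal pure-Python Kaplan-Meier estimator applied to hypothetical follow-up times; the function names `kaplan_meier` and `median_survival` are illustrative, and the code covers only the estimation target, not Cox regression.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event occurred at times[i], 0 if the subject was censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censorings at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the survival probability drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


# Hypothetical follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
print("Median survival time:", median_survival(curve))
```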

The outcome determines the statistics
– Continuous: Mean, Median; Linear Reg.
– Categorical: Proportion (Prevalence or Risk); Logistic Reg.
– Count: Rate per “space”; Poisson Reg.
– Survival: Median survival, Risk of events at T(t); Cox Reg.

Statistics quantify errors for judgments
– Parameter estimation [95% CI]
– Hypothesis testing [P-value]

Caution about biases
– Selection bias
– Information bias
– Confounding bias
Research design
– Prevent them
– Minimize them

Caution about biases
– Selection bias (SB)
– Information bias (IB)
– Confounding bias (CB)
If data are available:
– SB and IB can be assessed
– CB can be adjusted for using multivariable analysis

Generate a mock data set
General format of the data layout

id    y    x1    x2    x3
1
2
3
4
5
…
n
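A mock data set in this layout can be generated in a few lines. The sketch below is one possible way to do it; the distributions, ranges, and seed are arbitrary assumptions chosen only to fill the columns id, y, x1, x2, and x3.

```python
import random

random.seed(1)  # arbitrary seed so the mock data are reproducible

n = 5
rows = []
for i in range(1, n + 1):
    rows.append({
        "id": i,
        "y": round(random.gauss(50, 10), 1),     # continuous outcome (assumed mean 50, SD 10)
        "x1": random.randint(0, 1),              # binary covariate
        "x2": random.randint(18, 80),            # integer covariate, e.g. age
        "x3": round(random.uniform(15, 35), 1),  # continuous covariate
    })

# Print one row per subject under a header line.
columns = ["id", "y", "x1", "x2", "x3"]
print("\t".join(columns))
for row in rows:
    print("\t".join(str(row[c]) for c in columns))
```

Writing the analysis against mock data first makes it possible to draft the tables and the conclusion before the real data arrive.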

Generate a mock data set
Continuous outcome example

id: 1 2 3 4 5 … n
y: 2 2 0 2 14
x1: 1 0 1
x2: 21 12 4 89 0
x3: 22 19 20 21 18 6 0 45 21

Summary of y: Mean (SD)

Common types of statistical goals
– Single measurements (no comparison)
– Difference (compared by subtraction)
– Ratio (compared by division)
– Prediction (diagnostic test or predictive model)
– Correlation (examine a joint distribution)
– Agreement (examine concordance or similarity between pairs of observations)

Back to the conclusion
Outcome type and appropriate statistical methods:
– Continuous: Mean, Median
– Categorical: Proportion (Prevalence or Risk)
– Count: Rate per “space”
– Survival: Median survival, Risk of events at T(t)
Report:
– Magnitude of effect
– 95% CI: answer the research question based on the lower or upper limit of the CI
– P-value

Always report the magnitude of effect and its confidence interval
Absolute effects:
– Mean, Mean difference
– Proportion or prevalence, Rate or risk, Rate or Risk difference
– Median survival time
Relative effects:
– Relative risk, Rate ratio, Hazard ratio
– Odds ratio
Other magnitudes of effect:
– Correlation coefficient (r), Intra-class correlation (ICC)
– Kappa
– Diagnostic performance
– Etc.

Touch the variability (uncertainty) to understand statistical inference

id        A (x)   (x − X̄)   (x − X̄)²
1         2       −2        4
2         2       −2        4
3         0       −4        16
4         2       −2        4
5         14      10        100
Sum (Σ)   20      0         128
Mean (X̄)  4
Variance                    32.0  (= 128 / (5 − 1))
SD                          5.66
Median    2

Mean = (2 + 2 + 0 + 2 + 14) / 5 = 20 / 5 = 4
Median = middle value of the sorted data 0, 2, 2, 2, 14 = 2
Variance = SD²; Standard deviation = SD
Mean and median are measures of central tendency; variance and SD are measures of variation.
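The hand calculation for data set A (2, 2, 0, 2, 14) can be reproduced step by step. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
import statistics

a = [2, 2, 0, 2, 14]

mean_a = statistics.mean(a)                 # sum of the values divided by n
deviations = [x - mean_a for x in a]        # (x - mean); these always sum to zero
sum_sq = sum(d ** 2 for d in deviations)    # sum of squared deviations
variance = sum_sq / (len(a) - 1)            # divide by the degrees of freedom, n - 1
sd = variance ** 0.5                        # SD is the square root of the variance
median_a = statistics.median(a)             # middle value of the sorted data

print("Sum:", sum(a), "Mean:", mean_a, "Median:", median_a)
print("Sum of squared deviations:", sum_sq)
print(f"Variance: {variance:.1f}  SD: {sd:.2f}")
```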

Standard deviation (SD) = the average distance between each data item and their mean
$SD = \sqrt{\sum (x - \bar{X})^2 / (n - 1)}$, where $n - 1$ is the degrees of freedom

Same mean BUT different variation

id        A      B      C
1         2      0      3
2         2      3      4
3         0      4      4
4         2      5      4
5         14     8      5
Sum (Σ)   20     20     20
Mean      4      4      4
SD        5.66   2.92   0.71
Median    2      4      4

A: heterogeneous data, skewed distribution
C: homogeneous data, symmetric distribution
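The three data sets can be summarized side by side to confirm that they share a mean while differing in spread. A minimal sketch using the standard library:

```python
import statistics

groups = {
    "A": [2, 2, 0, 2, 14],
    "B": [0, 3, 4, 5, 8],
    "C": [3, 4, 4, 4, 5],
}

# For each data set: (mean, SD rounded to 2 decimals, median).
summary = {
    name: (statistics.mean(values), round(statistics.stdev(values), 2), statistics.median(values))
    for name, values in groups.items()
}

for name, (mean, sd, median) in summary.items():
    print(f"{name}: mean = {mean}, SD = {sd}, median = {median}")
```

In A the mean (4) and the median (2) disagree, which signals skewness; in B and C they agree.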

Facts about Variation
Because of variability, repeated samples will NOT yield the same statistic, such as a mean or a proportion:
– Statistics vary from study to study because of the role of chance
– It is hard to believe that the statistic is the parameter
– Thus we need statistical inference to estimate the parameter based on the statistics obtained from a study
Data that vary widely = heterogeneous data
Heterogeneous data require a large sample size to achieve a conclusive finding
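One way to see why heterogeneous data need a larger sample is the usual normal-approximation formula for estimating a mean to within a margin of error d: n = (Z × SD / d)². The sketch below applies it to three illustrative SD values; the margin of 1 unit and the helper name `required_n` are assumptions made for this example.

```python
import math


def required_n(sd, margin, z=1.96):
    """Sample size to estimate a mean within +/- margin at 95% confidence (normal approximation)."""
    return math.ceil((z * sd / margin) ** 2)


# Illustrative SDs, from widely varying to tightly clustered data; margin of error = 1 unit.
for label, sd in [("heterogeneous", 5.66), ("intermediate", 2.92), ("homogeneous", 0.71)]:
    print(f"{label:13s} SD = {sd:4.2f} -> n = {required_n(sd, margin=1.0)}")
```

The required n grows with the square of the SD, so doubling the variability roughly quadruples the sample size needed for the same precision.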

The Histogram
Data: A = 2, 2, 0, 2, 14; B = 4, 3, 5, 4, 4
[Figure: histograms of A and B on a horizontal axis from 0 to 14]

The Frequency Curve
Data: A = 2, 2, 0, 2, 14; B = 4, 3, 5, 4, 4
[Figure: frequency curves of A and B on a horizontal axis from 0 to 14]

Area Under The Frequency Curve
Data: A = 2, 2, 0, 2, 14; B = 4, 3, 5, 4, 4
[Figure: area under the frequency curves of A and B on a horizontal axis from 0 to 14]

Central Limit Theorem
[Figure: raw data may be right-skewed, symmetric, or left-skewed; the sample means X̄1, X̄2, …, X̄n are normally distributed]

Central Limit Theorem
[Figure: distribution of the raw data, and distribution of the sampling mean X̄1, X̄2, …, X̄n]

Central Limit Theorem
[Figure: distribution of the raw data, and distribution of the sampling mean X̄1, X̄2, …, X̄n; with a large sample the latter approaches the (theoretical) normal distribution]

Central Limit Theorem
– Raw data: many X, summarized by X̄ and SD
– Sample means X̄1, X̄2, …, X̄n: many X̄, summarized by their mean and SE
– Standard error (SE) = standard deviation of the sampling mean, estimated by SE = SD / √n
– Large sample: standardized for whatever n, Mean = 0, Standard deviation = 1
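The relation SE = SD / √n can be checked by simulation. The sketch below draws many samples from a skewed population (an exponential distribution with mean 1 and SD 1, chosen here purely as an assumption) and compares the standard deviation of the sample means with SD / √n.

```python
import math
import random
import statistics

random.seed(2024)  # arbitrary seed for reproducibility

n = 25             # size of each sample
replicates = 4000  # number of repeated samples

# Skewed raw data: exponential distribution with mean 1 and SD 1 (assumed population).
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replicates)
]

observed_se = statistics.stdev(sample_means)  # SD of the sampling means
theoretical_se = 1.0 / math.sqrt(n)           # SD / sqrt(n)

print(f"Mean of the sample means: {statistics.mean(sample_means):.3f}")
print(f"SD of the sample means:   {observed_se:.3f}")
print(f"SD / sqrt(n):             {theoretical_se:.3f}")
```

Even though the raw data are strongly right-skewed, the sample means cluster symmetrically around the population mean with a spread close to SD / √n.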

(Theoretical) Normal Distribution

99.73% of AUC: Mean ± 3 SD

95.45% of AUC: Mean ± 2 SD

68.27% of AUC: Mean ± 1 SD
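The areas under the normal curve within 1, 2, and 3 SD of the mean can be computed from the error function in the standard library. A minimal sketch (the helper name `area_within` is illustrative):

```python
import math


def area_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {100 * area_within(k):.2f}% of AUC")
```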

Sample: n = 25, X̄ = 52, SD = 5
Inference about the population:
– Parameter estimation [95% CI]
– Hypothesis testing [P-value]

Z = 2.58 (99%), Z = 1.96 (95%), Z = 1.64 (90%)
SE = SD / √n = 5 / √25 = 5 / 5 = 1

Parameter estimation [95% CI]
Z = 2.58, Z = 1.96, Z = 1.64
Sample: n = 25, X̄ = 52, SD = 5, SE = 1
95% CI: 52 − 1.96(1) to 52 + 1.96(1) = 50.04 to 53.96
We are 95% confident that the population mean lies between 50.04 and 53.96.
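The interval can be reproduced from the sample summary (n = 25, mean = 52, SD = 5) in a few lines. A minimal sketch using the normal approximation with Z = 1.96:

```python
import math

n, mean, sd = 25, 52.0, 5.0

se = sd / math.sqrt(n)   # standard error of the mean
z = 1.96                 # critical value for 95% confidence

lower = mean - z * se
upper = mean + z * se

print(f"SE = {se:.2f}; 95% CI: {lower:.2f} to {upper:.2f}")
```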

Hypothesis testing
Sample: n = 25, X̄ = 52, SD = 5, SE = 1
H0: μ = 55
HA: μ ≠ 55
Z = (55 − 52) / 1 = 3

Hypothesis testing
H0: μ = 55; HA: μ ≠ 55
Z = (55 − 52) / 1 = 3, so the sample mean of 52 lies 3 SE below 55
P-value = 1 − 0.9973 = 0.0027
If the true mean in the population is 55, the chance of obtaining a sample mean of 52 or more extreme is 0.0027.
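The P-value for this test can be computed directly from the normal distribution. In the sketch below Z is written as (sample mean − null mean) / SE, so it comes out as −3; the sign does not matter for a two-sided test.

```python
import math

sample_mean, null_mean, se = 52.0, 55.0, 1.0

z = (sample_mean - null_mean) / se

# Two-sided P-value: area in both tails beyond |z| under the standard normal curve.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```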

P-value is the magnitude of chance, NOT the magnitude of effect
P-value < 0.05 = Significant findings
– Small chance of being wrong in rejecting the null hypothesis
– If in fact there is no [effect], it is unlikely to get the [effect] = [magnitude of effect] or more extreme
– Significance DOES NOT MEAN importance
– Any extra-large study can give a very small P-value even if the [magnitude of effect] is very small

P-value is the magnitude of chance, NOT the magnitude of effect
P-value > 0.05 = Non-significant findings
– High chance of being wrong in rejecting the null hypothesis
– If in fact there is no [effect], the [effect] = [magnitude of effect] or more extreme can occur by chance
– Non-significance DOES NOT MEAN no difference, equal, or no association
– Any small study can give a very large P-value even if the [magnitude of effect] is very large
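The dependence of the P-value on study size can be shown with a simple two-group z-test. The effect sizes, SD, and group sizes below are hypothetical, and the helper `two_sided_p` assumes a known, equal SD in both groups.

```python
import math


def two_sided_p(mean_difference, sd, n_per_group):
    """Two-sided P-value from a z-test comparing two group means (known, equal SD assumed)."""
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_difference / se
    return math.erfc(abs(z) / math.sqrt(2))


# Tiny effect (0.5 units, SD 10) in an extra-large study.
p_tiny_effect_huge_study = two_sided_p(0.5, 10, 50000)

# Large effect (8 units, SD 10) in a very small study.
p_large_effect_small_study = two_sided_p(8, 10, 5)

print(f"Tiny effect, n = 50000 per group: P = {p_tiny_effect_huge_study:.2e}")
print(f"Large effect, n = 5 per group:    P = {p_large_effect_small_study:.2f}")
```

The first result is highly significant despite a trivial difference, and the second is non-significant despite a large one, which is why the P-value cannot stand in for the magnitude of effect.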

P-value vs. 95% CI (1)
An example of a study with a dichotomous outcome: a study compared the cure rate between Drug A and Drug B.
Setting:
– Drug A = Alternative treatment
– Drug B = Conventional treatment
Results:
– Drug A: n1 = 50, Pa = 80%
– Drug B: n2 = 50, Pb = 50%
– Pa − Pb = 30% (95% CI: 26% to 34%; P = 0.001)

P-value vs. 95% CI (2)
[Figure: the difference Pa − Pb plotted on an axis whose two sides are labeled Pa > Pb and Pb > Pa]
Pa − Pb = 30% (95% CI: 26% to 34%; P < 0.05)

P-value vs. 95% CI (3)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. Page 99.

P-value vs. 95% CI (4)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. Page 99.
There was a statistically significant difference between the two groups.

P-value vs. 95% CI (5)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. Page 99.
There was no statistically significant difference between the two groups.

P-value vs. 95% CI (6)
Safe tips:
– Always report the 95% CI with the P-value; never report the P-value alone
– Always interpret based on the lower or upper limit of the confidence interval; the P-value is optional
– Never interpret a P-value > 0.05 as an indication of no difference or no association; only the CI can provide this message

Q&A Thank you