
Sampling and Confidence Interval
Epidemiology/Biostatistics
Kenneth Kwan Ho Chui, Ph.D., MPH
Department of Public Health and Community Medicine
kenneth.chui@tufts.edu
617.636.0853

Learning objectives in the syllabus
- Understand how a histogram can be read as a probability distribution
- Understand the importance of random sampling in statistics
- Understand how sample means can have distributions
- Explain the behavior (distribution) of sample means and the Central Limit Theorem
- Know how to interpret confidence intervals as seen in the medical literature
- Know how to calculate a confidence interval for a mean

- Population
- Sample
- Parameter
- Sample statistics
- Distribution of sample means
- Know how to interpret and calculate a confidence interval for statistical inference
- Types of data
- How to summarize data
- Central tendency
- Variability
- How to evaluate graphs

Assumed knowledge for today
- Mean
- Variance
- Standard deviation
- The 68-95-99 rule

Central tendency: Mean
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
Mean = sum of the values / number of values = 37 / 10 = 3.7

Variance & Standard deviation
- For each observation, take the difference between its value and the mean, and square it
- Sum them up
- Divide by (sample size - 1) to get the variance
- SD = √Variance
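As a worked check of these steps, here is a minimal Python sketch that applies them to the ten values from the mean slide, using only the standard library. The variable names are illustrative and not part of the original slides.

```python
import math

data = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6]

n = len(data)
mean = sum(data) / n

# Square each observation's deviation from the mean, then sum the squares.
squared_deviations = [(x - mean) ** 2 for x in data]

# Divide by (sample size - 1) to get the sample variance.
variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {mean:.2f}")
print(f"variance = {variance:.2f}")
print(f"SD = {sd:.2f}")
```

For these data the mean is 3.7, the variance is about 2.23, and the SD is about 1.49.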

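The 68-95-99 rule
For normally distributed data:
- About 68% of observations are within ±1 SD of the mean
- About 95% of observations are within ±2 SD
- About 99% of observations are within ±3 SD (more precisely, about 99.7%)
# of SD: -3, -2, -1, 0, +1, +2, +3
Percentile (rounded to match the rule): 0.5th, 2.5th, 16th, 50th, 84th, 97.5th, 99.5th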
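The percentages in the rule can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution lying within +/- k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/-{k} SD: {share_within(k):.1%}")
```

The exact shares are roughly 68.3%, 95.4%, and 99.7%, so the rule's figures are convenient roundings.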

Population parameter: the true mean BMI of Boston, Massachusetts (unknown to the researcher)
Sample statistic: the mean BMI of a sample from Boston, Massachusetts

Sample variation
The whole population (unknown to the researchers): 1, 2, 3, 4, 5, 6
- Researcher 1: sample 2, 4; mean 3.0
- Researcher 2: sample 4, 6; mean 5.0
- Researcher 3: sample 1, 2; mean 1.5
- Researcher 4: sample 1, 6; mean 3.5
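To see how much sample means can vary, the sketch below lists every possible sample of size 2 from this six-value population and computes each mean. It assumes sampling without replacement, which matches the four samples shown but is not stated on the slide.

```python
from itertools import combinations
from statistics import fmean

population = [1, 2, 3, 4, 5, 6]

# Every distinct sample of size 2, drawn without replacement (an assumption).
samples = list(combinations(population, 2))
sample_means = [fmean(s) for s in samples]

print(f"number of possible samples: {len(samples)}")
print(f"smallest sample mean: {min(sample_means)}")
print(f"largest sample mean: {max(sample_means)}")
print(f"mean of all sample means: {fmean(sample_means)}")
print(f"population mean: {fmean(population)}")
```

Individual sample means range from 1.5 to 5.5, yet their average equals the population mean of 3.5.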

Central limit theorem

Central limit theorem
The means obtained from many samplings from the same population have the following properties:
- The distribution of the means is approximately normal if the sample size is big enough (above 120 or so), regardless of the shape of the population’s distribution
- The mean of the sample means is equal to the population mean
- The standard deviation of the sample means, known as the standard error of the mean (SEM), shrinks as the sample size grows (it is inversely proportional to the square root of the sample size): if we repeat the experiment with a bigger sample size, the resultant histogram will be “slimmer”

Understanding CLT through simulation
- Population size: 10000
- Possible values: 0 through 9, 1000 each
- True population mean: 4.50

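Simulation scheme
From a population of 10000, draw a sample of n = 500, compute its mean (for example, 4.5), and add that mean to a histogram; repeat.
[Figure: histogram of sample means; x-axis: sample mean; y-axis: frequency]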

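Sample size = 500; # of draws = 10000
[Figure: histogram of sample means centered at 4.5, with the 68%, 95%, and 99% regions marked; x-axis: sample means; y-axis: frequency]
SD of the sample means (the SE) = 0.13
- Within ±1 SE: 67.95%
- Within ±2 SE: 95.04%
- Within ±3 SE: 99.10%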
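The simulation can be approximated with a short Python sketch using only the standard library. Two assumptions are not taken from the slides: each sample is drawn without replacement, and the run uses 2000 draws rather than 10000 to keep it quick. The function names are illustrative, and the printed figures will differ slightly from the slide's because the draws are random.

```python
import random
from statistics import fmean, pstdev


def simulate_sample_means(sample_size, n_draws, seed=0):
    """Draw n_draws random samples and return the list of their means."""
    rng = random.Random(seed)
    # Population from the slide: values 0 through 9, 1000 copies of each.
    population = [value for value in range(10) for _ in range(1000)]
    # Assumption: each sample is drawn without replacement.
    return [fmean(rng.sample(population, sample_size)) for _ in range(n_draws)]


def share_within(means, center, spread, k):
    """Fraction of means lying within k * spread of center."""
    return sum(abs(m - center) <= k * spread for m in means) / len(means)


# 2000 draws instead of the slide's 10000, only to keep the run short.
means = simulate_sample_means(sample_size=500, n_draws=2000)
center = fmean(means)
se = pstdev(means)  # SD of the sample means, i.e. the standard error

print(f"mean of sample means = {center:.2f}, SE = {se:.2f}")
for k in (1, 2, 3):
    print(f"within +/-{k} SE: {share_within(means, center, se, k):.2%}")
```

Passing `n_draws=10000` matches the slide's setup at the cost of a longer run.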

Characteristics for the distribution of means
- In the previous slide, the mean of the sample means, 4.5, equals the true population mean, a parameter for which we have a Greek name, μ (mu)
- Similarly, the population SD is a parameter called σ (sigma) in Greek. The 0.13 in the previous slide is not σ itself but the SD of the sample means, which is σ / √n
- We call this SD of means the “standard error of the mean” (SEM) or “standard error” (SE)
- SE can be estimated using the sample SD: SE = SD / √n, where n is the sample size
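The usual estimate of the standard error from a single sample is the sample SD divided by the square root of the sample size. The sketch below applies it to the ten values from the mean slide, purely as an arithmetic illustration; the function name `estimated_se` is made up here.

```python
import math
from statistics import stdev


def estimated_se(sample):
    """Estimate the standard error of the mean as sample SD / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))


sample = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6]  # the ten values from the mean slide

print(f"sample SD = {stdev(sample):.2f}")
print(f"estimated SE = {estimated_se(sample):.2f}")
```

With an SD of about 1.49 and n = 10, the estimated SE is about 0.47.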

Why bigger sample sizes are often better
[Figure: three histograms of sample means, one per sample size]
- Sample size = 200: SE = 0.20
- Sample size = 500: SE = 0.13
- Sample size = 1000: SE = 0.08
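The shrinking SE follows from the formula SE = σ / √n. The sketch below computes this theoretical value for the simulated population, assuming independent draws; simulated values such as those on the slide can differ from it in the last digit. The function name is illustrative.

```python
import math
from statistics import pstdev

# Population from the simulation slides: values 0 through 9, 1000 of each.
population = [value for value in range(10) for _ in range(1000)]
sigma = pstdev(population)  # population SD


def theoretical_se(sigma, n):
    """Standard error of the mean for independent draws: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


for n in (200, 500, 1000):
    print(f"n = {n:4d}: SE = {theoretical_se(sigma, n):.3f}")

# Because of the square root, quadrupling the sample size halves the SE.
ratio = theoretical_se(sigma, 800) / theoretical_se(sigma, 200)
print(f"SE(n=800) / SE(n=200) = {ratio:.2f}")
```

The square root means that gains in precision get more expensive: halving the SE takes four times as many observations.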

Confidence interval

I got CLT, so now what?
- The histogram can be viewed as a “probability distribution”
- The sample mean from a researcher can be any pixel under the bell curve
- How should we define “acceptably close” to the population mean?
[Figure: bell curve with a 95% region marked]

The confidence interval
[Figure: 95% interval]

If we put a 95% CI on every sample mean, about 95% of them would include the true mean. The two red ones are the “unlucky” samples whose CIs do not include the true mean.
[Figure: CIs from many samples plotted against the true mean]
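This coverage claim can be checked by simulation. The sketch below reuses the population from the CLT slides and builds a CI of mean ± 1.96 SE around each sample mean. The sample size of 200, the 1000 repetitions, sampling without replacement, and the function name are all assumptions made for this illustration.

```python
import math
import random
from statistics import fmean, stdev


def ci_coverage(n_samples=1000, sample_size=200, z=1.96, seed=1):
    """Fraction of z-based CIs (mean +/- z * SE) that contain the true mean."""
    rng = random.Random(seed)
    # Population from the CLT slides: values 0 through 9, 1000 of each.
    population = [value for value in range(10) for _ in range(1000)]
    true_mean = fmean(population)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)
        center = fmean(sample)
        # SE estimated from the sample itself: sample SD / sqrt(n).
        se = stdev(sample) / math.sqrt(sample_size)
        if center - z * se <= true_mean <= center + z * se:
            hits += 1
    return hits / n_samples


print(f"share of 95% CIs containing the true mean: {ci_coverage():.1%}")
```

The printed share should land close to 95%; the remaining intervals correspond to the “unlucky” samples.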

Interpretation of a confidence interval
The mean and 95% confidence interval (CI) of the blood glucose of a sample is: 140 mg/dl (95% CI: 120, 160)
- We are 95% confident that the interval between 120 and 160 mg/dl includes the true population mean.
- Our best estimate is 140 mg/dl (i.e., the sample mean)
- Why only 95% certain? Because the sample mean can, unfortunately, be an extreme one beyond ±2 SE (the blue zones)
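A 95% CI for a mean is calculated as mean ± 1.96 SE, with SE = SD / √n. The slide reports only the mean and the interval, so the SD of 102 and sample size of 100 in the sketch below are hypothetical values chosen to give an SE of about 10; the function name is also illustrative.

```python
import math


def ci_for_mean(mean, sd, n, z=1.96):
    """Return (lower, upper) limits of mean +/- z * SE, with SE = sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se


# Hypothetical inputs: the slide gives only the mean and the interval,
# so the SD and sample size below are made up for illustration.
lower, upper = ci_for_mean(mean=140.0, sd=102.0, n=100)
print(f"95% CI: {lower:.1f} to {upper:.1f} mg/dl")
```

With these inputs the limits come out at roughly 120 and 160 mg/dl, matching the interval on the slide.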

Some common CIs and their z-score multipliers
There are two numbers in a confidence interval: the lower and upper confidence limits
- 90% CI: Mean ± 1.65 SE
- 95% CI: Mean ± 1.96 SE
  - 2.00 is an approximation; 1.96 is recommended
  - The most commonly used criterion
- 99% CI: Mean ± 2.58 SE
The more certain we want to be that the interval includes the true mean, the wider the CI becomes
- “I am 100% certain that the true mean is between –∞ and ∞.”
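These multipliers come from the standard normal distribution: for a two-sided interval, the multiplier is the z-score that leaves half of the excluded probability in each tail. A minimal sketch using `statistics.NormalDist` follows; the function name `z_multiplier` is made up for this illustration.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Two-sided z multiplier for a confidence level such as 0.95."""
    tail = (1.0 - confidence) / 2.0  # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} CI: mean +/- {z_multiplier(confidence):.3f} SE")
```

To three decimals the multipliers are 1.645, 1.960, and 2.576, which the slide rounds to two decimals.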

How to narrow a confidence interval?
- Lower our certainty by opting for, say, a 90% CI instead of a 95% CI
- Decrease the sample standard deviation (for instance, by using a more precise measurement device)
- Increase the sample size
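All three options act on the half-width of the interval, z × SD / √n. The sketch below compares them against a hypothetical baseline of SD = 20 and n = 100; these numbers and the function name are made up for illustration.

```python
import math


def half_width(sd, n, z):
    """Half-width of a z-based CI for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)


# Hypothetical baseline: SD = 20, n = 100, 95% CI (z = 1.96).
baseline = half_width(sd=20.0, n=100, z=1.96)

# Option 1: accept a 90% CI instead of a 95% CI (z = 1.65).
lower_confidence = half_width(sd=20.0, n=100, z=1.65)

# Option 2: halve the sample SD.
smaller_sd = half_width(sd=10.0, n=100, z=1.96)

# Option 3: quadruple the sample size.
larger_sample = half_width(sd=20.0, n=400, z=1.96)

print(f"baseline:           +/-{baseline:.2f}")
print(f"90% instead of 95%: +/-{lower_confidence:.2f}")
print(f"half the SD:        +/-{smaller_sd:.2f}")
print(f"four times the n:   +/-{larger_sample:.2f}")
```

Halving the SD and quadrupling the sample size narrow the interval by the same amount, because the sample size enters through its square root.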

Are confidence intervals always symmetric?
Not always.
- CIs for untransformed continuous variables are symmetric
- However, CIs for other statistics such as odds ratios and relative risks are calculated on a logarithmic scale. When back-transformed to ratios, the interval will be asymmetric
  - “Multivariable analysis revealed a more than 2-fold increase in the risk of total stroke among men with job strain (combination of high job demand and low job control) (hazard ratio, 2.73; 95% confidence interval, 1.17-6.38)”
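The quoted hazard ratio shows the asymmetry. The sketch below compares the distances from the estimate to each limit on the ratio scale and on the natural-log scale. It then rebuilds the limits, assuming the interval was computed as log(HR) ± 1.96 SE; that is a common approach but is not confirmed by the quote.

```python
import math

hazard_ratio, lower, upper = 2.73, 1.17, 6.38  # values quoted on the slide

# Distances from the estimate to each limit on the original (ratio) scale.
print(f"ratio scale: {hazard_ratio - lower:.2f} below, {upper - hazard_ratio:.2f} above")

# The same distances on the natural-log scale are nearly equal.
log_below = math.log(hazard_ratio) - math.log(lower)
log_above = math.log(upper) - math.log(hazard_ratio)
print(f"log scale:   {log_below:.2f} below, {log_above:.2f} above")

# Assumption: the interval is log(HR) +/- 1.96 * SE, back-transformed with exp.
se_log = (math.log(upper) - math.log(lower)) / (2 * 1.96)
rebuilt_lower = math.exp(math.log(hazard_ratio) - 1.96 * se_log)
rebuilt_upper = math.exp(math.log(hazard_ratio) + 1.96 * se_log)
print(f"rebuilt CI:  {rebuilt_lower:.2f} to {rebuilt_upper:.2f}")
```

On the ratio scale the upper limit sits much farther from the estimate than the lower limit, while on the log scale the two distances agree to about two decimals.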