Chapter 18 Sampling distribution models math 2200
Sample proportion • Kerry vs. Bush in 2004 – A Gallup Poll • 49% for Kerry – A Rasmussen Poll • 45.9% for Kerry – Why are the answers different? • The sample proportion estimates the population proportion • There is randomness due to sampling
Modeling the Distribution of Sample Proportions • Imagine what would happen to the sample proportions if we were to actually draw many samples. • What would the histogram of all the sample proportions look like? – We would expect the histogram of the sample proportions to center at the true proportion, p, in the population – We would also expect it to be unimodal and symmetric – A normal model?
Model • Let X be the number of people voting for Bush in a sample of size n • Then X has a binomial model, Binomial(n, p) – p: the proportion of people for Bush in the entire population • When n is large, we can use the normal approximation – Normal model with mean np and variance npq, where q = 1 - p
Modeling sample proportion • Sample proportion is X/n – Normal model with mean p and variance pq/n
Example • Back to Kerry vs. Bush – Assume that the population proportion voting for Kerry is 49% – X/n has a normal model with mean 0.49 and standard deviation sqrt(0.49 × 0.51/1000) = 0.0158 (n=1000) – Both 49% and 45.9% are then plausible sample results: 45.9% lies about 2 standard deviations below the mean
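The figures on this slide can be checked with a short simulation. The sketch below is only illustrative: the function name `simulate_sample_proportions`, the seed, and the number of repeated samples are assumptions and do not come from the slides. It draws many samples of size 1000 from a population with p = 0.49 and compares the spread of the sample proportions with sqrt(pq/n).

```python
import math
import random
from statistics import mean, pstdev

def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size n; return the sample proportion of each."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    proportions = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p, n = 0.49, 1000
theoretical_sd = math.sqrt(p * (1 - p) / n)  # sqrt(pq/n)
props = simulate_sample_proportions(p, n, reps=2000)

print(f"theoretical sd: {theoretical_sd:.4f}")
print(f"simulated mean: {mean(props):.4f}")
print(f"simulated sd:   {pstdev(props):.4f}")

# Distance of the 45.9% poll result from the assumed p, in standard deviations.
z_rasmussen = (0.459 - p) / theoretical_sd
print(f"z for 45.9%:    {z_rasmussen:.2f}")
```

The simulated mean and standard deviation should land close to p and sqrt(pq/n), which is what the normal model on this slide predicts.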
Conditions • Normal model is an approximation to the exact model – Use it only when n is large – For example, if n=2, then X/n = 0, 0.5 or 1 • 1. Randomization Condition: The sample should be a simple random sample of the population. 2. 10% Condition: If sampling has not been made with replacement, then the sample size, n, must be no larger than 10% of the population. 3. Success/Failure Condition: The sample size has to be big enough so that both np and nq are greater than 10.
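The two numeric conditions can be written as a small helper. This is a sketch under stated assumptions: the function name `check_proportion_conditions` and its return format are hypothetical, and the Randomization Condition is deliberately left out because it depends on how the data were gathered, not on any number.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the numeric conditions for the normal model of X/n.

    The Randomization Condition is not checked here; it has to be judged
    from the sampling design.
    """
    q = 1 - p
    checks = {
        # Success/Failure Condition: both np and nq greater than 10.
        "success_failure": n * p > 10 and n * q > 10,
    }
    if population_size is not None:
        # 10% Condition: sample no larger than 10% of the population.
        checks["ten_percent"] = n <= 0.10 * population_size
    return checks

# n = 2 is far too small for the normal approximation.
print(check_proportion_conditions(n=2, p=0.5))
# A sample of 90 with p = 0.13 from a hypothetical population of 1000.
print(check_proportion_conditions(n=90, p=0.13, population_size=1000))
```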
A Sampling Distribution Model for a Proportion • Before we observe the value of the sample proportion, it is a random variable that has a distribution due to sampling variations. – This distribution is called the sampling distribution model for sample proportions. – We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them. – Still, sampling distribution models are important because • they act as a bridge from the real world of data to the imaginary model of the statistic and • enable us to say something about the population when all we have is data from the real world.
An example • 13% of the population is left-handed. • A 200-seat school auditorium was built with 15 “leftie seats” • In a class of n=90 students, what’s the probability that there will NOT be enough seats for the left-handed students? • Let X be the number of left-handed students in the class • We want to find P(X>15) = P(X/n>0.167)
• Check the conditions – n is large enough – randomization – 10% condition • The population should have at least 900 students – Success/failure condition • np=11.7>10, nq=78.3>10 • Normal model for X/n – Mean = 0.13 – Sd = sqrt(pq/n) = 0.035 • P(X/n>0.167) = P(Z>1.06) = 0.1446
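The calculation above can be reproduced with Python's standard library. The sketch below is illustrative and its variable names are assumptions. It computes the probability twice: once from unrounded values, and once from the rounded values used on the slide (sd = 0.035, threshold = 0.167, z rounded to two decimals as in a normal table).

```python
import math
from statistics import NormalDist

p, n, seats = 0.13, 90, 15

# Normal model for X/n: mean p, standard deviation sqrt(pq/n).
sd = math.sqrt(p * (1 - p) / n)
threshold = seats / n
prob_not_enough = 1 - NormalDist(mu=p, sigma=sd).cdf(threshold)

# Same calculation with the slide's rounded values and a two-decimal z.
z_rounded = round((0.167 - 0.13) / 0.035, 2)
prob_rounded = 1 - NormalDist().cdf(z_rounded)

print(f"sd = {sd:.4f}")
print(f"P(X/n > 15/90), unrounded inputs: {prob_not_enough:.4f}")
print(f"P(Z > {z_rounded}), rounded inputs: {prob_rounded:.4f}")
```

The two results differ slightly because of the intermediate rounding. Either way, the chance of running out of leftie seats is roughly 15%.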
Sample Mean • The distribution of sample means tends to a normal model when n is large
Central limit theorem (CLT) • If the observations are drawn – independently – from the same population (distribution) the sampling distribution of the sample mean becomes normal as the sample size increases. • We do not need to know the population distribution.
CLT • Suppose the population distribution has mean μ and standard deviation σ • The sample mean has mean μ and standard deviation σ/sqrt(n) • Let X1, …, Xn be n independent and identically distributed random variables – E(X1) = μ – Var(X1) = σ^2 • Then as n increases, the distribution of (X1+…+Xn)/n tends to a normal model with mean μ and standard deviation σ/sqrt(n)
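A simulation makes the statement above concrete. The sketch below is illustrative only: the choice of an exponential population (strongly right-skewed, with μ = 1 and σ = 1), the seed, the sample sizes, and the function name `simulate_sample_means` are all assumptions made for the example. For each n it compares the standard deviation of many simulated sample means with σ/sqrt(n).

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(1)   # fixed seed (an assumption) for repeatability
mu, sigma = 1.0, 1.0     # mean and sd of an exponential population with rate 1

def simulate_sample_means(n, reps):
    """Return `reps` sample means, each from n independent exponential draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

results = {}
for n in (2, 10, 50):
    means = simulate_sample_means(n, reps=5000)
    results[n] = (mean(means), pstdev(means))
    print(f"n={n:3d}  mean of means={results[n][0]:.3f}  "
          f"sd of means={results[n][1]:.3f}  sigma/sqrt(n)={sigma / math.sqrt(n):.3f}")
```

Plotting a histogram of `means` for each n would also show the shape becoming more symmetric as n grows, even though the population itself is skewed.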
The Fundamental Theorem of Statistics The Central Limit Theorem (CLT) The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be.
Example • Suppose the population distribution of adult weights has mean 175 pounds and sd 25 pounds – the shape is unknown • An elevator has a weight limit of 10 persons or 2000 pounds • What’s the probability that the 10 people who get on the elevator overload its weight limit?
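• Let Xi, i=1, 2, …, 10 be the weight of the ith person in the elevator • Then we want to know P(X1+…+X10>2000) = P((X1+…+X10)/10>200) • From the CLT (check the requirement first), we know the distribution of (X1+…+X10)/10 is approximately normal with mean 175 pounds and standard deviation 25/sqrt(10) = 7.91 pounds • Then P((X1+…+X10)/10>200) = P(Z>(200-175)/7.91) = P(Z>3.16) = 0.0008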
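The elevator calculation can be carried out in a few lines. This is a sketch with assumed variable names. It uses the normal model from the CLT, which is only a rough approximation for n = 10 if the weight distribution is strongly skewed.

```python
import math
from statistics import NormalDist

mu, sigma = 175, 25      # population mean and sd of adult weights (pounds)
n, limit = 10, 2000      # people on the elevator and total weight limit

mean_threshold = limit / n                 # total > 2000 means average > 200
se = sigma / math.sqrt(n)                  # sd of the sample mean
z = (mean_threshold - mu) / se
prob_overload = 1 - NormalDist().cdf(z)    # P(Z > z) under the normal model

print(f"sd of sample mean = {se:.2f}")
print(f"z = {z:.2f}")
print(f"P(overload) = {prob_overload:.4f}")
```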
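Standard error • Using the CLT, we know the distribution of the sample proportion is approximately normal with mean p and standard deviation sqrt(pq/n) • However, we do not know p in practice. • Using the CLT, we know the distribution of the sample mean is approximately normal with mean μ and standard deviation σ/sqrt(n) • However, we do not know μ and σ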
Standard Error • When we don’t know p or σ, we’re stuck, right? • Nope. We will use sample statistics to estimate these population parameters. • Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error.
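Standard Error (cont.) • For a sample proportion, the standard error is SE = sqrt(p̂q̂/n), where p̂ is the sample proportion and q̂ = 1 - p̂ • For the sample mean, the standard error is SE = s/sqrt(n), where s is the sample standard deviation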
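Both standard errors are simple to compute from sample data. The sketch below is illustrative: the function names `se_proportion` and `se_mean` are hypothetical, the poll size of 1000 is the assumption used earlier in these slides, and the list of weights is made-up data used only to show the call.

```python
import math
from statistics import stdev

def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample sd."""
    return stdev(values) / math.sqrt(len(values))

# 45.9% for Kerry in a poll of an assumed 1000 respondents.
print(f"SE of proportion: {se_proportion(459, 1000):.4f}")

# Made-up weights (pounds), purely for illustration.
weights = [150, 160, 170, 180, 190]
print(f"SE of mean: {se_mean(weights):.2f}")
```

Note that `stdev` divides by n - 1, which matches the usual definition of the sample standard deviation s.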
The Process Going Into the Sampling Distribution Model
What Can Go Wrong? • Don’t confuse the sampling distribution with the distribution of the sample. – When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics. – The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples—the one you got and the ones you didn’t get.
What Can Go Wrong? (cont.) • Beware of observations that are not independent. – The CLT depends crucially on the assumption of independence. – You can’t check this with your data—you have to think about how the data were gathered. • Watch out for small samples from skewed populations. – The more skewed the distribution, the larger the sample size we need for the CLT to work.
Summary • Sample proportions or sample means are statistics – They are random because samples vary – Their distribution can be approximated by normal using the CLT • Be aware of when the CLT can be used – n is large – If the population distribution is not symmetric, a much larger n is needed • The CLT is about the distribution of the sample mean, not the sample itself