STAT 101, Dr. Kari Lock Morgan, 10/23/12

Inference for Proportions: Normal Distribution
Sections 6.1-6.3, 6.7-6.9
• Single proportion, p
  • Distribution (6.1)
  • Intervals and tests (6.2, 6.3)
• Difference in proportions, p1 – p2
  • One proportion or two? (6.7)
  • Distribution (6.7)
  • Intervals and tests (6.8, 6.9)
Statistics: Unlocking the Power of Data, Lock5

Central Limit Theorem!
For a sufficiently large sample size, the distribution of sample statistics for a mean or a proportion is approximately normal.

Interval Using N(0, 1)
IF SAMPLE SIZES ARE LARGE… a confidence interval can be calculated by
$$\text{sample statistic} \pm z^* \cdot SE$$
where z* is a N(0, 1) percentile depending on the level of confidence.

Tests Using N(0, 1)
IF SAMPLE SIZES ARE LARGE… a p-value is the area in the tail(s) of a N(0, 1) beyond
$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

Standard Errors
Today, we’ll learn formulas for the standard errors.

SE of a Proportion
The standard error for a sample proportion can be calculated by
$$SE = \sqrt{\frac{p(1-p)}{n}}$$
*Notice the sample size in the denominator: as the sample size increases, the standard error decreases.

Paul the Octopus
If he is truly guessing randomly, then p = 0.5, so the SE of his sample proportion correct out of 8 guesses is
$$SE = \sqrt{\frac{0.5(1-0.5)}{8}} \approx 0.177$$
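A minimal sketch of this calculation in Python, assuming the usual formula SE = sqrt(p(1 - p)/n) for a sample proportion. The function name `proportion_se` is illustrative and not from the course materials.

```python
import math

def proportion_se(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Paul the Octopus: 8 guesses, p = 0.5 if he is guessing randomly
se_paul = proportion_se(0.5, 8)
print(round(se_paul, 3))  # 0.177

# Same p, larger n: the standard error shrinks
print(proportion_se(0.5, 800) < se_paul)  # True
```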

Paul the Octopus
This is the same value we get from a randomization distribution…
www.lock5stat.com/statkey
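A rough way to see this without StatKey is to simulate many sets of 8 random guesses and take the standard deviation of the simulated sample proportions. This is only a sketch: the helper `simulated_se`, the number of repetitions, and the seed are arbitrary choices, not part of the lecture.

```python
import random
import statistics

def simulated_se(p: float, n: int, reps: int = 10_000, seed: int = 0) -> float:
    """Estimate the SE of a sample proportion by simulation."""
    rng = random.Random(seed)
    # Each simulated sample: n guesses, each correct with probability p
    props = [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]
    # The SE is the standard deviation of the simulated proportions
    return statistics.pstdev(props)

sim_se = simulated_se(0.5, 8)
print(round(sim_se, 3))  # should land close to the formula value of about 0.177
```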

Paul the Octopus
If Paul really does have psychic powers, and can guess the correct team every time, then p = 1, and
$$SE = \sqrt{\frac{1(1-1)}{8}} = 0$$


CLT for a Proportion
If counts for each category are at least 10 (np ≥ 10 and n(1 – p) ≥ 10), then
$$\hat{p} \sim N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$$
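The count condition is easy to check before relying on the normal approximation. The helper below is a hypothetical sketch (`normal_approx_ok` is not a library function); it simply tests np ≥ 10 and n(1 – p) ≥ 10.

```python
def normal_approx_ok(n: int, p: float, min_count: float = 10) -> bool:
    """True if both expected category counts are at least min_count."""
    return n * p >= min_count and n * (1 - p) >= min_count

# Paul the Octopus: only 8 guesses, so expected counts are 4 and 4
print(normal_approx_ok(8, 0.5))     # False

# A poll of 500 with p near 0.52: expected counts are about 260 and 240
print(normal_approx_ok(500, 0.52))  # True
```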

Standard Error
• One small problem… if we are doing inference for p, we don’t know p!
• For confidence intervals, use your best guess for p, the sample proportion:
$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Confidence Interval for a Single Proportion
$$\hat{p} \pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Obama vs Romney
On 10/17/12, a random sample of 500 North Carolina likely voters was polled. 260 said they plan to vote for Mitt Romney. Give a 95% CI for the proportion of likely voters in North Carolina that support Mitt Romney.
http://www.rasmussenreports.com/public_content/politics/elections/election_2012/election_2012_presidential_election/north_carolina/election_2012_north_carolina_president

Obama vs Romney
Counts are greater than 10 in each category (260 for Romney, 240 not).
For a 95% confidence interval, z* = 2:
$$\hat{p} \pm z^* \cdot SE = 0.52 \pm 2\sqrt{\frac{0.52(1-0.52)}{500}} = 0.52 \pm 0.045$$
We are 95% confident that between 47.5% and 56.5% of likely voters in North Carolina support Romney.
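The interval can be reproduced with a few lines of Python. This is a sketch using the poll numbers from the slide (260 of 500) and z* = 2; the function name `proportion_ci` is illustrative.

```python
import math

def proportion_ci(successes: int, n: int, z_star: float = 2.0) -> tuple[float, float]:
    """Normal-based confidence interval for a single proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE using the sample proportion
    margin = z_star * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(260, 500)
print(round(low, 3), round(high, 3))  # 0.475 0.565
```

Because the interval contains 0.5, this sample alone does not settle which candidate leads.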


Other Levels of Confidence
www.lock5stat.com/statkey
Technically, for 95% confidence, z* = 1.96, but 2 is much easier to remember, and close enough.

z* on TI-83
2nd DISTR, 3: invNorm(proportion below z*)
(For a 95% CI, the proportion below z* is 0.975.)
[Figure: N(0, 1) curve with the middle P% lying between -z* and z*]
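For readers without a TI-83, the same lookup can be sketched with Python's standard library, where `NormalDist().inv_cdf` plays the role of invNorm. The wrapper name `z_star` is an illustrative choice.

```python
from statistics import NormalDist

def z_star(confidence: float) -> float:
    """N(0, 1) percentile that leaves `confidence` in the middle."""
    # Proportion below z* is the middle area plus the lower tail
    proportion_below = 1 - (1 - confidence) / 2
    return NormalDist().inv_cdf(proportion_below)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
# 0.9 1.645
# 0.95 1.96
# 0.99 2.576
```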

Margin of Error
For a single proportion, what is the margin of error?
[Answer choices (a)-(c): formulas shown as images on the slide]
CI = statistic ± margin of error, so for a single proportion the margin of error is
$$ME = z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Margin of Error
You can choose your sample size in advance, depending on your desired margin of error! Given this formula for margin of error, solve for n:
$$ME = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad\Rightarrow\quad n = \left(\frac{z^*}{ME}\right)^2 \hat{p}(1-\hat{p})$$

Margin of Error
Suppose we want to estimate a proportion with a margin of error of 0.03 with 95% confidence. How large a sample size do we need?
(a) About 100
(b) About 500
(c) About 1000
(d) About 5000
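One way to work this out is to solve ME = z* · sqrt(p(1 - p)/n) for n. The sketch below assumes p = 0.5 as a conservative guess, since the question gives no prior estimate and p(1 - p) is largest there. The function name and defaults are illustrative.

```python
import math

def sample_size(margin: float, z_star: float = 2.0, p_guess: float = 0.5) -> int:
    """Smallest n giving at most the desired margin of error."""
    return math.ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

print(sample_size(0.03))               # 1112
print(sample_size(0.03, z_star=1.96))  # 1068
```

Under these assumptions the answer is on the order of 1000.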

Hypothesis Testing
For hypothesis testing, we want the distribution of the sample proportion assuming the null hypothesis is true. What to use for p?

Hypothesis Testing
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$
where p_0 is the null value of p. The p-value is the area in the tail(s) beyond z in a N(0, 1).

Baseball Home Field Advantage
Of the 2430 Major League Baseball (MLB) games played in 2009, the home team won in 54.9% of the games. If we consider 2009 as a representative sample of all MLB games, is this evidence of a home field advantage in Major League Baseball?
(a) Yes
(b) No
(c) No idea
The p-value is very small, so we have very strong evidence of a home field advantage.

Baseball Home Field Advantage
Counts are greater than 10 in each category.
Based on this data, there is strong evidence of a home field advantage in Major League Baseball.
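A sketch of the test behind this conclusion, assuming the hypotheses are H0: p = 0.5 versus Ha: p > 0.5 (no home field advantage versus an advantage) and that the standard error is computed under the null value. The function name is illustrative.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(p_hat: float, p0: float, n: int) -> tuple[float, float]:
    """z statistic and upper-tail p-value for H0: p = p0 vs Ha: p > p0."""
    se = math.sqrt(p0 * (1 - p0) / n)    # SE uses the null value, not p_hat
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)    # area in the upper tail beyond z
    return z, p_value

z, p_value = one_proportion_z_test(0.549, 0.5, 2430)
print(round(z, 2))      # 4.83
print(p_value < 0.001)  # True
```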


p-value on TI-83
2nd DISTR, 2: normalcdf(lower bound, upper bound)
Hint: if you want the area greater than 2, just enter 2, 100 (or some other large number as the upper bound).
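The calculator function can be mimicked with Python's standard library. This is a sketch: `normalcdf` here is a hypothetical helper named after the TI-83 function, built on `statistics.NormalDist`.

```python
from statistics import NormalDist

def normalcdf(lower: float, upper: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Area under a normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Upper-tail area beyond z = 2, using a large number as the upper bound
upper_tail = normalcdf(2, 100)
print(round(upper_tail, 4))  # 0.0228

# Two-tailed area beyond |z| = 2
two_tail = normalcdf(-100, -2) + normalcdf(2, 100)
print(round(two_tail, 4))    # 0.0455
```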

One Proportion or Two?
• Two proportions: there are two separate categorical variables
• One proportion: there is only one categorical variable

One Proportion or Two?
Of residents in the Triangle area on Saturday, was the proportion of people cheering for Duke or for UNC greater? How much greater?
a) Inference for one proportion
b) Inference for two proportions
(Note: assume no one will be cheering for both.)
This is one categorical variable: which team each person will be cheering for on Saturday night.

One Proportion or Two?
Who was more likely to be wearing a blue shirt on Saturday night, a UNC fan or a Duke fan?
a) Inference for one proportion
b) Inference for two proportions
This is two categorical variables: which team each person will be cheering for on Saturday night, and whether each person is wearing a blue shirt.


If counts within each category (each cell of the two-way table) are at least 10, then
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right)$$

Metal Tags and Penguins
Are metal tags detrimental to penguins? A study looked at the 10-year survival rate of penguins tagged either with a metal tag or an electronic tag. 20% of the 167 metal-tagged penguins survived, compared to 36% of the 189 electronic-tagged penguins. Give a 90% confidence interval for the difference in proportions.
Source: Saraux et al. (2011). “Reliability of flipper-banded penguins as indicators of climate change,” Nature, 469, 203-206.

Metal Tags and Penguins
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = (0.20 - 0.36) \pm 1.645\sqrt{\frac{0.20(0.80)}{167} + \frac{0.36(0.64)}{189}} = -0.16 \pm 0.077$$
We are 90% confident that the survival rate is between 0.083 and 0.237 lower for metal-tagged penguins than for electronically tagged penguins.
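The interval for the difference can be sketched as follows, taking the reported proportions (0.20 of 167 and 0.36 of 189) at face value and using the unpooled standard error sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2). The function name is illustrative.

```python
import math
from statistics import NormalDist

def diff_proportions_ci(p1: float, n1: int, p2: float, n2: int,
                        confidence: float = 0.90) -> tuple[float, float]:
    """Normal-based confidence interval for p1 - p2."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# Group 1 = metal tags, group 2 = electronic tags
low, high = diff_proportions_ci(0.20, 167, 0.36, 189)
print(round(low, 3), round(high, 3))  # -0.237 -0.083
```

Both endpoints are negative, which corresponds to a lower survival rate for metal-tagged penguins.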

Metal Tags and Penguins
www.lock5stat.com/statkey

Hypothesis Testing
What should we use for p1 and p2 in the formula for SE for hypothesis testing?

Pooled Proportion
The overall sample proportion across both groups. It will be in between the two observed sample proportions.

Hypothesis Testing
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n_1} + \dfrac{\hat{p}(1-\hat{p})}{n_2}}}$$
where \hat{p} is the pooled proportion. The p-value is the area in the tail(s) beyond z in a N(0, 1).

Metal Tags and Penguins
20% of the 167 metal-tagged penguins survived, compared to 36% of the 189 electronic-tagged penguins. Are metal tags detrimental to penguins?
(a) Yes. The p-value is very small.
(b) No
(c) Cannot tell from this data

Metal Tags and Penguins
Are metal tags detrimental to penguins?

Metal Tags and Penguins
This is very strong evidence that metal tags are detrimental to penguins.
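A sketch of the pooled two-proportion test that supports this conclusion. It assumes H0: p1 = p2 versus Ha: p1 < p2 (metal-tagged survival is lower) and computes the pooled proportion as the sample-size-weighted average of the two reported proportions. The function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float, float]:
    """Pooled proportion, z statistic, and lower-tail p-value for Ha: p1 < p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)  # overall proportion across both groups
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = NormalDist().cdf(z)             # area in the lower tail beyond z
    return pooled, z, p_value

# Group 1 = metal tags, group 2 = electronic tags
pooled, z, p_value = two_proportion_z_test(0.20, 167, 0.36, 189)
print(round(pooled, 3), round(z, 2))  # 0.285 -3.34
print(p_value < 0.001)                # True
```

The pooled proportion falls between 0.20 and 0.36, as expected.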

Metal Tags and Penguins
www.lock5stat.com/statkey

Accuracy
• The accuracy of intervals and p-values generated using simulation methods (bootstrapping and randomization) depends on the number of simulations (more simulations = more accurate)
• The accuracy of intervals and p-values generated using formulas and the normal distribution depends on the sample size (larger sample size = more accurate)
• If the distribution of the statistic is truly normal and you have generated many simulated randomizations, the p-values should be very close

Summary
• For a single proportion:
$$\hat{p} \sim N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$$
• For a difference in proportions:
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right)$$

To Do
• Read Sections 6.1, 6.2, 6.3, 6.7, 6.8, 6.9
• Do Homework 5 (due Tuesday, 10/30)