CHAPTER 11 Inference for Distributions of Categorical Data

11.1b Chi-Square Tests for Goodness of Fit
The Practice of Statistics, 5th Edition
Starnes, Tabor, Yates, Moore
Bedford Freeman Worth Publishers

Chi-Square Tests for Goodness of Fit
Learning Objectives
After this section, you should be able to:
✓ STATE appropriate hypotheses and COMPUTE expected counts for a chi-square test for goodness of fit.
✓ CALCULATE the chi-square statistic, degrees of freedom, and P-value for a chi-square test for goodness of fit.
✓ PERFORM a chi-square test for goodness of fit.
✓ CONDUCT a follow-up analysis when the results of a chi-square test are statistically significant.

The Chi-Square Statistic
The chi-square statistic is a measure of how far the observed counts are from the expected counts. The formula for the statistic is

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the sum is over all possible values of the categorical variable.
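As a minimal sketch of the formula, the statistic can be computed directly from two lists of counts. The function name `chi_square_statistic` and the die-roll counts below are hypothetical, chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a six-sided die.
# If the die is fair, each face is expected 60 * (1/6) = 10 times.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

Larger values of the statistic mean the observed counts sit farther from what the hypothesized distribution predicts.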

Carrying Out a Test
Conditions for Performing a Chi-Square Test for Goodness of Fit
• Random: The data come from a well-designed random sample or from a randomized experiment.
  o 10%: When sampling without replacement, check that n ≤ (1/10)N.
• Large Counts: All expected counts are at least 5.

Before we start using the chi-square goodness-of-fit test, we have two important cautions to offer.
• The chi-square test statistic compares observed and expected counts. Don’t try to perform calculations with the observed and expected proportions in each category.
• When checking the Large Counts condition, be sure to examine the expected counts, not the observed counts.
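The two numerical conditions can be checked mechanically. The sketch below assumes hypothetical helper names (`ten_percent_ok`, `large_counts_ok`) and a made-up survey. It is set up so that every observed count is at least 5 while one expected count is not, which is the situation the second caution warns about.

```python
def ten_percent_ok(n, population_size):
    """10% condition: the sample is at most one-tenth of the population."""
    return n <= population_size / 10


def large_counts_ok(counts, minimum=5):
    """Large Counts condition: every count passed in is at least `minimum`.

    Pass the EXPECTED counts here, not the observed counts.
    """
    return all(c >= minimum for c in counts)


# Hypothetical survey: 50 people sampled from a population of 2000,
# three categories with hypothesized proportions 0.60, 0.32, 0.08.
n = 50
population_size = 2000
proportions = [0.60, 0.32, 0.08]
observed = [33, 11, 6]
expected = [n * p for p in proportions]  # about 30, 16, and 4

print("10% condition met:", ten_percent_ok(n, population_size))
print("Large Counts met (expected counts):", large_counts_ok(expected))
print("Misleading check on observed counts:", large_counts_ok(observed))
```

Here the observed counts all pass, but the third expected count is only about 4, so the Large Counts condition is not met.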

Carrying Out a Test
The Chi-Square Test for Goodness of Fit
Suppose the conditions are met. To determine whether a categorical variable has a specified distribution in the population of interest, expressed as the proportion of individuals falling into each possible category, perform a test of
H0: The stated distribution of the categorical variable in the population of interest is correct.
Ha: The stated distribution of the categorical variable in the population of interest is not correct.
Start by finding the expected count for each category assuming that H0 is true (the sample size times the proportion that H0 specifies for that category). Then calculate the chi-square statistic

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the sum is over the k different categories. The P-value is the area to the right of χ² under the chi-square density curve with k − 1 degrees of freedom.
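The whole procedure (expected counts, statistic, degrees of freedom, P-value) can be sketched with only the Python standard library. The names `chi2_sf` and `goodness_of_fit` are hypothetical, as are the counts and proportions. The right-tail area is built up with the standard recurrence for the upper incomplete gamma function, which is one possible approach and is not taken from the slides; a statistics package would normally supply this value.

```python
import math


def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Known starting points: df = 2 (even case) and df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Each pass raises the degrees of freedom by 2 until df is reached.
    while a < df / 2:
        q += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return q


def goodness_of_fit(observed, null_proportions):
    """Return (expected counts, chi-square statistic, df, P-value)."""
    n = sum(observed)
    expected = [n * p for p in null_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return expected, chi2, df, chi2_sf(chi2, df)


# Hypothetical example: 200 individuals, H0 proportions 0.3, 0.3, 0.2, 0.2.
observed = [66, 50, 38, 46]
expected, chi2, df, p_value = goodness_of_fit(observed, [0.3, 0.3, 0.2, 0.2])
print(f"chi-square = {chi2:.3f}, df = {df}, P-value = {p_value:.3f}")
```

A small P-value is evidence against H0; a large one means the observed counts are consistent with the stated distribution.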

Example: A test for equal proportions
Problem: In his book Outliers, Malcolm Gladwell suggests that a hockey player’s birth month has a big influence on his chance to make it to the highest levels of the game. Specifically, since January 1 is the cut-off date for youth leagues in Canada (where many National Hockey League (NHL) players come from), players born in January will be competing against players up to 12 months younger. The older players tend to be bigger, stronger, and more coordinated and hence get more playing time, more coaching, and have a better chance of being successful. To see if birth date is related to success (judged by whether a player makes it into the NHL), a random sample of 80 National Hockey League players from a recent season was selected and their birthdays were recorded.

Example: A test for equal proportions
Problem: The one-way table below summarizes the data on birthdays for these 80 players:

Birthday             Jan–Mar   Apr–Jun   Jul–Sep   Oct–Dec
Number of Players       32        20        16        12

Do these data provide convincing evidence that the birthdays of all NHL players are not evenly distributed among the four quarters of the year?
State: We want to perform a test of
H0: The birthdays of all NHL players are evenly distributed among the four quarters of the year.
Ha: The birthdays of all NHL players are not evenly distributed among the four quarters of the year.
No significance level was specified, so we’ll use α = 0.05.

Example: A test for equal proportions
Plan: If the conditions are met, we will perform a chi-square test for goodness of fit.
• Random: The data came from a random sample of NHL players.
  o 10%: Because we are sampling without replacement, there must be at least 10(80) = 800 NHL players. In the season when the data were collected, there were 879 NHL players, so this condition is met.
• Large Counts: If birthdays are evenly distributed across the four quarters of the year, then the expected counts are all 80(1/4) = 20. These counts are all at least 5.

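Example: A test for equal proportions
Do:
Test statistic:

$$\chi^2 = \frac{(32-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(16-20)^2}{20} + \frac{(12-20)^2}{20} = 7.2 + 0 + 0.8 + 3.2 = 11.2$$

P-value: With 4 categories, df = 4 − 1 = 3. In the df = 3 row of a table of chi-square critical values, χ² = 11.2 falls between 9.84 (tail probability 0.02) and 11.34 (tail probability 0.01), so it corresponds to a P-value between 0.01 and 0.02.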
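The Do step can be checked numerically with the counts from the one-way table. This is a sketch, not the textbook's technology output: the P-value line uses a closed-form expression for the right-tail area that holds only for 3 degrees of freedom, and the variable names are illustrative.

```python
import math

# Observed birthdays by quarter for the 80 sampled NHL players.
observed = {"Jan-Mar": 32, "Apr-Jun": 20, "Jul-Sep": 16, "Oct-Dec": 12}

n = sum(observed.values())       # 80 players
expected = n / len(observed)     # 20 per quarter if H0 is true

# Contribution of each quarter: (observed - expected)^2 / expected.
contributions = {q: (o - expected) ** 2 / expected for q, o in observed.items()}
chi2 = sum(contributions.values())
df = len(observed) - 1

# Right-tail area for df = 3 only (assumption: this closed form is
# specific to three degrees of freedom).
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

for quarter, c in contributions.items():
    print(f"{quarter}: contribution {c:.1f}")
print(f"chi-square = {chi2:.1f}, df = {df}, P-value = {p_value:.3f}")
```

The first quarter alone contributes 7.2 of the total, reflecting the excess of January to March birthdays.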

Example: A test for equal proportions
Conclude: Because the P-value, 0.011, is less than α = 0.05, we reject H0. We have convincing evidence that the birthdays of NHL players are not evenly distributed across the four quarters of the year.
Which type of error, Type I or Type II, could we have made? Describe in context.
It is possible we made a Type I error: finding convincing evidence that the birthdays of NHL players are not evenly distributed when they really are.

Landline surveys
According to the 2000 census, of all U.S. residents aged 20 and older, 19.1% are in their 20s, 21.5% are in their 30s, 21.1% are in their 40s, 15.5% are in their 50s, and 22.8% are 60 and older. The table below shows the age distribution for a sample of U.S. residents aged 20 and older. Members of the sample were chosen by randomly dialing landline telephone numbers. Do these data provide convincing evidence that the age distribution of people who answer landline telephone surveys is not the same as the age distribution of all U.S. residents?

Category   20–29   30–39   40–49   50–59   60+   Total
Count       141     186     224     211    286    1048
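One way to carry out the calculation for this problem is sketched below, taking H0 to be that landline respondents follow the census age distribution. The variable names are illustrative, and the P-value line uses a closed-form right-tail area that is valid only for 4 degrees of freedom.

```python
import math

# Census proportions (H0) and observed counts from the landline sample.
census = {"20-29": 0.191, "30-39": 0.215, "40-49": 0.211, "50-59": 0.155, "60+": 0.228}
observed = {"20-29": 141, "30-39": 186, "40-49": 224, "50-59": 211, "60+": 286}

n = sum(observed.values())  # 1048 respondents

# Expected count = sample size times the census proportion for that age group.
expected = {group: n * p for group, p in census.items()}

# Large Counts condition is checked on the expected counts.
large_counts_met = all(e >= 5 for e in expected.values())

chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)
df = len(observed) - 1

# Right-tail area for df = 4 only (assumption: closed form specific to df = 4).
half = chi2 / 2
p_value = math.exp(-half) * (1 + half)

for group, e in expected.items():
    print(f"{group}: observed {observed[group]}, expected {e:.1f}")
print(f"Large Counts met: {large_counts_met}")
print(f"chi-square = {chi2:.2f}, df = {df}, P-value = {p_value:.2e}")
```

The P-value is far below any usual significance level, which is consistent with the conclusion that the landline sample's age distribution differs from the census distribution.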

Landline surveys follow-up analysis
We concluded that the age distribution of people who answer landline telephone surveys is not the same as the age distribution of all U.S. residents. The table below shows the observed counts, expected counts, the difference in counts (O – E), and the chi-square contribution for each age category.

Age      Observed   Expected    O – E    (O – E)²/E
20–29      141       200.2      -59.2      17.49
30–39      186       225.3      -39.3       6.86
40–49      224       221.1        2.9       0.04
50–59      211       162.4       48.6      14.52
60+        286       238.9       47.1       9.27
Total     1048      1048.0        0.0      48.17

The two age groups that contributed the most to the chi-square statistic were the 20- to 29-year-olds (59.2 fewer than expected) and the 50- to 59-year-olds (48.6 more than expected).
Analysis notes:
• Only do a follow-up analysis when specifically asked to.
• When doing a follow-up analysis, don’t focus only on the size of the contribution; also discuss the direction of the difference.
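A follow-up analysis like this one can be sketched by ranking the categories by their chi-square contribution while keeping the sign of O – E, so that both size and direction are reported. The data are the landline counts and census proportions from the problem; the variable names are illustrative.

```python
census = {"20-29": 0.191, "30-39": 0.215, "40-49": 0.211, "50-59": 0.155, "60+": 0.228}
observed = {"20-29": 141, "30-39": 186, "40-49": 224, "50-59": 211, "60+": 286}
n = sum(observed.values())

# One row per age group: (group, observed, expected, O - E, contribution).
rows = []
for group, o in observed.items():
    e = n * census[group]
    rows.append((group, o, e, o - e, (o - e) ** 2 / e))

# Largest contributions first.
rows.sort(key=lambda row: row[4], reverse=True)

for group, o, e, diff, contribution in rows:
    direction = "more" if diff > 0 else "fewer"
    print(f"{group}: contribution {contribution:.2f}, "
          f"{abs(diff):.1f} {direction} than expected")

top_two = [row[0] for row in rows[:2]]
print("Largest contributors:", top_two)
```

Sorting on the contribution alone would hide whether a group is over- or under-represented, which is why the difference O – E is carried along in each row.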

Chi-Square Test for Goodness of Fit
Section Summary
In this section, we learned how to…
✓ STATE appropriate hypotheses and COMPUTE expected counts for a chi-square test for goodness of fit.
✓ CALCULATE the chi-square statistic, degrees of freedom, and P-value for a chi-square test for goodness of fit.
✓ PERFORM a chi-square test for goodness of fit.
✓ CONDUCT a follow-up analysis when the results of a chi-square test are statistically significant.
✓ Read p. 687-692 ccc 7, 9, 11, 15, 17