Statistics in R: The Chi Square Test, by Kelsey Huntzberry, MPH
Class Live Stream & Files Link • To access the live stream and files for class, go to my website: www.kelseyhuntzberry.com/data-science-classes/ (link at the bottom of the page)
Poster Session in 2020 • Discussed with library a poster session for students and others interested • You can complete a project with real data and create a poster • We will network with employers to encourage attendance • Way to demonstrate that you are proficient in statistics/R in front of people that matter • First session will be either in late January or early March • Stay tuned!
What Is a Chi Square Test? • Tests how likely it is that an observed distribution is due to chance • Used with two categorical variables • Compares the observed distribution to the distribution expected if the variables were independent
Chi Square Null and Alternative Hypotheses • Null Hypothesis (H0): Two categorical variables are independent • Alternative Hypothesis (Ha): Variables are dependent on one another or related in some way
Example Scenario • Want to detect whether the effectiveness of a new tuberculosis drug that is supposed to shorten a hospital stay varies by gender • Participants were tested for tuberculosis after 10 days • The average hospital stay is typically 15 days

Results of Tuberculosis Test
Gender    Negative   Positive
Female    70         15
Male      66         21
Expected If Independent • Need to estimate the values that may occur if the variables were independent • Compare these to what actually occurred in practice • Comparing the actual and expected values is how a chi square is calculated
Calculating Expected Values • Expected Value = (Row Sum * Column Sum) / Total Sum

Results of Tuberculosis Test
Gender    Negative   Positive   Totals
Female    70         15         85
Male      66         21         87
Totals    136        36         172

[1, 1] = 85*136/172 = 67.2
[1, 2] = 85*36/172 = 17.8
[2, 1] = 87*136/172 = 68.8
[2, 2] = 87*36/172 = 18.2
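The expected-value formula can be sketched in R as follows. This is a minimal illustration rather than code from the class files; the object names `observed` and `expected` are chosen here for the example.

```r
# Observed counts from the tuberculosis example
# (rows: gender, columns: test result)
observed <- matrix(c(70, 15,
                     66, 21),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("Female", "Male"),
                                   c("Negative", "Positive")))

# Expected count for each cell = row sum * column sum / total sum
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))
```

`outer()` multiplies every row sum by every column sum, so all four cells are computed in one step.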
Actual Versus Expected • With chi square we are testing whether our "actual" table is significantly different from our "expected" table

Actual: Results of Tuberculosis Test
Gender    Negative   Positive
Female    70         15
Male      66         21

Expected: Results of Tuberculosis Test
Gender    Negative   Positive
Female    67.2       17.8
Male      68.8       18.2
Calculating a Chi Square Statistic • Chi Square = sum for all cells of (observed - expected)^2 / expected

(70 - 67.21)^2 / 67.21 + (66 - 68.79)^2 / 68.79 + (15 - 17.79)^2 / 17.79 + (21 - 18.21)^2 / 18.21 = 1.094

Actual: Results of Tuberculosis Test
Gender    Negative   Positive
Female    70         15
Male      66         21

Expected: Results of Tuberculosis Test
Gender    Negative   Positive
Female    67.2       17.8
Male      68.8       18.2
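The same sum can be checked in R. This is a small sketch with object names chosen for illustration. Because it uses unrounded expected values, it gives about 1.0945, which differs slightly from the 1.094 obtained by hand with expected values rounded to two decimals.

```r
# Observed counts (rows: Female, Male; columns: Negative, Positive)
observed <- matrix(c(70, 15,
                     66, 21),
                   ncol = 2, byrow = TRUE)

# Expected counts if gender and test result were independent
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi square statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
print(chi_sq)
```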
Checking for Significance • Need to know the degrees of freedom • Degrees of freedom = (# of Rows - 1) x (# of Columns - 1) • For our example the degrees of freedom equal: • (2 - 1) x (2 - 1) = 1

Results of Tuberculosis Test
Gender    Negative   Positive
Female    70         15
Male      66         21
Finding a P-Value • Similar to the t-test, the chi square value and degrees of freedom combination determine a p-value • Can look up on a table or calculate with software • Can play with this link to understand the relationship
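As one way to "calculate with software", base R's `pchisq()` turns a chi square value and its degrees of freedom into a p-value. The sketch below plugs in the hand-calculated value from the tuberculosis example; the variable names are illustrative.

```r
chi_sq <- 1.094              # chi square value from the hand calculation
df <- (2 - 1) * (2 - 1)      # (# of rows - 1) x (# of columns - 1)

# Upper-tail probability: the chance of a chi square value at least this
# large if the two variables were really independent
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(p_value)
```

The result is roughly 0.30, well above 0.05.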
Running a Chi Square in R • tuberculosis_results.csv, our raw data, has the gender and test results of participants • Read in tuberculosis_results.csv and assign it to the object tuberculosis • To see the first 6 rows use head(tuberculosis)
Running a Chi Square in R • Now use the table() function to calculate frequencies and store them in tub.table
Running a Chi Square in R • Put tub.table (the table you just created) into the chisq.test() function • Put correct = F • This turns off a correction that makes it more difficult to achieve significance
Running a Chi Square in R • X-squared is the chi square value • Notice it is the same as what we calculated by hand • p-value is greater than 0.05 so our result is not significant
What Does This Mean • We cannot reject the null hypothesis that gender and effectiveness of the new tuberculosis drug are independent • When tested after 10 days, there is not a significant difference between the results in men and women
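The whole workflow can be sketched end to end as below. The class CSV is not reproduced here, so the data frame is rebuilt from the counts in the example table; the column names `gender` and `result` are assumptions and may not match the real file.

```r
# Stand-in for tuberculosis <- read.csv("tuberculosis_results.csv").
# The column names `gender` and `result` are assumed for illustration.
counts <- c(70, 15, 66, 21)
tuberculosis <- data.frame(
  gender = rep(c("Female", "Female", "Male", "Male"), times = counts),
  result = rep(c("Negative", "Positive", "Negative", "Positive"), times = counts)
)
print(head(tuberculosis))   # first 6 rows

# Frequency table of gender by test result
tub.table <- table(tuberculosis$gender, tuberculosis$result)
print(tub.table)

# Chi square test with the continuity correction turned off
tub.test <- chisq.test(tub.table, correct = FALSE)
print(tub.test)
```

The printed output reports X-squared, df, and the p-value. The individual pieces are also available as `tub.test$statistic`, `tub.test$parameter`, and `tub.test$p.value`.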
Tutorial Walkthrough • A/B Testing Example from MeasuringU: https://measuringu.com/statistically-significant/
A/B Testing: What Is It? • A/B testing is used by many companies to compare multiple websites • Example: Facebook uses this heavily • This example measures clicks • Randomly shows people one of two websites • Which website format has more clicks?
A/B Testing Example • 435 users were randomly sent to Website A or Website B • 18 out of 220 users (8%) clicked through on landing page A • 6 out of 215 users (3%) clicked through on landing page B • Is this difference statistically significant? • Use chi square to test this

Clicked Through Landing Page?
Version of Website   Click   No Click
Website 1            18      202
Website 2            6       209
Calculating Expected Values • For each cell, Expected Value = (Row Sum * Column Sum) / Total Sum

Clicked Through Landing Page?
Version of Website   Click   No Click   Totals
Website 1            18      202        220
Website 2            6       209        215
Totals               24      411        435

[1, 1] = 24*220/435 = 12.14
[1, 2] = 411*220/435 = 207.86
[2, 1] = 24*215/435 = 11.86
[2, 2] = 411*215/435 = 203.14
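As a cross-check on the hand calculation, the sketch below computes the expected counts from the formula and compares them with the `expected` component that `chisq.test()` returns. The object names `ab`, `by_hand`, and `from_r` are illustrative.

```r
# Observed A/B test counts (rows: website version, columns: click outcome)
ab <- matrix(c(18, 202,
               6, 209),
             ncol = 2, byrow = TRUE,
             dimnames = list(c("Website 1", "Website 2"),
                             c("Click", "No Click")))

# Row sum * column sum / total sum for every cell
by_hand <- outer(rowSums(ab), colSums(ab)) / sum(ab)

# chisq.test() also stores the expected counts it used
from_r <- chisq.test(ab, correct = FALSE)$expected

print(round(by_hand, 2))
print(round(from_r, 2))
```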
Calculating a Chi Square Statistic • Chi Square = sum for all cells of (observed - expected)^2 / expected

(18 - 12.14)^2 / 12.14 + (6 - 11.86)^2 / 11.86 + (202 - 207.86)^2 / 207.86 + (209 - 203.14)^2 / 203.14 = 6.06

Actual: Clicked Through Landing Page?
Version of Website   Click   No Click
Website 1            18      202
Website 2            6       209

Expected: Clicked Through Landing Page?
Version of Website   Click   No Click
Website 1            12.14   207.86
Website 2            11.86   203.14
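To see how much each cell contributes to the total, the calculation can be sketched in R like this. The object names are chosen for illustration.

```r
# Observed counts (rows: Website 1, Website 2; columns: Click, No Click)
observed <- matrix(c(18, 202,
                     6, 209),
                   ncol = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell: (observed - expected)^2 / expected
contributions <- (observed - expected)^2 / expected
print(round(contributions, 2))

# The chi square statistic is the sum of the four contributions
chi_sq <- sum(contributions)
print(round(chi_sq, 2))
```

Most of the statistic comes from the two "Click" cells, because their expected counts are small relative to the observed gap.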
Checking for Significance • Need to know the degrees of freedom • Degrees of freedom = (# of Rows - 1) x (# of Columns - 1) • For our example the degrees of freedom equal: • (2 - 1) x (2 - 1) = 1

Clicked Through Landing Page?
Version of Website   Click   No Click
Website 1            18      202
Website 2            6       209
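With the degrees of freedom in hand, the p-value for the A/B example can be looked up with `pchisq()`. This is a minimal sketch using the hand-calculated chi square value of 6.06; the variable names are illustrative.

```r
chi_sq <- 6.06               # chi square value from the hand calculation
n_rows <- 2
n_cols <- 2
df <- (n_rows - 1) * (n_cols - 1)

# Probability of a chi square value at least this large under independence
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(round(p_value, 3))
```

The p-value comes out at about 0.014, below the 0.05 cutoff.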
A/B Testing Example • Go to https://measuringu.com/ab-cal/ • Put in the metrics shown here from our example
Running a Chi Square in R • Create matrix with values from our A/B test • Specify number of columns with ncol = 2 • Fill in by row with byrow = TRUE • dimnames for row names and then column names
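A sketch of that matrix call is shown below. The name `ab.test.table` follows the slides; the row and column labels are taken from the example table and may differ from the labels used in class.

```r
# Values are listed row by row, so byrow = TRUE is needed
ab.test.table <- matrix(c(18, 202,
                          6, 209),
                        ncol = 2,
                        byrow = TRUE,
                        dimnames = list(c("Website 1", "Website 2"),  # row names
                                        c("Click", "No Click")))      # column names
print(ab.test.table)
```

Without `byrow = TRUE`, R fills the matrix column by column, which would put 202 in the "Website 2" / "Click" cell.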
Running a Chi Square in R • Put ab.test.table (the table you just created) into the chisq.test() function • Put correct = F
Running a Chi Square in R • X-squared is the chi square value • Notice it is the same as what we calculated by hand • p-value is less than 0.05 so our results are significant
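The test itself can be sketched as below. For comparison, the example also runs the test with `correct = TRUE`, which for 2 x 2 tables applies the continuity correction that `correct = F` turns off. The object names `ab.test` and `ab.test.corrected` are illustrative.

```r
ab.test.table <- matrix(c(18, 202,
                          6, 209),
                        ncol = 2, byrow = TRUE,
                        dimnames = list(c("Website 1", "Website 2"),
                                        c("Click", "No Click")))

# Chi square test without the continuity correction
ab.test <- chisq.test(ab.test.table, correct = FALSE)
print(ab.test)

# Same test with the continuity correction left on, for comparison
ab.test.corrected <- chisq.test(ab.test.table, correct = TRUE)
print(ab.test.corrected$p.value)
```

The corrected p-value is larger than the uncorrected one, which is the sense in which the correction makes significance harder to reach.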
What Does This Mean • We can reject the null hypothesis that the website version and clicks through to the landing page are independent • The website version was associated with a significant difference in clicks: landing page A had the higher click-through rate
A/B Testing Example • P-value = 0.014 • The two-tailed p-value is < 0.05 so it is statistically significant • If there were no real difference between the websites, we would expect to see a difference this large by chance only about 1.4% of the time, or about 14 times in 1000 • Play with the calculator • Observe how different numbers of successes change the p-value
Questions?