Unit 5: Inference for categorical data
1. Inference for a single proportion
Sta 101 - Fall 2018, Duke University, Department of Statistical Science
Dr. Ellison
Slides posted at https://www2.stat.duke.edu/courses/Fall18/sta101.001/index.html
Outline
1. Housekeeping
2. Main ideas
   1. The CLT also describes the distribution of p̂
   2. CI vs. HT determines observed vs. expected counts / proportions
   3. Only use CLT-based methods if the sample size is large enough for a nearly normal sampling distribution
3. Applications
   1. Single population proportion, large sample
   2. Single population proportion, small sample
4. Recap
5. Summary
Announcements
• Problem Set 4 due today, October 24, at 11:55 pm
• Project proposal due Thursday, October 25, before your lab section
• Performance Assessment 4 due Sunday, October 28 (opens today)
Distribution of p̂
Conditions:
▶ Independence:
   a. Random sample/assignment
   b. 10% rule
▶ Shape/Skew:
   a. "Success/Failure Conditions": at least 10 successes and 10 failures (i.e., np ≥ 10 and n(1 - p) ≥ 10)
Clicker question
Suppose p = 0.05. What shape does the distribution of p̂ have in random samples of size n = 100?
(a) unimodal and symmetric (nearly normal)
(b) bimodal and symmetric
(c) right skewed
(d) left skewed
Clicker question
Suppose p = 0.5. What shape does the distribution of p̂ have in random samples of size n = 100?
(a) unimodal and symmetric (nearly normal)
(b) bimodal and symmetric
(c) right skewed
(d) left skewed
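The two clicker questions can be explored by checking the success/failure condition and then simulating p̂ directly. The following is a minimal sketch in base R; the helper name `check_success_failure`, the seed, and the number of simulations are illustrative assumptions, not part of the course materials.

```r
# Success/failure check for the sampling distribution of p-hat.
# check_success_failure() is a hypothetical helper written for this sketch.
check_success_failure <- function(n, p) {
  successes <- n * p        # expected number of successes
  failures  <- n * (1 - p)  # expected number of failures
  list(successes = successes,
       failures  = failures,
       met       = successes >= 10 && failures >= 10)
}

check_success_failure(n = 100, p = 0.05)$met  # FALSE: only 5 expected successes
check_success_failure(n = 100, p = 0.5)$met   # TRUE: 50 expected successes and failures

# Simulate many sample proportions to look at the shape directly.
set.seed(101)  # arbitrary seed for reproducibility
p_hat_small <- rbinom(10000, size = 100, prob = 0.05) / 100
p_hat_half  <- rbinom(10000, size = 100, prob = 0.5) / 100

hist(p_hat_small, main = "p = 0.05, n = 100")
hist(p_hat_half,  main = "p = 0.5, n = 100")
```

Comparing the two histograms shows how the shape changes when the expected number of successes falls below 10.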
CI vs. HT determines observed vs. expected counts / proportions
Problem: in confidence intervals and hypothesis tests, we don't know what p is.
Distribution of p̂
Conditions:
▶ Independence:
   a. Random sample/assignment
   b. 10% rule
▶ Shape/Skew:
   a. At least 10 successes and 10 failures (i.e., np ≥ 10 and n(1 - p) ≥ 10)
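CI vs. HT determines observed vs. expected counts / proportions
Answer: for a confidence interval, check the conditions with the observed counts (use p̂); for a hypothesis test, check them with the counts expected under H0 (use the null value p0).
Interpreting…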
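The observed-versus-expected distinction can be sketched in base R. This assumes the usual convention that a confidence interval uses the observed proportion p̂, while a hypothesis test uses the null value p0. The sample size and count below are made-up numbers for illustration, not class data.

```r
# Hypothetical numbers for illustration only (not class data).
n     <- 200    # sample size
x     <- 24     # observed number of successes
p_hat <- x / n  # observed sample proportion
p0    <- 0.08   # null value for the hypothesis test

# Confidence interval: p is unknown, so use OBSERVED counts and p_hat.
ci_condition_met <- x >= 10 && (n - x) >= 10
se_ci <- sqrt(p_hat * (1 - p_hat) / n)
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci

# Hypothesis test: assume H0 is true, so use EXPECTED counts and p0.
ht_condition_met <- n * p0 >= 10 && n * (1 - p0) >= 10
se_ht   <- sqrt(p0 * (1 - p0) / n)
z       <- (p_hat - p0) / se_ht
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value
```

The two standard errors differ because each is built from a different proportion, even though the sample is the same.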
Simulation vs. theoretical inference
Conditions:
▶ Independence:
   a. Random sample/assignment
   b. 10% rule
▶ Shape/Skew:
   a. "Success/Failure Conditions": at least 10 successes and 10 failures (i.e., np ≥ 10 and n(1 - p) ≥ 10)
Condition met: CLT-based (theoretical) methods
Condition NOT met: hypothesis test with a randomization test (simulation methods)
Application exercise: App Ex 5.1
See course website for details.
Clicker question
Are you vegetarian or vegan?
(a) Yes, I am vegetarian or vegan
(b) No, I am neither vegetarian nor vegan
Clicker question
A variety of studies suggest that 8% of college students are vegetarian or vegan. Assuming that this class is a representative sample of Duke students, which of the following is the correct set of hypotheses for testing whether the proportion of Duke students who are vegetarian or vegan is different from the proportion of vegetarian or vegan college students at large?
Simulate by hand
Describe a simulation scheme for this hypothesis test.
Goal:
1. Create an approximation of the sampling distribution that assumes H0 is true.
2. Calculate the p-value using this approximate sampling distribution.
Example: H0: p = 8/100 = 0.08
▶ 100 chips in a bag: 8 green (veg), 92 white (non-veg).
▶ Sample randomly n times from the bag, with replacement (n = observed sample size).
▶ Calculate p̂, the proportion of greens (successes) in the random sample of size n, and record this value.
▶ Repeat many times.
▶ Calculate the proportion of simulations where p̂ is at least as different from 0.08 (the null value) as the observed sample proportion.
Simulate by hand
Example: with 50 simulated samples, p-value = ? / 50, where the numerator is the number of simulations in which p̂ is at least as far from the null value as the observed sample proportion.
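The chips-in-a-bag scheme can be sketched in base R. This is a minimal illustration, not the course's own implementation: the class size `n`, the count `n_veg`, the seed, and the number of simulations are hypothetical placeholders.

```r
set.seed(2018)   # arbitrary seed for reproducibility

p0    <- 0.08    # null value from H0: p = 0.08
n     <- 60      # hypothetical class size (not class data)
n_veg <- 9       # hypothetical number of vegetarians/vegans
p_obs <- n_veg / n
n_sim <- 10000   # number of simulated samples

# Each simulation draws n chips with replacement from a bag that is 8% green.
p_sim <- replicate(n_sim, {
  draws <- sample(c("veg", "nonveg"), size = n, replace = TRUE,
                  prob = c(p0, 1 - p0))
  sum(draws == "veg") / n   # simulated sample proportion
})

# Two-sided p-value: share of simulations at least as far from p0 as p_obs.
p_value <- mean(abs(p_sim - p0) >= abs(p_obs - p0))
p_value
```

Using `prob = c(p0, 1 - p0)` is equivalent to drawing with replacement from a bag of 8 green and 92 white chips.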
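HT in R
# inference() is assumed here to come from the statsr package
library(statsr)

# Counts from the in-class clicker question
n_veg = [fill in based on class data]
n_nonveg = [fill in based on class data]

# One row per student: "yes" for vegetarian/vegan, "no" otherwise
sta101 = data.frame(veg = c(rep("yes", n_veg), rep("no", n_nonveg)))

# Simulation-based two-sided test of H0: p = 0.08
inference(y = veg, data = sta101, success = "yes",
          statistic = "proportion", type = "ht",
          null = 0.08, alternative = "twosided",
          method = "simulation")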
Bootstrap interval for a single proportion
How would the simulation scheme change for a bootstrap confidence interval for the proportion of Duke students who are vegetarians?
▶ Replace the bag of 100 chips built from the null value (8 green, 92 white) with the original sample.
▶ Sample randomly n times from the original sample, with replacement (n = observed sample size).
▶ Calculate p̂, the proportion of vegetarians (successes) in the random sample of size n, and record this value.
▶ Repeat many times.
▶ Find a confidence interval using the Percentile Method or the SE Method.
Simulate by hand
▶ Find a confidence interval using the Percentile Method or the SE Method.
88% confidence interval: (4.2%, 12.4%)
Percentile Method: keep the middle 44 = 0.88 × 50 points.
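Both interval methods can be sketched in base R. The counts, seed, and number of resamples below are hypothetical, and the SE method here uses a normal critical value, which is an assumption about how the slide's SE Method is computed.

```r
set.seed(2018)   # arbitrary seed for reproducibility

n     <- 60      # hypothetical class size (not class data)
n_veg <- 9       # hypothetical number of vegetarians/vegans
original <- c(rep("yes", n_veg), rep("no", n - n_veg))
p_hat  <- n_veg / n
n_boot <- 10000  # number of bootstrap resamples

# Resample the ORIGINAL sample with replacement and record p-hat each time.
p_boot <- replicate(n_boot, {
  resample <- sample(original, size = n, replace = TRUE)
  mean(resample == "yes")
})

conf_level <- 0.88
alpha <- 1 - conf_level

# Percentile method: keep the middle 88% of the bootstrap proportions.
ci_percentile <- quantile(p_boot, probs = c(alpha / 2, 1 - alpha / 2))

# SE method: point estimate plus/minus critical value times bootstrap SE.
z_star <- qnorm(1 - alpha / 2)  # normal critical value (an assumption)
ci_se  <- p_hat + c(-1, 1) * z_star * sd(p_boot)
```

With 50 hand simulations, the percentile method amounts to dropping the 3 smallest and 3 largest values and reporting the range of the remaining 44.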
CI in R
# Bootstrap confidence interval for the proportion, using the SE method
inference(y = veg, data = sta101, success = "yes",
          statistic = "proportion", type = "ci",
          method = "simulation", boot_method = "se")
Recap on CLT-based methods
▶ Calculating the necessary sample size for a CI with a given margin of error:
   – Option 1: If there is a previous study, use p̂ from that study.
   – Option 2: If not, use p̂ = 0.5:
      • if you don't know any better, 50-50 is a good guess
      • p̂ = 0.5 gives the most conservative estimate (the highest possible sample size)
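The sample size calculation follows from setting the margin of error, z* × sqrt(p(1 - p) / n), to the target value and solving for n. A minimal sketch in base R; the function name `required_n` and the 3% margin are illustrative assumptions.

```r
# Smallest n whose margin of error is at most `margin`.
# required_n() is a hypothetical helper written for this sketch.
required_n <- function(margin, conf_level = 0.95, p_guess = 0.5) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # critical value
  ceiling(z_star^2 * p_guess * (1 - p_guess) / margin^2)  # always round up
}

# Option 2: no previous study, so use the conservative guess of 0.5.
required_n(margin = 0.03)                  # 1068

# Option 1: a previous study suggests a proportion near 0.08.
required_n(margin = 0.03, p_guess = 0.08)
```

Because p(1 - p) is largest at p = 0.5, any other guess yields a smaller required sample size.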
Summary of main ideas
1. The CLT also describes the distribution of p̂
2. CI vs. HT determines observed vs. expected counts / proportions
3. Only use CLT-based methods if the sample size is large enough for a nearly normal sampling distribution