STAT 101, Dr. Kari Lock Morgan, 8/30/12
Collecting Data: Sampling
SECTION 1.2
• Sample versus Population
• Statistical Inference
• Sampling Bias
• Simple Random Sample
• Other Sources of Bias
Statistics: Unlocking the Power of Data, Lock5

Sample versus Population
• A population includes all individuals or objects of interest.
• A sample is all the cases that we have collected data on (a subset of the population).
• Statistical inference is the process of using data from a sample to gain information about the population.

The Big Picture
[Diagram labels: Population; Sampling; Sample; Statistical Inference]

Most Important to You
Which of the following is most important to you?
a) Athletics
b) Academics
c) Social Life
d) Community Service
e) Other

Most Important to You
• Suppose researchers studying student life at Duke use the results of our clicker question to investigate what Duke students find important.
• What is the sample?
• What is the population?
• Can the sample data be generalized to make inferences about the population? Why or why not?

Sampling Bias
• Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.
• If sampling bias exists, we cannot trust generalizations from the sample to the population.

Sampling
[Diagram labels: Population; Sample]
GOAL: Select a sample that is similar to the population, only smaller

Can you avoid sampling bias?
• The next slide shows Lincoln’s Gettysburg Address. The entire population, all words in his address, will be shown to you. What is the average word length?
• Your task: Select a sample of 10 words that resemble the overall address. Write them down.
• Calculate the average number of letters for the words in your sample.
• Place a dot above your sample average on the board.

Lincoln’s Gettysburg Address
“Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they here gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.”

Can you avoid sampling bias?
• Actual average: 4.29 letters
• People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias!
• We need a better way…

Random Sampling
• How can we make sure to avoid sampling bias? Take a RANDOM sample!
• Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample.
• More often, we use technology.

Random Sampling
• Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama.
• In the actual election, 53% voted for Obama.
• Random sampling is a very powerful tool!!!
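
A small simulation can illustrate why a random sample of this size tends to land close to the truth. The sketch below assumes a hypothetical population in which exactly 53% support the candidate, and treats each of the 2,847 respondents as an independent draw. The function name `simulate_poll`, the seed, and the number of repeated polls are illustrative choices, not a description of Gallup's methodology.

```python
import random


def simulate_poll(true_share, sample_size, rng):
    """Return the share of supporters in one simulated random sample."""
    supporters = sum(rng.random() < true_share for _ in range(sample_size))
    return supporters / sample_size


rng = random.Random(2008)  # arbitrary fixed seed so the run is repeatable
TRUE_SHARE = 0.53          # assumption: population share equals the election result
SAMPLE_SIZE = 2847         # sample size quoted on the slide

# Repeat the hypothetical poll 200 times and look at the spread of results.
polls = [simulate_poll(TRUE_SHARE, SAMPLE_SIZE, rng) for _ in range(200)]

print(f"smallest poll result: {min(polls):.3f}")
print(f"largest poll result:  {max(polls):.3f}")
```

Under these assumptions, the simulated polls should all fall within a few percentage points of 53%, which is consistent with the 52% reported on the slide.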

“Random” Numbers
1. Pick 10 “random” numbers between 1 and 268. Write these numbers down. (Note: When choosing a real sample, you should use technology to generate random numbers. This is simply for illustrative purposes in class.)
2. Using the next slide, calculate the average number of letters in the words corresponding to your random numbers.
3. Place a dot below this average on the board.
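
For a real sample, the note in step 1 says to let technology pick the numbers. One possible way to do that, sketched here with Python's standard `random` module (the variable names are illustrative):

```python
import random

NUM_WORDS = 268    # the words on the next slide are numbered 1 to 268
SAMPLE_SIZE = 10

# random.sample draws without replacement, so no word number can repeat.
word_numbers = sorted(random.sample(range(1, NUM_WORDS + 1), SAMPLE_SIZE))
print(word_numbers)
```

Each run prints a different set of 10 word numbers, because no seed is fixed.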

Gettysburg Address: numbered words (1 to 268)
1 Four  2 score  3 and  4 seven  5 years  6 ago,  7 our  8 fathers  9 brought  10 forth
11 upon  12 this  13 continent  14 a  15 new  16 nation:  17 conceived  18 in  19 liberty,  20 and
21 dedicated  22 to  23 the  24 proposition  25 that  26 all  27 men  28 are  29 created  30 equal.
31 Now  32 we  33 are  34 engaged  35 in  36 a  37 great  38 civil  39 war,  40 testing
41 whether  42 that  43 nation,  44 or  45 any  46 nation  47 so  48 conceived  49 and  50 so
51 dedicated,  52 can  53 long  54 endure.  55 We  56 are  57 met  58 on  59 a  60 great
61 battlefield  62 of  63 that  64 war.  65 We  66 have  67 come  68 to  69 dedicate  70 a
71 portion  72 of  73 that  74 field  75 as  76 a  77 final  78 resting  79 place  80 for
81 those  82 who  83 here  84 gave  85 their  86 lives  87 that  88 that  89 nation  90 might
91 live.  92 It  93 is  94 altogether  95 fitting  96 and  97 proper  98 that  99 we  100 should
101 do  102 this.  103 But,  104 in  105 a  106 larger  107 sense,  108 we  109 cannot  110 dedicate,
111 we  112 cannot  113 consecrate,  114 we  115 cannot  116 hallow  117 this  118 ground.  119 The  120 brave
121 men,  122 living  123 and  124 dead,  125 who  126 struggled  127 here  128 have  129 consecrated  130 it,
131 far  132 above  133 our  134 poor  135 power  136 to  137 add  138 or  139 detract.  140 The
141 world  142 will  143 little  144 note,  145 nor  146 long  147 remember,  148 what  149 we  150 say
151 here,  152 but  153 it  154 can  155 never  156 forget  157 what  158 they  159 did  160 here.
161 It  162 is  163 for  164 us  165 the  166 living,  167 rather,  168 to  169 be  170 dedicated
171 here  172 to  173 the  174 unfinished  175 work  176 which  177 they  178 who  179 fought  180 here
181 have  182 thus  183 far  184 so  185 nobly  186 advanced.  187 It  188 is  189 rather  190 for
191 us  192 to  193 be  194 here  195 dedicated  196 to  197 the  198 great  199 task  200 remaining
201 before  202 us,  203 that  204 from  205 these  206 honored  207 dead  208 we  209 take  210 increased
211 devotion  212 to  213 that  214 cause  215 for  216 which  217 they  218 gave  219 the  220 last
221 full  222 measure  223 of  224 devotion,  225 that  226 we  227 here  228 highly  229 resolve  230 that
231 these  232 dead  233 shall  234 not  235 have  236 died  237 in  238 vain,  239 that  240 this
241 nation,  242 under  243 God,  244 shall  245 have  246 a  247 new  248 birth  249 of  250 freedom,
251 and  252 that  253 government  254 of  255 the  256 people,  257 by  258 the  259 people,  260 for
261 the  262 people,  263 shall  264 not  265 perish  266 from  267 the  268 earth.

Lincoln’s Gettysburg Address
[Figure panels: Fall 2011; Spring 2012]

Random vs Non-Random Sampling
• Random samples have averages that are centered around the correct number.
• Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number.
• Only random samples can truly be trusted when making generalizations to the population!
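
The contrast between the two kinds of samples can be sketched with a short simulation. To keep the example small, the population here is only the first sentence of the address (30 words), not all 268 words. The biased scheme, in which a word's chance of being picked is proportional to its length, is an assumption meant to mimic eyes being drawn to longer words; it is not a measurement of how students actually chose.

```python
import random
from statistics import mean

SENTENCE = ("Four score and seven years ago our fathers brought forth, on this "
            "continent, a new nation, conceived in Liberty, and dedicated to the "
            "proposition that all men are created equal.")

# Population: the length of every word, with commas and periods stripped.
lengths = [len(word.strip(".,")) for word in SENTENCE.split()]
population_mean = mean(lengths)

rng = random.Random(1863)  # arbitrary fixed seed so the run is repeatable
REPS, N = 2000, 10

# Simple random samples: every word equally likely, drawn without replacement.
random_means = [mean(rng.sample(lengths, N)) for _ in range(REPS)]

# Biased samples (assumption): longer words are proportionally more likely
# to be chosen. rng.choices draws with replacement.
biased_means = [mean(rng.choices(lengths, weights=lengths, k=N))
                for _ in range(REPS)]

print(f"population average:         {population_mean:.2f}")
print(f"center of random averages:  {mean(random_means):.2f}")
print(f"center of biased averages:  {mean(biased_means):.2f}")
```

In this sketch the random-sample averages should cluster around the population average, while the length-weighted averages should sit noticeably higher.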

Bowl of Soup Analogy
Think of tasting a bowl of soup…
• Population = entire bowl of soup
• Sample = whatever is in your tasting bites
• If you take bites non-randomly from the soup (if you stab with a fork, or prefer noodles to vegetables), you may not get a very accurate representation of the soup.
• If you take bites at random, only a few bites can give you a very good idea of the overall taste of the soup.

Simple Random Sample
• In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample.
• More complicated random sampling schemes exist, but will not be covered in this course.
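
The "same chance" property can be checked empirically. The sketch below uses a hypothetical population of 20 units labelled 0 to 19, draws many simple random samples of size 5, and records how often each unit is selected. Under these assumed sizes, each unit should appear in roughly 5/20 = 25% of the samples.

```python
import random
from collections import Counter

rng = random.Random(101)       # arbitrary fixed seed so the run is repeatable
POPULATION = list(range(20))   # hypothetical units labelled 0..19
N = 5                          # size of each simple random sample
REPS = 20000                   # number of samples drawn

# Count how many samples each unit ends up in.
counts = Counter()
for _ in range(REPS):
    counts.update(rng.sample(POPULATION, N))

# Fraction of samples that included each unit.
inclusion_rates = {unit: counts[unit] / REPS for unit in POPULATION}
for unit, rate in inclusion_rates.items():
    print(f"unit {unit:2d}: selected in {rate:.3f} of samples")
```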

Realities of Sampling
• While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.
• Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled from.
• In practice, think hard about potential sources of sampling bias, and try your best to avoid them.

Non-Random Samples
Suppose you want to estimate the average number of hours that Duke students spend studying each week. Which of the following is the best method of sampling?
(a) Go to the library and ask all the students there how much they study
(b) Email all Duke students asking how much they study, and use all the data you get
(c) Give a clicker question in STAT 101 and force every student to respond
(d) Stand outside the Bryan Center and ask everyone going in how much they study

Bad Methods of Sampling
• Sampling units based on something obviously related to the variable(s) you are studying
    • Sampling only students in the library when asking how much they study, or sampling only students taking a statistics class
    • “Today’s Poll” on fitnessmagazine.com asked “Have you ever hired a personal trainer?” 27% of respondents said “yes”. Can we infer that 27% of all humans have hired a personal trainer?

Bad Methods of Sampling
• Letting your sample consist of whoever chooses to participate (volunteer bias)
• People who choose to participate or respond are probably not representative of the entire population
    • Emailing or mailing the entire population, and then drawing conclusions about the population based on whoever chooses to respond
    • Example: An airline emails all of its customers asking them to rate their satisfaction with their recent travel
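
Volunteer bias in the airline example can be sketched with a simulation. Every number below is a made-up assumption for illustration: 80% of customers are satisfied, and dissatisfied customers are assumed to be six times as likely to answer the email (30% versus 5%). None of these figures comes from a real airline survey.

```python
import random

rng = random.Random(42)         # arbitrary fixed seed so the run is repeatable
N_CUSTOMERS = 100_000           # hypothetical customer list
TRUE_SATISFIED = 0.80           # assumption: share of satisfied customers
RESPOND_IF_SATISFIED = 0.05     # assumption: response rate when satisfied
RESPOND_IF_UNSATISFIED = 0.30   # assumption: response rate when dissatisfied

# True = satisfied customer, False = dissatisfied customer.
customers = [rng.random() < TRUE_SATISFIED for _ in range(N_CUSTOMERS)]

# Volunteer sample: whoever chooses to answer the email.
respondents = [
    satisfied for satisfied in customers
    if rng.random() < (RESPOND_IF_SATISFIED if satisfied else RESPOND_IF_UNSATISFIED)
]

# Simple random sample of 1,000 customers, all of whom are assumed to answer.
srs = rng.sample(customers, 1000)

population_rate = sum(customers) / len(customers)
respondent_rate = sum(respondents) / len(respondents)
srs_rate = sum(srs) / len(srs)

print(f"satisfied in population:        {population_rate:.2f}")
print(f"satisfied among volunteers:     {respondent_rate:.2f}")
print(f"satisfied in the random sample: {srs_rate:.2f}")
```

Under these assumptions the volunteers should look far less satisfied than the customer base as a whole, while the random sample should stay close to the population rate.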

Alcohol, Marijuana, and Driving
• The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance.
• Volunteers who responded to advertisements for the study on rock radio stations were given a random combination of the two drugs, then their performance was observed.
    • What is the sample?
    • What is the population?
    • Is there sampling bias?
    • Will the results be informative and/or do you think the study is worth conducting?
Source: Chesher, G., Dauncey, H., Crawford, J., and Horn, K., “The Interaction between Alcohol and Marijuana: A Dose Dependent Study on the Effects of Human Moods and Performance Skills,” Report No. C 40, Federal Office of Road Safety, Federal Department of Transport, Australia, 1986.

Papers
• Note: The original sources for the studies are provided and linked when possible. If you are interested in the details of a study, please check out the original article!

Data Collection and Bias
[Diagram labels: Population; Sampling Bias?; Sample; Other forms of bias?; DATA]

Other Forms of Bias
• Even with a random sample, data can still be biased, especially when collected on humans.
• Other forms of bias to watch out for in data collection:
    • Question wording
    • Context
    • Inaccurate responses
    • Many other possibilities: examine the specifics of each study!

Question Wording
• A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”
    Tax Cut: 60%   Programs: 40%
• A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”
    Tax Cut: 22%   Programs: 78%

Context
• Ann Landers’ column asked readers “If you had it to do over again, would you have children?”
• The first request for data contained a letter from a young couple which listed worries about parenting and various reasons not to have kids ⇒ 30% said “yes”
• The second request for data was in response to this number, in which Ann wrote how she was “stunned, disturbed, and just plain flummoxed” ⇒ 95% said “yes”

Having Children
If we were to run the question all by itself in the newspaper with a request for responses, could we trust the results?
(a) Yes
(b) No
This would suffer from volunteer bias. We need a random sample.

Having Children
Newsday took a random sample of all US adults and asked them the same question, without any additional leading material ⇒ 91% said “yes”
Do you think the true proportion of US parents who are happy they had children is close to 91%?
a) Yes
b) No
Because this is a random sample, the population proportion should be close to the sample proportion.

Inaccurate Responses
• In a study on US students, 93% of the sample said they were in the top half of the sample regarding driving skill.
Svenson, O. (February 1981). “Are we all less risky and more skillful than our fellow drivers?” Acta Psychologica 47 (2): 143–148.

Summary
• Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences.
• This is the easiest way to instantly become a more statistically literate individual!

To Do
• Read Section 1.2
• Complete the class survey (due Tuesday, 9/4)
• If you haven’t already…
    • Get the textbook
    • Get a clicker and register it
    • Do Lab 0 TODAY