- Slides: 55
STT 200 STATISTICAL METHODS Chapter 1: Introduction to Data
Section 1.1: Case Study: Dolphin Therapy
■ Is swimming with dolphins therapeutic for patients suffering from clinical depression?
■ Researchers recruited 30 subjects aged 18-65 with a clinical diagnosis of mild to moderate depression (Antonioli & Reveley, 2005).
■ Subjects were required to discontinue use of any antidepressant drugs or psychotherapy four weeks prior to the experiment, and throughout the experiment.
■ These 30 subjects went to an island off the coast of Honduras, where they were randomly assigned to one of two treatment groups.
■ Both groups engaged in the same amount of swimming and snorkeling each day (the outdoor nature program), but one group did so in the presence of bottlenose dolphins and the control group did not.
■ Each subject’s level of depression was evaluated at the beginning of the study and then again at the end.
■ In the dolphin therapy group, 10 of the 15 participants showed substantial improvement, while in the control group, 3 of the 15 participants showed substantial improvement.
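Organize the information in a contingency table

Treatment          Showed substantial improvement   Did not show substantial improvement   Total
Dolphin therapy    ________                         ________                               ________
Control group      ________                         ________                               ________
Total              ________                         ________                               ________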
Dolphin Therapy questions
a. What proportion of patients receiving dolphin therapy showed substantial improvement?
b. What proportion of patients in the control group showed substantial improvement?
c. Does the dolphin therapy appear to be more effective?
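The proportions asked for in (a) and (b) can be checked with a short calculation. The sketch below is only an illustration: the counts come from the case study (10 of 15 and 3 of 15), while the dictionary layout and the function name `improvement_proportion` are assumptions made for this example.

```python
# Counts from the case study: 10 of 15 improved with dolphins, 3 of 15 in the control group.
# The dictionary layout and names are illustrative assumptions, not part of the study.
table = {
    "Dolphin therapy": {"improved": 10, "not_improved": 5},
    "Control group": {"improved": 3, "not_improved": 12},
}


def improvement_proportion(row):
    """Proportion of a group's subjects who showed substantial improvement."""
    total = row["improved"] + row["not_improved"]
    return row["improved"] / total


proportions = {group: improvement_proportion(row) for group, row in table.items()}

for group, p in proportions.items():
    print(f"{group}: {p:.3f}")
```

Comparing the two proportions is a starting point for question (c); whether the difference is larger than chance alone would produce is a separate question that this arithmetic does not settle.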
Section 1.2: Data Basics
Ames Dataset
Each row in the table represents a single home or __________.
Each column records a characteristic of this observation, called a __________.
Together, this table represents a _________, which is used to organize raw data.

Values as listed for each column:
Overall Quality: 7 8 4 6 7 7 9 9 3 6 7 7 5 5 4 5 7 7 5 5 7 6 7 6 6 8 5 8 7 7 8 5 5 7 6 5 7 8 4 8 5 7
Exterior Material: Vinyl Siding, Wood Siding, Brick Face, Vinyl Siding, Vinyl Siding, Metal Siding, Vinyl Siding, Wood Siding, Metal Siding, Hard Board, Plywood, Metal Siding, Vinyl Siding, Wood Siding, Plywood, Vinyl Siding, Wood Siding, Vinyl Siding, Hard Board, Cement Board, Hard Board, Wood Siding, Vinyl Siding, Hard Board, Vinyl Siding, Wood Siding, Metal Siding, Vinyl Siding
Total Rooms: 7 4 6 7 6 5 5 7 6 7 8 10 8 8 4 5 4 6 6 6 8 6 7 6 6 7 5 7 8 4 6 8 8 5 9 6 5 6 7 4 10 7 6
Living Area: 1749 1235 1040 1677 1314 1425 1419 1746 1183 1636 1809 2495 1803 1505 936 884 1055 1446 1009 1551 2161 1329 2064 1162 1626 1614 1309 1604 1818 879 1595 1659 2050 1646 1656 1056 2049 1092 1052 1232 1872 672 2358 1430 1145
Sale Price: 196000 214000 90000 190500 207500 193000 290000 377500 120000 191000 209700 184000 155000 145000 111900 144000 142500 175000 160000 137000 230500 172000 235000 170000 154000 245700 181000 187500 281500 80000 221500 181000 154000 246000 135000 145000 210900 105500 138500 194500 261500 72000 329900 173500 160000
Think about it!
■ Could you compute the “AVERAGE LIVING AREA” for these 45 homes?
■ Could you compute the “AVERAGE EXTERIOR MATERIAL” for these 45 homes?
■ LIVING AREA is said to be a ________ variable.
■ EXTERIOR MATERIAL is a ________ variable.
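The contrast between the two questions can be illustrated in code. This is a minimal sketch using only the first five values listed for the Living Area and Exterior Material columns of the Ames table; the variable names are assumptions chosen for this example.

```python
from collections import Counter
from statistics import mean

# First five values listed for each column in the Ames table.
living_area = [1749, 1235, 1040, 1677, 1314]
exterior_material = [
    "Vinyl Siding", "Wood Siding", "Brick Face", "Vinyl Siding", "Vinyl Siding",
]

# An average is meaningful for Living Area because its values are numbers.
average_area = mean(living_area)

# Exterior Material has labels, not numbers, so we count how often each level occurs.
material_counts = Counter(exterior_material)

print("Average living area:", average_area)
print("Exterior material counts:", dict(material_counts))
```

Averaging makes sense for the first list, while the second can only be summarized by counting how many homes fall in each category.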
Types of Variables
All variables are either numerical or categorical. We can sort variables further into subgroups.
all variables
  numerical
    discrete
    continuous
  categorical
    regular (nominal)
    ordinal
Numerical variables: discrete
■ Numerical variables can be divided further into two groups: discrete and continuous.
Discrete numerical variables can only take on certain values within an interval, ‘jumping’ from one value to the next.
Ex. The number of siblings you have could be ‘2’ or ‘3’, but could not be any of the values in between these two possibilities.
Numerical variables: continuous
■ Numerical variables can be divided further into two groups: discrete and continuous.
Continuous numerical variables can take on any value within an interval.
Ex. The weight of your youngest sibling [when they were just born] could be 7 lbs or 9 lbs, and it could also be any value in between these two possibilities.
Categorical variables: nominal (regular)
■ Categorical variables can be divided further into two groups: general (nominal, or regular) and ordinal.
General [nominal, or regular] categorical variables simply assign an observation to a particular group or category. Each possible group a categorical variable can take on is called a level of that variable.
Ex. The brand of soda you order with a take-out meal {Coke, Diet Coke, Sprite, etc.}
Ordinal variables
■ Categorical variables can be divided further into two groups: general (nominal, or regular) and ordinal.
Some categorical variables have levels that follow a logical order. These are referred to as ordinal variables.
Ex. The size of soda you order with a take-out meal {Small, Medium, Large, Bathtub, etc.}
Ames dataset variables
■ The Ames dataset has an example of each of these variable types (plus an extra of one of them!)

Variable Name        Variable Type
Overall Quality      ________
Exterior Material    ________
Total Rooms          ________
Living Area          ________
Sale Price           ________
Example: Types of Variables
Consider each of the variables listed and identify it as either numerical discrete, numerical continuous, regular categorical, or ordinal.
– Your income (in USD)
– Your typical classroom seat location (Front, Middle, Back)
– Number of contacts saved to your cell phone
– Area code in which you were born
– Time spent studying material for this class in the last 24-hour period (in hours)
Section 1.3: Data Collection Principles
■ How is global warming influencing the health of coral reefs in the Pacific Ocean?
■ Over the last 5 years, what is the average time to complete a degree for MSU undergraduate students?
■ Does a new drug reduce the number of deaths in patients with severe heart disease?
Each question refers to a _________. Instead of measuring every item in the population, we take a _________.
Sampling from populations
Consider each of the following samples:
■ Last year was the warmest year on record at a local Hawaiian beach resort, and the nearby coral reef was completely dead. Thus, rising temperatures must be damaging the health of coral reefs in the Pacific.
■ I met two MSU students who took more than 7 years to graduate from MSU, so it must take longer to graduate at MSU than at other colleges.
■ My friend’s dad had a heart attack and died after they gave him a new heart disease drug, so the drug must not work.
Each of these conclusions is based on ___________________.
Key Idea: Golden Rule for Data Collection
In order to draw inferences about a population from a sample, the sample must be _________ of the population it came from.
The best way to ensure a sample is representative is to ensure the observations that comprise it were selected _________.
Representative samples allow us to __________ the results from the sample to the population.
Observational Study vs. Experiments
■ Observational studies refer to instances where researchers collect data in a way that does not directly interfere with how the data arise.
■ Experiments refer to instances in which researchers directly influence the process by which data arise. Usually, this involves assigning subjects to one or more treatments.
Observational Study vs. Experiments
■ We’ll soon learn key distinguishing factors between observational studies and experiments:
  – Assignment of explanatory variable
    ■ Experiments: researchers decide
    ■ Observational studies: subjects decide
  – Causal inference
    ■ Observational studies: Don’t do it!
    ■ Experiments: It might be ok, but replicate it first!
Simple Random Sampling
■ A simple random sample of n observations from a population is one in which each possible sample of that size _____________________.
■ Simple random sampling is the best way of ensuring that your sample is ______________________ it is chosen from.
■ Sampling this way allows us to generalize our results from the small sample to the larger population.
Example: MLB player salaries
Consider the salaries of Major League Baseball (MLB) players, where each player is a member of one of 30 teams.
■ To conduct a simple random sample, we might write last year’s salary for each of the MLB players on a scrap of paper and randomly jumble them in a bag.
■ Thereafter, we could blindly pull a sample of n of these paper scraps out of the bag.
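The bag-of-paper-scraps procedure can be mimicked in code. The sketch below is a hedged illustration only: the salary figures are made up, and the population size of 750 and sample size of 50 are assumptions for this example. Python's `random.sample` draws without replacement, which is the software analogue of blindly pulling scraps from the bag.

```python
import random

# Hypothetical population: made-up salaries, one per player (not real MLB data).
rng = random.Random(200)  # fixed seed so the illustration is repeatable
salaries = [rng.randint(500_000, 30_000_000) for _ in range(750)]


def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)


srs = simple_random_sample(salaries, 50, rng)
print("Sample size:", len(srs))
print("Sample mean salary:", sum(srs) / len(srs))
```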
Stratified Sampling
■ In stratified sampling, the population is divided into groups (or strata), such that each stratum shares some common characteristic.
■ This method works well when there is a lot of variation between strata, but not much variation within each one.
■ Example: political survey seeking potential voters
Remember: we want to stratify so that observations in each stratum _________ with respect to the variable of interest. By age? By gender? By race?
Example: MLB Salaries
■ Instead of sampling blindly from our population of MLB salaries, we could categorize (or stratify) the salaries according to the position played by each player (pitcher, short-stop, catcher, etc.).
■ Alternatively, we could stratify the salaries according to players’ teams.
Ideally, we’d like to choose a way of stratifying the population such that all the observations in each stratum _______________________ with respect to the outcome of interest.
Stratified Sampling
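Stratifying by position, as in the MLB example, might be sketched as follows. Everything here is an assumption for illustration: the 90 placeholder players, the three positions, and the choice of five players per stratum. The sketch takes a simple random sample within each stratum, which is one common way to carry out stratified sampling.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Group units into strata, then take a simple random sample inside each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


# Hypothetical players: (name, position) pairs, 30 per position.
rng = random.Random(200)
positions = ["pitcher", "catcher", "short-stop"]
players = [(f"player{i}", positions[i % 3]) for i in range(90)]

stratified = stratified_sample(players, lambda p: p[1], 5, rng)
print(stratified)
```

Every position is guaranteed to appear in the sample, which a simple random sample of the same size would not guarantee.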
Cluster Sampling
■ The population is grouped into clusters that are convenient to sample.
■ Then, randomly sample some of the clusters and observe all the elements in the chosen clusters.
■ This method works best when there is a lot of variability within each cluster, and not a lot of variability between clusters.
■ Example: Sampling school students in a given school district
■ Example: Forestry
Example: Michigan elementary schools
■ Suppose we’d like to estimate the proportion of Michigan elementary school students who eat school-provided lunches.
■ Each elementary school can be considered a cluster of observations from the population of interest.
■ Using a cluster sampling method, we would randomly select a sample of schools, then survey all the students in each of the selected schools.
Cluster Sampling
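The school example can be sketched in code. This is an illustration under stated assumptions: ten placeholder schools with twenty placeholder students each, and three schools selected. The randomness applies to whole clusters, and every student in a chosen school is then surveyed.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then keep every unit inside the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical schools: 10 schools with 20 students each.
rng = random.Random(200)
schools = {
    f"school{s}": [f"school{s}-student{i}" for i in range(20)] for s in range(10)
}

chosen_schools, surveyed = cluster_sample(schools, 3, rng)
print("Schools selected:", chosen_schools)
print("Students surveyed:", len(surveyed))
```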
Systematic Sampling
■ Works well when your population can be ordered in some way.
■ Randomly select a starting point and then select every kth person in the population to be in the sample.
■ Creates a sample that is similar to a simple random sample, since every member of the population has an equal chance of being chosen for the sample.
■ It is not the same as an SRS because not every possible sample has the same probability of being selected.
■ Example: production line
Systematic Sampling Example: Hospital employees
Suppose we have an alphabetized list of 1500 employees who work for a large hospital and we want to sample 100 of these employees.
For a systematic sample, we would choose a random number between 1 and 15 as a starting point, and then select every 15th person on the list to be in our sample.
Systematic Sampling
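The hospital example (1500 employees, a sample of 100, so k = 1500 / 100 = 15) can be sketched as below. The employee labels are placeholders, and the function name `systematic_sample` is an assumption for this illustration.

```python
import random


def systematic_sample(ordered_population, k, rng):
    """Pick a random start among the first k positions, then take every kth unit."""
    start = rng.randint(1, k)  # 1-based position of the first selected unit
    return ordered_population[start - 1::k]


# Hypothetical alphabetized list of 1500 employees.
rng = random.Random(200)
employees = [f"employee{i:04d}" for i in range(1, 1501)]

k = len(employees) // 100  # 1500 / 100 = 15
sample = systematic_sample(employees, k, rng)
print("Sample size:", len(sample))
print("First three selected:", sample[:3])
```

Only 15 distinct samples are possible here, one for each starting point, which is why a systematic sample is not the same as a simple random sample.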
Convenience Sampling
■ Samples that are obtained by measuring whatever or whoever is available to be measured.
■ Rarely (if ever) representative of a larger population.
■ Example:
Example: Sampling Strategies (a)
A stats class has roughly 360 students, who attend one lecture but are divided into recitation sections led by different TAs. The professor of this course wants to learn how satisfied the students are with the class.
a. How might the professor obtain an SRS of the class?
Example: Sampling Strategies (b)
A stats class has roughly 360 students, who attend one lecture but are divided into recitation sections led by different TAs. The professor of this course wants to learn how satisfied the students are with the class.
b. If the professor believes that recitation section might influence student satisfaction levels, what type of sampling method might be appropriate? Why?
Example: Sampling Strategies (c)
A stats class has roughly 360 students, who attend one lecture but are divided into recitation sections led by different TAs. The professor of this course wants to learn how satisfied the students are with the class.
c. If the professor believes that students’ declared majors might influence their satisfaction levels, what type of sampling method might be appropriate? Why?
Bias: When randomization fails
■ Selection bias occurs if the method for selecting the participants produces a sample that does not represent the population of interest.
■ Nonresponse bias occurs when a representative sample is chosen for a survey, but a subset cannot be contacted or does not respond.
■ Response bias occurs when participants respond differently from how they truly feel. The way questions are worded, the way the interviewer behaves, as well as many other factors might lead an individual to provide false information.
Response bias: examples (bad wording, or an order of questions that suggests a response)
■ Bad wording (“loaded question”):
  – 97% yes: “Should the President have the line item veto to eliminate waste?”
  – 57% yes: “Should the President have the line item veto, or not?”
■ The order:
  – “Does traffic contribute more or less to air pollution than industry?”
  – “Does industry contribute more or less to air pollution than traffic?”
Response Bias Hack
■ https://www.youtube.com/watch?v=nwJ0qY_rP0A
Key Idea: Explanatory & Response Variables
■ To identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other.
explanatory variable → (might affect) → response variable
Explanatory and response variables
■ Consider a research question recently showcased in the scientific journal Nature:
  – Are infants raised without exposure to total darkness more likely to suffer from myopia (short-sightedness) later in life?
■ The researchers suspected darkness during sleep influences vision later in life, so darkness can be considered the ___________ variable of the study and vision can be considered the ____________ variable.
Light exposure and myopia
■ In our study above, researchers recorded whether each child selected to be in the study slept with or without a night-light.
■ They returned to each child years later and found that those who slept with night-lights were much more likely to have developed myopia [short-sightedness] and need glasses.
■ Should we conclude that too much light exposure at night increases the risk of an infant developing short-sightedness later in life?
Limits of Observational Studies
■ It is tempting to conclude sleeping with the lights on has a _______ relationship with myopia.
■ In fact, many of the infants studied were the children of myopic parents, who were more likely to leave a night-light on when caring for their children at an early age.
■ The parents’ vision in this study is an example of a ________ variable.
Confounding variables
■ Confounding variables are variables that are associated with both the explanatory and response variables, but go unnoticed by the researcher.
■ They are the main reason why it is so difficult to demonstrate a causal relationship with an observational study.
Find the confounding variable (a)
■ An observational study tracked sunscreen use and melanoma (skin cancer) diagnoses, and it later found that increases in sunscreen use led to an increased risk of skin cancer.
  – Explanatory:
  – Response:
  – Confounding:
Find the confounding variable (b)
■ Researchers took a random sample of rural residents and recorded observations of their sleeping habits (quality and average length of nightly sleep) along with whether each participant was obese, overweight, or of a healthy weight. Results suggested that a lack of sufficient nightly sleep increases an individual’s risk of obesity.
  – Explanatory:
  – Response:
  – Confounding:
Find the confounding variable (c)
■ A well-known old wives’ tale among palm-readers states that those with long lines across their palms will reliably live a long life. Researchers at a teaching hospital took a random sample of human autopsies, which included each individual’s age at death, along with many weight, width, and girth measurements of their corpse. Contrary to the opinion of palm-readers, they found that individuals with smaller palm widths had longer average life expectancies.
  – Explanatory:
  – Response:
  – Confounding:
Why conduct observational studies?
■ Observational studies are good at showing __________ a relationship exists, and are terrible at showing _________ a relationship exists, but are still worth conducting when studying many scientific phenomena.
Controlled experiments
■ Controlled experiments can demonstrate ______________ between variables.
■ Instead of gathering data from simple observations [as in observational studies], researchers _______________ treatments to different cases and observe the treatment’s effect.
Example: vaping and antidepressants
■ Suppose you have 100 college students who would like to quit vaping, and suspect that administering an antidepressant daily might lessen withdrawal symptoms and help them to quit.
■ Think about it!
■ How might you determine whether antidepressants lessen withdrawal symptoms?
Vaping and antidepressants
■ Identify explanatory & response variables:
■ Explanatory:
  – application of daily antidepressants for 30 days
■ Response:
  – whether participant has relapsed after 30 days
Experimental design principle 1
1. Controlling: Control for possible confounding variables.
■ The response is measured against a control group that is not administered a treatment.
  – Half of our college students should receive the treatment
  – Half of our college students should receive a ‘placebo’
■ Additionally, researchers attempt to control other variables that might influence the response. For our study, we might restrict access to other substances that researchers believe influence whether students relapse back to vaping.
Placebos and the placebo effect
■ What is a placebo?
  – A placebo is a fake or sham treatment, something designed to look like the treatment, but lacking the active ingredient.
■ Why use a placebo?
  – People are known to respond just to the idea of getting a treatment.
  – If everyone receives a “treatment” then the mental effect of treatment is spread evenly in both treatment and control groups.
Blind and double blind studies
■ Blind: When a patient does not know if they are in the treatment or the control group.
  – Placebos are generally used in blind studies
■ Double blind: When both patients and evaluators don’t know if the patient is in the treatment or control group
  – Eliminates doctor/researcher bias when evaluating symptoms
■ Practical and ethical limitations
Experimental design principle 2
2. Randomization: Randomize the assignment of treatments to cases to account for confounding variables that cannot be controlled.
■ Some students might be more likely to relapse into vaping because individuals in their friend group also vape.
■ Randomizing students into the treatment and control groups helps ‘even out’ these differences.
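Random assignment for the vaping study might be sketched as follows. This is only an illustration: the student labels are placeholders, and shuffling the list and splitting it in half is one simple way (an assumption here, not a prescribed procedure) to give each of the 100 students the same chance of landing in either group.

```python
import random


def randomize_assignment(subjects, rng):
    """Shuffle the subjects, then split them evenly into treatment and control groups."""
    shuffled = list(subjects)  # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical roster of the 100 college students.
rng = random.Random(200)
students = [f"student{i}" for i in range(100)]

groups = randomize_assignment(students, rng)
print("Treatment group size:", len(groups["treatment"]))
print("Control group size:", len(groups["control"]))
```

Because chance alone decides who receives the antidepressant, traits such as having friends who vape tend to be spread across both groups rather than concentrated in one.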
Experimental design principle 3
Experimental design principle 4
4. Blocking: Blocking tries to address the same problem as randomization through the opposite means.
■ Instead of randomizing all students directly into the control and treatment groups, we might first block them into two categories (those with friends who vape and those without) and then randomly assign students in each of these categories to the control and treatment groups.
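Blocking on whether a student has friends who vape could be sketched as below. The split of 40 students with friends who vape and 60 without is a made-up assumption for illustration, as are the labels and the function name `blocked_assignment`. Within each block, students are shuffled and split evenly between treatment and control.

```python
import random


def blocked_assignment(subjects, block_of, rng):
    """Group subjects into blocks, then randomize to treatment/control inside each block."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of(subject), []).append(subject)
    assignment = {"treatment": [], "control": []}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        assignment["treatment"].extend(shuffled[:half])
        assignment["control"].extend(shuffled[half:])
    return assignment


# Hypothetical students: (label, has_friends_who_vape); 40 with, 60 without.
rng = random.Random(200)
students = [(f"student{i}", i < 40) for i in range(100)]

assignment = blocked_assignment(students, lambda s: s[1], rng)
for group, members in assignment.items():
    with_friends = sum(1 for _, vapes in members if vapes)
    print(group, "size:", len(members), "with friends who vape:", with_friends)
```

Each group ends up with the same number of students from each block, so the friend-group difference is balanced by design rather than left to chance.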