STAT 101 Dr Kari Lock Morgan Exam 2


STAT 101 Dr. Kari Lock Morgan Exam 2 Review Statistics: Unlocking the Power of Data Lock 5

Exam Details
• Wednesday, 4/2
• Closed to everything except two double-sided pages of notes and a non-cell-phone calculator
  • Page of notes should be prepared by you; no sharing
  • Okay to use materials from class for your page of notes
• Best ways to prepare:
  • #1: WORK LOTS OF PROBLEMS!
  • Make a good page of notes
  • Read sections you are still confused about
  • Come to office hours and clarify confusion
• Cumulative, but emphasis is on material since Exam 1 (Chapters 5-9; we skipped 8.2 and 9.2)

Practice Problems
• Practice exam online (under resources)
• Solutions to odd essential synthesis and review problems online (under resources)
• Solutions to all odd problems in the book on reserve at Perkins

Office Hours and Help
• Monday 3–4 pm: Prof Morgan, Old Chem 216
• Monday 4–6 pm: Stephanie Sun, Old Chem 211A
• Tuesday 3–5 pm (extra): Prof Morgan, Old Chem 216
• Tuesday 5–7 pm: Wenjing Shi, Old Chem 211A
• Tuesday 7–9 pm: Mao Hu, Old Chem 211A
• REVIEW SESSION: 5–6 pm Tuesday, Social Sciences 126

Stat Education Center
Reminder: the Stat Education Center in Old Chem 211A is open Sunday–Thursday, 4 pm–9 pm, with stat majors and stat PhD students available to answer questions

Two Options for p-values
We have learned two ways of calculating p-values. The only difference is how to create a distribution of the statistic, assuming the null is true:
1) Simulation (Randomization Test):
• Directly simulate what would happen, just by random chance, if the null were true
2) Formulas and Theoretical Distributions:
• Use a formula to create a test statistic whose theoretical distribution is known when the null is true, if sample sizes are large enough
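The simulation option can be sketched in a few lines of Python. This is only an illustration for a difference in means: the function name `randomization_p_value`, the two-tailed choice, and the data values are all made up for the example and do not come from the course materials.

```python
import random
from statistics import mean

def randomization_p_value(group_a, group_b, n_sims=5000, seed=0):
    """Two-tailed randomization p-value for a difference in means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_sims):
        # If the null is true, group labels are interchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        simulated = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(simulated) >= abs(observed):
            extreme += 1
    # Proportion of simulated statistics at least as extreme as the observed one
    return extreme / n_sims

# Hypothetical data, for illustration only
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 16.0, 14.4]
control = [11.0, 12.2, 11.8, 12.9, 10.7, 12.5, 11.4, 12.0]
p_value = randomization_p_value(treatment, control)
print(p_value)
```

The formula-based option would replace the loop with a standardized test statistic and a theoretical distribution; the logic of "how extreme is the observed statistic if the null is true" stays the same.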

Two Options for Intervals
We have learned two ways of calculating intervals:
1) Simulation (Bootstrap):
• Assess the variability in the statistic by creating many bootstrap statistics
2) Formulas and Theoretical Distributions:
• Use a formula to calculate the standard error of the statistic, and use the normal or t-distribution to find z* or t*, if sample sizes are large enough
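As a rough sketch of both options side by side, the Python below builds a 95% interval for a single proportion two ways: a bootstrap percentile interval and the normal-based formula. The function names, the percentile method, and the data (120 successes out of 200) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def bootstrap_interval(data, stat=mean, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap interval: resample the sample with replacement."""
    rng = random.Random(seed)
    n = len(data)
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    tail = (1 - level) / 2
    lower = boot_stats[int(tail * n_boot)]
    upper = boot_stats[int((1 - tail) * n_boot) - 1]
    return lower, upper

def formula_interval_proportion(successes, n, level=0.95):
    """Normal-based interval: p-hat +/- z* times the standard error."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se

# Hypothetical sample: 120 successes (1) and 80 failures (0)
data = [1] * 120 + [0] * 80
boot_lo, boot_hi = bootstrap_interval(data)
form_lo, form_hi = formula_interval_proportion(120, 200)
print(boot_lo, boot_hi)
print(form_lo, form_hi)
```

With a sample this large, the two intervals should be close to each other.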

Pros and Cons
1) Simulation Methods
PROS:
• Methods tied directly to concepts, emphasizing conceptual understanding
• Same procedure for every statistic
• No formulas or theoretical distributions to learn and distinguish between
• Minimal math needed
CONS:
• Need entire dataset (if quantitative variables)
• Need a computer
• Newer approach

Pros and Cons
2) Formulas and Theoretical Distributions
PROS:
• Only need summary statistics
• Only need a calculator
• More commonly used
CONS:
• Plugging numbers into formulas does little for conceptual understanding
• Many different formulas and distributions to learn and distinguish between
• Harder to see the big picture when the details are different for each statistic
• Doesn't work for small sample sizes
• Requires more math and background knowledge

Accuracy
• The accuracy of simulation methods depends on the number of simulations (more simulations = more accurate)
• The accuracy of formulas and theoretical distributions depends on the sample size (larger sample size = more accurate)
• If the sample size is large and you have generated many simulations, the two methods should give essentially the same answer

Data Collection
Was the sample randomly selected?
• Yes: Possible to generalize to the population
• No: Should not generalize to the population
Was the explanatory variable randomly assigned?
• Yes: Possible to make conclusions about causality
• No: Cannot make conclusions about causality

Variable(s) | Visualization | Summary Statistics
Categorical | bar chart, pie chart | frequency table, relative frequency table, proportion
Quantitative | dotplot, histogram, boxplot | mean, median, max, min, standard deviation, z-score, range, IQR, five number summary
Categorical vs Categorical | side-by-side bar chart, segmented bar chart | two-way table, difference in proportions
Quantitative vs Categorical | side-by-side boxplots | statistics by group, difference in means
Quantitative vs Quantitative | scatterplot | correlation, simple linear regression

Confidence Interval
• A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples
• A 95% confidence interval will contain the true parameter for 95% of all samples
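The "95% of all samples" statement can be checked by simulation. The sketch below draws many samples from a population whose mean is known, builds an interval from each, and counts how often the interval captures the true mean. The population (normal with mean 10 and SD 2), the sample size, and the use of z* in place of t* are simplifying assumptions for illustration, so the observed rate will only be approximately 95%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def coverage_rate(true_mean=10.0, true_sd=2.0, n=50,
                  n_samples=1000, level=0.95, seed=1):
    """Fraction of simulated samples whose interval captures the true mean."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    captured = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        center = mean(sample)
        se = stdev(sample) / sqrt(n)  # standard error of the sample mean
        if center - z_star * se <= true_mean <= center + z_star * se:
            captured += 1
    return captured / n_samples

rate = coverage_rate()
print(rate)  # expected to land near 0.95
```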

Hypothesis Testing
• How unusual would it be to get results as extreme as (or more extreme than) those observed, if the null hypothesis is true?
• If it would be very unusual, then the null hypothesis is probably not true!
• If it would not be very unusual, then there is no evidence against the null hypothesis

p-value
• The p-value is the probability of getting a statistic as extreme as (or more extreme than) that observed, just by random chance, if the null hypothesis is true
• The p-value measures evidence against the null hypothesis

Hypothesis Testing
1. State hypotheses
2. Calculate a test statistic, based on your sample data
3. Create a distribution of this test statistic, as it would be observed if the null hypothesis were true
4. Use this distribution to measure how extreme your test statistic is

Distribution of the Sample Statistic
1. Sampling distribution: distribution of the statistic based on many samples from the population
2. Bootstrap distribution: distribution of the statistic based on many samples with replacement from the original sample
3. Randomization distribution: distribution of the statistic assuming the null hypothesis is true
4. Normal, t, χ², F: theoretical distributions used to approximate the distribution of the statistic

Sample Size Conditions
• For large sample sizes, either simulation methods or theoretical methods work
• If sample sizes are too small, only simulation methods can be used

Using Distributions
• For confidence intervals, you find the desired percentage in the middle of the distribution, then find the corresponding value on the x-axis
• For p-values, you find the value of the observed statistic on the x-axis, then find the area in the tail(s) of the distribution
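For the standard normal distribution, both directions can be illustrated with Python's `statistics.NormalDist`. The helper names `z_star` and `normal_p_value` are made up for this sketch; the t-distribution works the same way but needs a table or statistical software.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_star(level):
    """Area to x-axis: the value leaving `level` in the middle of the curve."""
    return std_normal.inv_cdf(1 - (1 - level) / 2)

def normal_p_value(z, tail="two"):
    """x-axis to area: tail area beyond an observed z statistic."""
    if tail == "upper":
        return 1 - std_normal.cdf(z)
    if tail == "lower":
        return std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

print(round(z_star(0.95), 3))          # about 1.96
print(round(normal_p_value(2.0), 4))   # about 0.0455
```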

Confidence Intervals
Return to original scale with

Hypothesis Testing

General Formulas
• When performing inference for a single parameter (or difference in two parameters), the following formulas are used:
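The formulas themselves are not preserved in this text. As an assumption about what was shown, the standard general forms for an interval and a test statistic are written out below, where SE is the standard error and z* or t* comes from the normal or t distribution:

```latex
\[
\text{sample statistic} \;\pm\; z^{*} \cdot SE
\qquad \text{or} \qquad
\text{sample statistic} \;\pm\; t^{*} \cdot SE
\]
\[
z \ (\text{or } t) \;=\; \frac{\text{sample statistic} - \text{null value}}{SE}
\]
```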

General Formulas
• For proportions (categorical variables) with only two categories, the normal distribution is used
• For inference involving any quantitative variable (means, correlation, slope), if categorical variables only have two categories, the t distribution is used

Standard Error
• The standard error is the standard deviation of the sample statistic
• The formula for the standard error depends on the type of statistic (which depends on the type of variable(s) being analyzed)

Standard Error Formulas
Parameter | Distribution
Proportion | Normal
Difference in Proportions | Normal
Mean | t, df = n – 1
Difference in Means | t, df = min(n1, n2) – 1
Correlation | t, df = n – 2
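The standard error column of this table is not preserved in the text. The Python sketch below uses the commonly taught standard error formulas for these five cases; treat them as an assumption about what the table contained. The function names are hypothetical, and for hypothesis tests about proportions the null (or pooled) proportion is typically plugged in rather than the sample proportion.

```python
from math import sqrt

def se_proportion(p, n):
    """Standard error of a single proportion."""
    return sqrt(p * (1 - p) / n)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference in two proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_mean(s, n):
    """Standard error of a sample mean (s = sample standard deviation)."""
    return s / sqrt(n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference in two means."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_correlation(r, n):
    """Standard error of a sample correlation, used with df = n - 2."""
    return sqrt((1 - r ** 2) / (n - 2))

# Hypothetical summary statistics
print(se_proportion(0.5, 100))
print(se_mean(10, 25))
```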

Multiple Categories
• These formulas do not work for categorical variables with more than two categories, because there are multiple parameters
• For one or two categorical variables with multiple categories, use χ² tests (goodness of fit for one categorical variable, test for association for two)
• For testing for a difference in means across multiple groups, use ANOVA

Chi-Square Test for Goodness of Fit
1. State null hypothesized proportions for each category, pi. The alternative is that at least one of the proportions is different from that specified in the null.
2. Calculate the expected counts for each cell as n·pi. Make sure they are all greater than 5 to proceed.
3. Calculate the χ² statistic: χ² = Σ (observed – expected)² / expected
4. Compute the p-value as the area in the tail above the χ² statistic, for a χ² distribution with df = (# of categories – 1)
5. Interpret the p-value in context.
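Steps 2 to 4 can be sketched in Python as follows. The standard library has no χ² distribution, so this sketch estimates the p-value by simulating counts under the null instead of using the theoretical tail area. That is a swapped-in technique, not the table lookup described in step 4. The function names and the observed counts are made up for illustration.

```python
import random

def chi_square_gof(observed, null_props):
    """Return the chi-square statistic and df for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_props]
    if min(expected) <= 5:
        raise ValueError("expected counts should all be greater than 5")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

def simulated_p_value(observed, null_props, n_sims=2000, seed=0):
    """Estimate the upper-tail p-value by simulating samples under the null."""
    rng = random.Random(seed)
    n, k = sum(observed), len(observed)
    obs_stat, _ = chi_square_gof(observed, null_props)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        counts = [0] * k
        for category in rng.choices(range(k), weights=null_props, k=n):
            counts[category] += 1
        sim_stat, _ = chi_square_gof(counts, null_props)
        if sim_stat >= obs_stat:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims

# Hypothetical counts for three categories, null: all equally likely
observed = [30, 20, 10]
null_props = [1 / 3, 1 / 3, 1 / 3]
stat, df = chi_square_gof(observed, null_props)
p_value = simulated_p_value(observed, null_props)
print(stat, df, p_value)
```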

Chi-Square Test for Association
1. H0: The two variables are not associated
   Ha: The two variables are associated
2. Calculate the expected counts for each cell: expected = (row total × column total) / n. Make sure they are all greater than 5 to proceed.
3. Calculate the χ² statistic: χ² = Σ (observed – expected)² / expected
4. Compute the p-value as the area in the tail above the χ² statistic, for a χ² distribution with df = (r – 1)(c – 1)
5. Interpret the p-value in context.
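A minimal sketch of steps 2 to 4 for a two-way table, stopping short of the p-value (which would come from a χ² table or software). The function name and the 2×2 counts are hypothetical.

```python
def chi_square_association(table):
    """Expected counts, chi-square statistic, and df for a two-way table.

    `table` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for each cell: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, stat, df

# Hypothetical 2x2 table of counts
expected, stat, df = chi_square_association([[20, 30], [30, 20]])
print(expected, stat, df)
```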

Analysis of Variance
• Analysis of Variance (ANOVA) compares the variability between groups to the variability within groups
• Total Variability = Variability Between Groups + Variability Within Groups

ANOVA Table
Source | df | Sum of Squares | Mean Square | F Statistic | p-value
Groups | k-1 | SSG | MSG = SSG/(k-1) | MSG/MSE | Use F(k-1, n-k)
Error | n-k | SSE | MSE = SSE/(n-k) | |
Total | n-1 | SST | | |
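The entries of the table can be computed directly from raw data. The sketch below fills in every column except the p-value, which requires the F distribution with k-1 and n-k degrees of freedom. The function name, dictionary keys, and data are made up for illustration.

```python
from statistics import mean

def anova_table(groups):
    """Compute ANOVA sums of squares, mean squares, and the F statistic."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Between-group variability: group means around the grand mean
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variability: values around their own group mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    msg = ssg / (k - 1)
    mse = sse / (n - k)
    return {
        "df_groups": k - 1, "df_error": n - k, "df_total": n - 1,
        "SSG": ssg, "SSE": sse, "SST": sst,
        "MSG": msg, "MSE": mse, "F": msg / mse,
    }

# Hypothetical data: three groups of three values
result = anova_table([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]])
print(result)
```

In this example SSG + SSE equals SST, matching the split of total variability into between-group and within-group pieces.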

Simple Linear Regression
• Simple linear regression estimates the population model y = β0 + β1·x + ε
• with the sample model: ŷ = b0 + b1·x
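As an illustration of estimating the sample model, the sketch below computes the least-squares intercept and slope from raw data, assuming the usual form ŷ = b0 + b1·x. The function name `fit_line` and the data are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```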

Inference for the Slope

Intervals
• A confidence interval has a given chance of capturing the mean y value at a specified x value (the point on the line)
• A prediction interval has a given chance of capturing the y value for a particular case at a specified x value (the actual point)

Conditions for SLR
Inference based on the simple linear model is only valid if the following conditions hold:
1) Linearity
2) Constant Variability of Residuals
3) Normality of Residuals

Inference Methods
http://prezi.com/c1xz1on-p4eb/stat-101/