Simple Linear Regression: Conditions, Confidence Intervals, Prediction Intervals

Simple Linear Regression
• Conditions
• Confidence intervals
• Prediction intervals
Section 9.1, 9.2, 9.3
Professor Kari Lock Morgan, Duke University

To Do
• Homework 8 (due Monday, 4/9)
• Project 2 Proposal (due Wednesday, 4/11)

Hypothesis Test
> 2*pt(16.21, 5, lower.tail=FALSE)
[1] 1.628701e-05
There is strong evidence that the slope is different from 0, and that there is an association between cricket chirp rate and temperature.

Correlation
Test for a correlation between temperature and cricket chirps (r = 0.9906).
> 2*pt(16.21, 5, lower.tail=FALSE)
[1] 1.628701e-05

Two Quantitative Variables
• The t-statistic (and p-value) for a test for a non-zero slope and a test for a non-zero correlation are identical!
• They are equivalent ways of testing for an association between two quantitative variables.
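
The equivalence of the two tests can be checked numerically. The sketch below uses simulated data (hypothetical, not the cricket data from the slides) and compares the slope test from `lm()` with `cor.test()`. It also recomputes the t-statistic from r alone, using the r = 0.9906 and 5 degrees of freedom reported above; the result is close to 16.21 but not identical, presumably because r is rounded.

```r
# Simulated (hypothetical) data, not the cricket data from the slides
set.seed(1)
x <- runif(20, 10, 30)
y <- 5 + 2 * x + rnorm(20, sd = 4)

# Test for a non-zero slope
fit <- lm(y ~ x)
t_slope <- summary(fit)$coefficients["x", "t value"]
p_slope <- summary(fit)$coefficients["x", "Pr(>|t|)"]

# Test for a non-zero correlation (Pearson)
ct <- cor.test(x, y)
t_cor <- unname(ct$statistic)
p_cor <- ct$p.value

# The same t-statistic computed from r and n only
r <- cor(x, y)
n <- length(x)
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Values reported on the slides: r = 0.9906 with df = n - 2 = 5
t_slides <- 0.9906 * sqrt(5) / sqrt(1 - 0.9906^2)

print(c(t_slope = t_slope, t_cor = t_cor, t_from_r = t_from_r))
print(t_slides)
```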

Simple Linear Regression
• Simple linear regression estimates the population model $y = \beta_0 + \beta_1 x + \varepsilon$
• with the sample model $\hat{y} = b_0 + b_1 x$

Conditions
Inference based on the simple linear model is only valid if the following conditions hold:
1) Linearity
2) Constant variability of residuals
3) Normality of residuals
4) Independence

Linearity
• The relationship between x and y is linear (it makes sense to draw a line through the scatterplot)

Dog Years
Happy Birthday Charlie!
• From www.dogyears.com: “The old rule-of-thumb that one dog year equals seven years of a human life is not accurate. The ratio is higher with youth and decreases a bit as the dog ages.”
• 1 dog year = 7 human years
• Linear: human age = 7 × dog age
[Plot legend: ACTUAL, LINEAR]
A linear model can still be useful, even if it doesn’t perfectly fit the data.

“All models are wrong, but some are useful” -George Box

Simple Linear Model

Residual Plot
• A residual plot is a scatterplot of the residuals against the predicted responses
Should have:
1) No obvious pattern or trend
2) Constant variability
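
A residual plot of this kind can be drawn in a few lines. This is a minimal sketch on simulated data (hypothetical values chosen only for illustration), using base R's `fitted()` and `resid()` on an `lm` fit.

```r
# Simulated (hypothetical) data that satisfy the linear model
set.seed(2)
x <- runif(50, 0, 10)
y <- 3 + 1.5 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)

# Residuals against predicted responses, with a reference line at 0
plot(fitted(fit), resid(fit),
     xlab = "Predicted response", ylab = "Residual",
     main = "Residual plot")
abline(h = 0, lty = 2)
```

Because these data were generated from a linear model with constant error spread, the points should scatter evenly around the dashed line, with no trend and no funnel shape.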

Residual Plots
[Figures: a residual plot with an obvious pattern; a residual plot with non-constant variability]

Residuals (errors)
Conditions for residuals:
• The errors are normally distributed: check with a histogram
• The average of the errors is 0 (always true for least squares regression)
• The standard deviation of the errors is constant for all cases: check for constant vertical spread in the residual plot
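
The first two conditions can be checked directly from a fitted model. The sketch below uses simulated data (hypothetical) to show that least squares residuals average to 0 up to floating-point error, and draws the histogram used to judge normality.

```r
# Simulated (hypothetical) data
set.seed(3)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40, sd = 3)

fit <- lm(y ~ x)
e <- resid(fit)

# Always (numerically) zero for a least squares fit with an intercept
mean_resid <- mean(e)
print(mean_resid)

# Histogram of residuals: look for a roughly symmetric, bell-shaped distribution
hist(e, main = "Histogram of residuals", xlab = "Residual")
```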

Independence
• The independence assumption is that the residuals are independent from case to case
• The residual of one case is not associated with the residual of another case
• This condition is checked by thinking about how the data were collected, not necessarily by looking at a plot
• Can you think of data for which the independence assumption doesn’t hold?

Data over Time
• Data collected over time usually violate the independence condition, because the residual of one case may be similar to the residuals of the cases directly before or after
• Example: the Dow Jones index over the past month
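
One way to see this dependence is to correlate each residual with the one before it. The sketch below is an illustration on simulated series, not the Dow Jones data: a random walk stands in for data collected over time, and `lag1_cor` is a helper defined here for the example.

```r
set.seed(8)
n <- 200
day <- seq_len(n)

# Hypothetical series 1: a random walk (each value builds on the previous one)
walk <- cumsum(rnorm(n))
res_walk <- resid(lm(walk ~ day))

# Hypothetical series 2: a linear trend with independent errors
indep <- 0.05 * day + rnorm(n)
res_indep <- resid(lm(indep ~ day))

# Correlation between each residual and the next one
lag1_cor <- function(e) cor(e[-length(e)], e[-1])

lag1_walk <- lag1_cor(res_walk)    # large and positive: neighbors look alike
lag1_indep <- lag1_cor(res_indep)  # near 0: no carry-over between cases
print(c(random_walk = lag1_walk, independent = lag1_indep))
```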

Conditions not Met?
• If the association isn’t linear:
  • Try to make it linear (next week)
  • If you can’t make it linear, then simple linear regression isn’t a good fit for the data
• If variability is not constant, residuals are not normal, or cases are not independent:
  • The model itself is still valid, but inference may not be accurate

Simple Linear Regression
1) Plot your data! (linear association?)
2) Fit the model (least squares)
3) Use the model (interpret coefficients and/or make predictions)
4) Look at residual plot (no obvious pattern? constant variability?)
5) Look at histogram of residuals (normal?)
6) Cases independent?
7) Inference (extend to population)

President Approval and Re-Election
• Can we use Obama’s approval rating to predict his margin of victory (or defeat) when he runs for re-election?
• Data on all* incumbent U.S. presidential candidates since 1940 (11 cases)
• Response: margin of victory (or defeat)
• Explanatory: approval rating
*Except 1944, because Gallup went on a wartime hiatus
Source: Silver, Nate, “Approval Ratings and Re-Election Odds”, fivethirtyeight.com, posted on 1/28/11.

President Approval and Re-Election
1. Plot the data: Is the trend linear?
(a) Yes
(b) No

President Approval and Re-Election
2. Fit the model:
Predicted margin of victory = -36.5 + 0.84 × approval rating

President Approval and Re-Election
3. Use the model: Which of the following is a correct interpretation?
a) For every percentage point increase in margin of victory, approval increases by 0.84 percentage points
b) For every percentage point increase in approval, predicted margin of victory increases by 0.84 percentage points
c) For every 0.84 increase in approval, predicted margin of victory increases by 1

President Approval and Re-Election
3. Use the model: Obama’s current approval rating (based on polls 3/26/12 – 4/1/12) is 46%. Based only on this information, do you think Obama will
a) Win
b) Lose
Predicted margin of victory = -36.5 + 0.84 × 46 = 2.14, which is positive
Source: http://www.gallup.com/poll/116479/barack-obamapresidential-job-approval.aspx
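
The prediction above is just the fitted line evaluated at x = 46. A minimal sketch, using the intercept and slope reported on the slide; the function name `predict_margin` is made up for this example.

```r
# Coefficients as reported on the slide
b0 <- -36.5   # intercept
b1 <- 0.84    # slope: percentage points of margin per point of approval

# Hypothetical helper: predicted margin of victory for a given approval rating
predict_margin <- function(approval) b0 + b1 * approval

predicted <- predict_margin(46)
will_win <- predicted > 0   # a positive margin means a predicted win
print(predicted)
print(will_win)
```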

President Approval and Re-Election
4. Look at residual plot:
Is there no obvious trend? (a) Yes (b) No
Is the variability constant? (a) Yes (b) No

President Approval and Re-Election
5. Look at histogram of residuals:
Are the residuals approximately normally distributed? (a) Yes (b) No

President Approval and Re-Election
6. Cases independent?
Are the cases independent? (a) Yes (b) No

President Approval and Re-Election
7. Inference
Should we do inference? (a) Yes (b) No
Due to the non-constant variability, the non-normal residuals, and the small sample size, our inferences may not be entirely accurate. However, they are still better than nothing!

President Approval and Re-Election
7. Inference
Give a 95% confidence interval for the slope coefficient. Is it significantly different from 0? (a) Yes (b) No
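
The slides do not include the election data, so the sketch below uses simulated data with 11 cases (hypothetical values, generated around the slide's fitted line purely for illustration) to show how such an interval can be obtained, both with `confint()` and by hand as estimate ± t* × SE with n − 2 degrees of freedom.

```r
# Simulated (hypothetical) data: NOT the real approval/margin data
set.seed(4)
sim_approval <- runif(11, 35, 70)
sim_margin <- -36.5 + 0.84 * sim_approval + rnorm(11, sd = 6)

fit <- lm(sim_margin ~ sim_approval)

# 95% confidence interval for the slope
ci <- confint(fit, "sim_approval", level = 0.95)

# The same interval by hand: estimate +/- t* x SE, with df = n - 2
est <- coef(summary(fit))["sim_approval", "Estimate"]
se <- coef(summary(fit))["sim_approval", "Std. Error"]
t_star <- qt(0.975, df = df.residual(fit))
ci_hand <- est + c(-1, 1) * t_star * se

# The slope is significantly different from 0 at the 5% level
# exactly when the interval does not contain 0
excludes_zero <- ci[1] > 0 || ci[2] < 0
print(ci)
print(excludes_zero)
```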

President Approval and Re-Election
7. Inference: We don’t really care about the slope coefficient; we care about Obama’s margin of victory in the upcoming election! How do we create a 95% interval predicting Obama’s margin of victory?

Prediction
• We would like to use the regression equation to predict y for a certain value of x
• This includes not only a point estimate, but also interval estimates
• We will predict the value of y at x = x*

Point Estimate
• The point estimate for the average y value at x = x* is simply the predicted value: $\hat{y} = b_0 + b_1 x^*$
• Alternatively, you can think of it as the value on the line above the x value
• The uncertainty in this point estimate comes from the uncertainty in the coefficients

Confidence Intervals
• We can calculate a confidence interval for the average y value for a certain x value
  “We are 95% confident that the average y value for x = x* lies in this interval”
• Equivalently, the confidence interval is for the point estimate, or the predicted value
• This is the amount the line is free to “wiggle,” and the width of the interval decreases as the sample size increases

Bootstrapping
• We need a way to assess the uncertainty in predicted y values for a certain x value… any ideas?
• Take repeated samples, with replacement, from the original sample data (bootstrap)
• Each sample gives a slightly different fitted line
• If we do this repeatedly, the middle P% of predicted y values at x* gives a P% confidence interval for the predicted y value at x*
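
The bootstrap procedure described above can be sketched as follows. The data are simulated (hypothetical, not the election data), and the number of resamples `B` and the value `x_star` are choices made for this illustration.

```r
# Simulated (hypothetical) sample
set.seed(5)
n <- 30
dat <- data.frame(x = runif(n, 35, 70))
dat$y <- -36.5 + 0.84 * dat$x + rnorm(n, sd = 6)

x_star <- 46   # value of x at which to predict
B <- 1000      # number of bootstrap samples

boot_pred <- replicate(B, {
  # Resample cases with replacement from the original sample
  idx <- sample(nrow(dat), replace = TRUE)
  # Each resample gives a slightly different fitted line
  fit_b <- lm(y ~ x, data = dat[idx, ])
  # Predicted y at x* from this line
  unname(predict(fit_b, newdata = data.frame(x = x_star)))
})

# Middle 95% of the bootstrap predictions
boot_ci <- quantile(boot_pred, c(0.025, 0.975))

# Prediction from the original sample, for comparison
point_est <- unname(predict(lm(y ~ x, data = dat),
                            newdata = data.frame(x = x_star)))
print(boot_ci)
print(point_est)
```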

Bootstrap CI
The middle 95% of predicted values gives the confidence interval for the average (predicted) margin of victory for an incumbent president with an approval rating of 46%

Confidence Interval

Confidence Interval
• For x = 46%: (-2.89, 6.80)
• For an approval rating of 46%, we are 95% confident that the average margin of victory for incumbent U.S. presidents is between -2.89 and 6.80 percentage points
• But wait, this still doesn’t tell us about Obama! We don’t care about the average; we care about an interval for one incumbent president with an approval rating of 46%!

Prediction Intervals
• We can also calculate a prediction interval for y values for a certain x value
  “We are 95% confident that the y value for x = x* lies in this interval”
• This takes into account the variability in the line (in the predicted value) AND the uncertainty around the line (the random errors)

Intervals

Intervals
• A confidence interval has a given chance of capturing the mean y value at a specified x value
• A prediction interval has a given chance of capturing the y value for a particular case at a specified x value
• For a given x value, which will be wider?
a) Confidence interval
b) Prediction interval
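
Both intervals are available from `predict()` on an `lm` fit through its `interval` argument. The sketch below compares them at one x value on simulated data (hypothetical, not the election data); both are centered at the same predicted value, and only their widths differ.

```r
# Simulated (hypothetical) data
set.seed(6)
x <- runif(25, 35, 70)
y <- -36.5 + 0.84 * x + rnorm(25, sd = 6)
fit <- lm(y ~ x)

new <- data.frame(x = 46)

# Interval for the MEAN y at x = 46
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
# Interval for ONE NEW y at x = 46
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

ci_width <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
pi_width <- unname(pred_int[, "upr"] - pred_int[, "lwr"])
print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```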

Intervals
[Plot legend: Prediction, Confidence]

Intervals
• As the sample size increases:
  • the standard errors of the coefficients decrease
  • we are more sure of the equation of the line
  • the widths of the confidence intervals decrease
  • for a huge n, the width of the CI will be almost 0
• The prediction interval may be wide, even for large n, and depends more on the correlation between x and y (how well y can be linearly predicted by x)
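
A small simulation illustrates these points. The sketch below assumes a made-up population (line -36.5 + 0.84x with error standard deviation 6, chosen only for illustration) and a helper `widths` defined here; it reports the widths of both 95% intervals at x = 46 for increasing n.

```r
set.seed(7)

# Hypothetical helper: widths of the 95% CI and PI at x = 46 for a sample of size n
widths <- function(n, sigma = 6) {
  x <- runif(n, 35, 70)
  y <- -36.5 + 0.84 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  new <- data.frame(x = 46)
  ci <- predict(fit, newdata = new, interval = "confidence")
  pr <- predict(fit, newdata = new, interval = "prediction")
  c(n = n,
    ci_width = unname(ci[, "upr"] - ci[, "lwr"]),
    pi_width = unname(pr[, "upr"] - pr[, "lwr"]))
}

# One row per sample size
res <- t(sapply(c(10, 100, 10000), widths))
print(res)
```

In this setup the confidence interval width shrinks toward 0 as n grows, while the prediction interval width levels off near 2 × 1.96 × 6 ≈ 23.5, the spread of the random errors around the line.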

Prediction Interval
• Based on the data and the simple linear model (and using Obama’s current approval rating):
  • The predicted margin of victory for Obama is 2.14 percentage points
  • We are 95% confident that Obama’s margin of victory (or defeat) will be between -12.36 and 16.26 percentage points