Cautions about Regression
Cautionary Notes Regarding Regression
• Do not use linear models to describe non-linear associations.
• Correlation is not causation!
• Don’t extrapolate!
• Beware of influential points.
Association versus Causation
• An association between x and y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
• Our outcomes could be influenced by a confounding (lurking) variable.
• An experiment that controls confounding variables is best for establishing causation.
Extrapolation
• Recall we predicted the time of a game when 8 pitchers were used.
• What if 20 pitchers were used?
• Our data on pitchers used ranges from 2 to 14.
• Notes:
◦ We cannot be sure that the linear trend continues beyond the range of the data.
◦ Often the y-intercept is itself an extrapolation, because x = 0 lies outside the observed range.
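To make the extrapolation warning concrete, here is a minimal sketch that flags predictions outside the observed range. It uses the fitted coefficients quoted in the worked residual example in these notes and the stated data range of 2 to 14 pitchers. The function name and the range check are illustrative assumptions, not part of the original slides.

```python
# Coefficients quoted in the notes' worked example (game time vs. pitchers used).
INTERCEPT = 99.397149
SLOPE = 9.8740778

# Observed range of the predictor, as stated in the notes.
X_MIN, X_MAX = 2, 14


def predict_minutes(pitchers):
    """Return (predicted minutes, True if the prediction is an extrapolation).

    Hypothetical helper for illustration only.
    """
    predicted = INTERCEPT + SLOPE * pitchers
    is_extrapolation = not (X_MIN <= pitchers <= X_MAX)
    return predicted, is_extrapolation


# 8 is inside the data range; 20 and 0 (the y-intercept) are not.
for pitchers in (8, 20, 0):
    minutes, extrapolated = predict_minutes(pitchers)
    note = "extrapolation, treat with caution" if extrapolated else "within data range"
    print(f"{pitchers:>2} pitchers: {minutes:.1f} min ({note})")
```

The line still produces a number for 20 pitchers. The flag is only a reminder that nothing in the data supports the linear trend that far out.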
Influential Observations
• An influential observation is an observation whose deletion would drastically change the regression line.
[Figure: scatterplot showing the approximate regression line with the influential point and the approximate regression line without it. Point labels: “Likely an outlier in y, but not an influential point” and “Influential point (does not fit the relationship)”.]
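The definition above suggests a direct check: refit the line without the point and see how much it moves. The sketch below does this on small made-up data (not the baseball data from these notes). The helper name and both extra points are hypothetical, chosen so that one is far out in x and off the trend, while the other is unusual only in y.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical, roughly linear data (assumed for illustration).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# Candidate 1: far from the other x values and off the trend.
far_x, far_y = 15, 5.0
# Candidate 2: unusual in y only, located at the mean of x.
mid_x, mid_y = 3.5, 14.0

_, slope_base = least_squares(xs, ys)
_, slope_far = least_squares(xs + [far_x], ys + [far_y])
_, slope_mid = least_squares(xs + [mid_x], ys + [mid_y])

print(f"slope without extra points:   {slope_base:.3f}")
print(f"slope with far-x point added: {slope_far:.3f}")
print(f"slope with y-outlier added:   {slope_mid:.3f}")
```

In this toy data the far-x point changes the slope drastically, while the y-outlier at the center of the x values leaves the slope unchanged and only shifts the intercept.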
Influential Observations vs. Outliers
• We saw that being an outlier (in x or y) and being an influential point do not imply each other.
• Can a point be both an outlier and an influential point?
Residuals
• Remember we use the line to predict y from x.
• Error = observed y – predicted y
• Also called the residual.
• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
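Written out, the least-squares criterion described above looks as follows. The symbols b_0 (intercept), b_1 (slope), \hat{y}_i (predicted value), e_i (residual) and n (number of observations) are notation assumed here, since the slides do not name them.

```latex
\begin{align*}
\hat{y}_i &= b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i \\
(b_0, b_1) &= \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b_0 = \bar{y} - b_1 \bar{x}
\end{align*}
```

The last line is the standard closed-form solution for a single predictor.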
Calculating Residuals
• Predict the game time if 12 total pitchers were used: 99.397149 + 9.8740778(12) ≈ 217.886 minutes.
• An actual game that used 12 pitchers (4/25/2015, Min @ Sea) was 206 minutes long.
• The error in our prediction (residual) is observed – predicted: 206 – 217.886 = -11.886 minutes, so the line over-predicted this game by about 12 minutes.
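The same arithmetic can be checked in a few lines. This sketch uses the coefficients and the observed game from the example above, with the sign convention residual = observed – predicted. The function names are illustrative.

```python
# Fitted coefficients from the worked example (minutes vs. pitchers used).
INTERCEPT = 99.397149
SLOPE = 9.8740778


def predicted_minutes(pitchers):
    """Predicted game length from the fitted line."""
    return INTERCEPT + SLOPE * pitchers


def residual(observed_minutes, pitchers):
    """Residual = observed y minus predicted y."""
    return observed_minutes - predicted_minutes(pitchers)


# The 4/25/2015 Min @ Sea game: 12 pitchers, 206 minutes.
print(f"predicted: {predicted_minutes(12):.3f} min")
print(f"residual:  {residual(206, 12):.3f} min")
```

A negative residual means the observed game was shorter than the line predicted.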
Standardizing Residuals
• Residuals will be in units of y and will be relative to the data. How do we know something is unusual?
• Most software will calculate a “standardized residual”, similar to a z-score for residuals.
◦ Values outside ±2 (beyond roughly 95% of residuals) can be deemed “unusual” and may be worth investigating.
◦ Values outside ±3 (beyond roughly 99.7%) are very unusual and may be cause for concern.
◦ You may also see studentized (t) residuals, which behave similarly.
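As a rough illustration of what software does here, the sketch below computes standardized residuals for a simple one-predictor fit using one common definition: each residual divided by s·sqrt(1 – h), where s is the residual standard error and h is the point's leverage. Exact definitions vary between packages, so treat this as an assumption. The data are made up, with one deliberately unusual game at x = 6.

```python
import math


def standardized_residuals(xs, ys):
    """Standardized residuals e_i / (s * sqrt(1 - h_i)) for a simple linear fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    # Raw residuals: observed minus predicted.
    raw = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))

    result = []
    for x, e in zip(xs, raw):
        leverage = 1 / n + (x - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result


# Hypothetical data (not the baseball data): y is roughly 100 + 10x,
# except the point at x = 6, which sits about 40 units above the trend.
xs = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
ys = [121, 128, 142, 149, 200, 171, 178, 192, 199, 220]

z = standardized_residuals(xs, ys)
unusual = [i for i, value in enumerate(z) if abs(value) > 2]

for i in unusual:
    print(f"observation {i} (x = {xs[i]}): standardized residual {z[i]:.2f}")
```

Only the planted point is flagged by the ±2 rule in this toy data. With real data, a flag is a prompt to investigate the observation, not a reason to delete it.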
Residuals in Minitab
• In Minitab:
◦ Store Model:
▪ Stat -> Regression -> Fit Regression Model
▪ Choose the response (y) and predictor (x) variables.
◦ The default output will alert you to any unusual observations.
Residuals in Minitab cont…
• We can also have Minitab calculate and print all residuals.
• In Minitab:
◦ Store Model:
▪ Stat -> Regression -> Fit Regression Model
▪ Choose the response (y) and predictor (x) variables.
▪ Click the Results button and choose “For all Observations” in the Fits and Diagnostics drop-down menu.