STATS 330: Lecture 10 (9/15/2020)



Diagnostics 2
Aim of today's lecture:
§ To describe some more remedies for non-planar data
§ To look at diagnostics and remedies for non-constant scatter


Remedies for non-planar data (cont.)
§ Last time we looked at diagnostics for non-planar data.
§ We discussed what to do if the diagnostics indicate a problem.
§ The short answer was: we transform, so that the model fits the transformed data.
§ How to choose a transformation?
• Theory
• Ladder of powers
• Polynomials
§ We illustrate with a few examples.


Example: using theory (cherry trees)
A tree trunk is a bit like a cylinder:
Volume = π × (diameter/2)² × height
log(volume) = log(π/4) + 2 log(diameter) + log(height)
so a linear regression using the logged variables should work! In fact R² increases from 95% to 98%, and the residual plots are better.
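The cylinder argument can be checked numerically. Below is a minimal sketch on simulated data (made-up dimension ranges, not the real cherry-tree measurements): generate volumes from the cylinder formula with a little multiplicative noise, then regress log(volume) on the logged predictors and recover slopes near the theoretical 2 and 1.

```python
import numpy as np

# Simulated check of the cylinder argument (made-up dimension ranges,
# not the real cherry-tree data).
rng = np.random.default_rng(0)
n = 31
diameter = rng.uniform(8, 20, size=n)    # hypothetical diameters
height = rng.uniform(60, 90, size=n)     # hypothetical heights
# Cylinder volume with small multiplicative noise
volume = np.pi * (diameter / 2) ** 2 * height * np.exp(rng.normal(0, 0.02, n))

# Regression log(volume) ~ log(diameter) + log(height)
X = np.column_stack([np.ones(n), np.log(diameter), np.log(height)])
beta, *_ = np.linalg.lstsq(X, np.log(volume), rcond=None)
print(beta)  # slopes should be close to the theoretical 2 and 1
```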


Example: cherry trees (cont.)

> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -6.63162    0.79979  -8.292 5.06e-09 ***
log(diameter)   1.98265    0.07501  26.432  < 2e-16 ***
log(height)     1.11712    0.20444   5.464 7.81e-06 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-Squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16

(Previously R² was 94.8%.)

Example: cherry trees (original) [residual plots]

Example: cherry trees (logs) [residual plots]


Tyre abrasion data: gam plots

rubber.gam = gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)


Tyre abrasion data: polynomial
The GAM curve is like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, …):

lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)

The 4 in poly(tensile, 4) is the degree of the polynomial. Usually a lot of trial and error is involved! We have succeeded when
• R² improves
• Residual plots show no pattern
A 4th degree polynomial works for the rubber data: R² increases from 84% to 94%.
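The trial-and-error comparison above can be sketched on simulated data (an assumed quartic curve, not the real rubber data): fit a straight line and a degree-4 polynomial to the same points and compare R².

```python
import numpy as np

# Trial-and-error polynomial fitting on simulated data (an assumed
# quartic curve, not the real rubber data).
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)               # stand-in for a centred predictor
y = x ** 4 - 8 * x + rng.normal(0, 1, 30)

def r_squared(x, y, degree):
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    return 1 - (y - fitted).var() / y.var()

r2_linear = r_squared(x, y, 1)
r2_quartic = r_squared(x, y, 4)
print(r2_linear, r2_quartic)             # the quartic fits markedly better
```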


Why 4th degree? Try 5th degree:

> rubber.lm = lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        615.3617    29.8178  20.637 2.44e-16 ***
poly(tensile, 5)1 -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2   23.6148    25.3437   0.932 0.361129
poly(tensile, 5)3  119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile, 5)4  -91.6951    23.6920  -3.870 0.000776 ***
poly(tensile, 5)5    9.3811    23.6684   0.396 0.695495
hardness            -6.2608     0.4199 -14.911 2.59e-13 ***

Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13

The 4th power is the highest significant power.


Ladder of powers
§ Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the "ladder of powers" (i.e. use y^p as the response rather than y, for some power p).
§ Choose p either by trial and error using R², or use a "Box-Cox plot" (see later in this lecture).
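A ladder-of-powers search can be sketched as follows, on simulated data where the square root of the response is linear in x (the candidate rungs below are illustrative): refit with the response raised to each power and keep the power giving the best R².

```python
import numpy as np

# Ladder-of-powers search on simulated data: sqrt(y) is linear in x,
# so the search should pick p = 0.5 (the rungs below are illustrative).
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 40)
y = (3 + 2 * x + rng.normal(0, 0.1, 40)) ** 2

def r_squared(x, resp):
    fitted = np.polyval(np.polyfit(x, resp, 1), x)
    return 1 - (resp - fitted).var() / resp.var()

ladder = [-1, -0.5, 0.5, 1, 2]           # candidate powers p
best_p = max(ladder, key=lambda p: r_squared(x, y ** p))
print(best_p)
```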


Checking for equal scatter
§ The model specifies that the scatter about the regression plane is uniform.
§ In practice this means the scatter doesn't depend on the explanatory variables or on the mean of the response.
§ All tests and confidence intervals rely on this.


Scatter
§ Scatter is measured by the size of the residuals.
§ A common problem is where the scatter increases as the mean response increases.
§ This means the big residuals happen when the fitted values are big.
§ Recognize this by a "funnel effect" in the residuals-versus-fitted-values plot.
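A quick numerical sketch of the funnel effect (simulated data with scatter proportional to the mean, not a real dataset): split the residuals at the median fitted value and compare spreads; the upper half scatters more.

```python
import numpy as np

# Simulated funnel effect: scatter proportional to the mean response.
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
mean = 5 + 3 * x
y = mean + rng.normal(0, 0.3 * mean)     # sd grows with the mean

fitted = np.polyval(np.polyfit(x, y, 1), x)
resid = y - fitted
low = resid[fitted < np.median(fitted)]
high = resid[fitted >= np.median(fitted)]
print(low.std(), high.std())             # the upper half scatters more
```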


Example: education expenditure data
§ Data for the 50 states of the USA.
§ Variables are
• Per capita expenditure on education (response), variable educ
• Per capita income, variable percap
• Number of residents per 1000 under 18, variable under18
• Number of residents per 1000 in urban areas, variable urban
§ Fit the model educ ~ percap + under18 + urban.

[Plot of the response: an outlier is visible]

[Diagnostic plots: the outlier is point 50 (California)]


Basic fit, outlier in

> educ.lm = lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***

Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09

R² is 59%.


Basic fit, outlier out (note how subset = -50 excludes point 50)

> educ50.lm = lm(educ ~ urban + percap + under18, data = educ.df, subset = -50)
> summary(educ50.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

R² is now 49%.


> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))

The residuals-versus-fitted plot shows a funnel effect; the scale-location plot shows an increasing relationship.


Remedies
§ Either transform the response,
§ or estimate the variances of the observations and use "weighted least squares".


Transforming the response

> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18, data = educ.df[-50, ])
> plot(tr.educ50.lm)

Transforming to the reciprocal gives better residual plots!


What power to choose?
§ How did we know to use reciprocals?
§ Think of a more general model I(educ^p) ~ percap + under18 + urban, where p is some power.
§ Then estimate p from the data using a Box-Cox plot.


Transforming the response (how?)

boxcoxplot(educ ~ urban + percap + under18, educ.df[-50, ])

This R330 function draws a "Box-Cox plot". The minimum is at about p = -1, suggesting the reciprocal transformation.
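For readers without the R330 package, the standard Box-Cox profile log-likelihood can be computed directly. This is a sketch of the usual formulation (not the boxcoxplot function itself), on simulated data in which 1/y is linear in x, so the profile should peak near λ = -1.

```python
import numpy as np

# Standard Box-Cox profile log-likelihood (a sketch, not the R330
# boxcoxplot function). Simulated data: 1/y is linear in x, so the
# profile should peak near lambda = -1.
rng = np.random.default_rng(4)
x = np.linspace(1, 5, 100)
inv_y = 2 + 0.5 * x + rng.normal(0, 0.03, 100)
y = 1 / inv_y

def profile_loglik(lam, x, y):
    n = len(y)
    # Box-Cox transform (log when lambda is essentially zero)
    z = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1) / lam
    resid = z - np.polyval(np.polyfit(x, z, 1), x)
    rss = np.sum(resid ** 2)
    # Gaussian profile log-likelihood plus the Jacobian of the transform
    return -n / 2 * np.log(rss / n) + (lam - 1) * np.sum(np.log(y))

lams = np.linspace(-2, 2, 81)
best = lams[np.argmax([profile_loglik(l, x, y) for l in lams])]
print(best)
```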


Weighted least squares
§ Tests are invalid if observations do not have constant variance.
§ If the ith observation has variance v_i σ², then we can get a valid test by using "weighted least squares", minimising the sum of the weighted squared residuals Σ r_i²/v_i rather than the sum of squared residuals Σ r_i².
§ Need to know the variances v_i.
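The weighted criterion has a closed form: scaling each observation by 1/√v_i turns the problem back into ordinary least squares, giving β = (XᵀWX)⁻¹XᵀWy with W = diag(1/v_i). A minimal sketch with known (simulated) variances:

```python
import numpy as np

# Weighted least squares with known (simulated) variances v_i:
# beta = (X' W X)^{-1} X' W y with W = diag(1/v_i).
rng = np.random.default_rng(5)
x = np.linspace(1, 10, 100)
v = (0.3 * x) ** 2                       # variance of the ith observation
y = 1 + 2 * x + rng.normal(0, np.sqrt(v))

X = np.column_stack([np.ones_like(x), x])
W = np.diag(1 / v)                       # weights = reciprocal variances
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_wls, beta_ols)                # both near (1, 2); WLS is the efficient one
```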


Finding the weights
§ Step 1: Plot the squared residuals versus the fitted values.
§ Step 2: Smooth the plot.
§ Step 3: Estimate the variance of an observation by the smoothed squared residual.
§ Step 4: The weight is the reciprocal of the smoothed squared residual.
§ Rationale: the variance is a function of the mean.
§ Use the R330 function funnel.
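The four steps can be sketched with a simple moving-average smoother (an assumption for illustration; the R330 funnel function uses its own smoothing method):

```python
import numpy as np

# Steps 1-4 with a simple moving-average smoother (an assumption; the
# R330 funnel function uses its own smoothing method).
rng = np.random.default_rng(6)
x = np.linspace(1, 10, 120)
y = 1 + 2 * x + rng.normal(0, x)         # scatter grows with the mean

fitted = np.polyval(np.polyfit(x, y, 1), x)
resid = y - fitted
order = np.argsort(fitted)               # Step 1: squared residuals vs fitted
sq = resid[order] ** 2

kernel = np.ones(25) / 25                # Steps 2-3: smooth -> variance estimate
smoothed = np.convolve(sq, kernel, mode="same")

weights = 1 / smoothed                   # Step 4: reciprocal of smoothed value
print(weights[:3], weights[-3:])         # weights shrink as the fitted values grow
```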


Doing it in R (1)

vars = funnel(educ50.lm)   # an R330 function

The slope is about 1.7, indicating p = 1 - 1.7 = -0.7.


> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50, ])
> summary(educ50.lm)

Note the p-values:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-Squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07


> weighted.lm <- lm(educ ~ urban + percap + under18, weights = 1/vars, data = educ.df[-50, ])
> summary(weighted.lm)

Note the reciprocal weights, and note how the p-values change!

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10

Conclusion: unequal variances matter! They can change results.