STATS 330: Lecture 10, 9/15/2020 (28 slides)
Diagnostics 2

Aim of today's lecture:
§ To describe some more remedies for non-planar data
§ To look at diagnostics and remedies for non-constant scatter
Remedies for non-planar data (cont.)
§ Last time we looked at diagnostics for non-planar data
§ We discussed what to do if the diagnostics indicate a problem
§ The short answer was: we transform, so that the model fits the transformed data
§ How to choose a transformation?
  • Theory
  • Ladder of powers
  • Polynomials
§ We illustrate with a few examples
Example: using theory (cherry trees)
A tree trunk is a bit like a cylinder:
    Volume = π × (diameter/2)² × height
so
    log(Volume) = log(π/4) + 2 log(diameter) + log(height)
and a linear regression using the logged variables should work!
In fact R² increases from 95% to 98%, and the residual plots are better.
Example: cherry trees (cont.)
> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -6.63162    0.79979  -8.292 5.06e-09 ***
log(diameter)   1.98265    0.07501  26.432  < 2e-16 ***
log(height)     1.11712    0.20444   5.464 7.81e-06 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16

(Previously R-squared was 94.8%.)
Example: cherry trees (original)
Example: cherry trees (logs)
Tyre abrasion data: gam plots
rubber.gam <- gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)
Tyre abrasion data: polynomial
The GAM curve is like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, …):
    lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)
Here 4 is the degree of the polynomial.
Usually a lot of trial and error is involved! We have succeeded when
• R² improves
• Residual plots show no pattern
A 4th-degree polynomial works for the rubber data: R² increases from 84% to 94%.
Why 4th degree? Try 5th degree:
> rubber.lm <- lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         615.3617    29.8178  20.637 2.44e-16 ***
poly(tensile, 5)1  -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2    23.6148    25.3437   0.932 0.361129
poly(tensile, 5)3   119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile, 5)4   -91.6951    23.6920  -3.870 0.000776 ***  <- highest significant power
poly(tensile, 5)5     9.3811    23.6684   0.396 0.695495
hardness             -6.2608     0.4199 -14.911 2.59e-13 ***

Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13
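The same comparison can be made formally with a nested F-test, since poly() fits are nested in the degree. A minimal sketch, assuming rubber.df is loaded:

```
# Compare degree-4 and degree-5 polynomial fits with a nested F-test
fit4 <- lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)
fit5 <- lm(abloss ~ hardness + poly(tensile, 5), data = rubber.df)
anova(fit4, fit5)  # a non-significant F means the degree-5 term adds nothing: keep degree 4
```

This reproduces the conclusion from the t-test on the degree-5 coefficient above.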
Ladder of powers
§ Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the "ladder of powers" (i.e. use y^p as the response rather than y, for some power p)
§ Choose p either by trial and error using R², or use a "Box-Cox plot" (see later in this lecture)
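The trial-and-error search over the ladder can be automated by fitting the model for each candidate power and comparing R². A sketch, where my.df, y and x are hypothetical placeholders for your own data frame and variables:

```
# Try a ladder of powers for the response and compare R-squared
# (p = 0 corresponds to log(y) and would be handled separately)
powers <- c(-2, -1, -0.5, 0.5, 1, 2)
r2 <- sapply(powers, function(p) {
  summary(lm(I(y^p) ~ x, data = my.df))$r.squared
})
names(r2) <- powers
r2  # pick the power giving good R-squared and clean residual plots
```

R² alone is not decisive; the residual plots for the chosen power must look patternless too.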
Checking for equal scatter
§ The model specifies that the scatter about the regression plane is uniform
§ In practice this means that the scatter doesn't depend on the explanatory variables or on the mean of the response
§ All tests and confidence intervals rely on this
Scatter
§ Scatter is measured by the size of the residuals
§ A common problem is where the scatter increases as the mean response increases
§ This means the big residuals happen when the fitted values are big
§ Recognize this by a "funnel effect" in the residuals-versus-fitted-values plot
Example: education expenditure data
§ Data for the 50 states of the USA
§ Variables are
  • Per capita expenditure on education (response), variable educ
  • Per capita income, variable percap
  • Number of residents per 1000 under 18, variable under18
  • Number of residents per 1000 in urban areas, variable urban
§ Fit model educ ~ percap + under18 + urban
Outlier in the response!
Outlier: pt 50 (California)
Basic fit, outlier in
> educ.lm <- lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***

Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09

R² is 59%.
Basic fit, outlier out
> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df,
                  subset = -50)   # see how we exclude pt 50
> summary(educ50.lm)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

R² is now 49%.
> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))
The plots show a funnel effect: the scatter has an increasing relationship with the fitted values.
Remedies
§ Either: transform the response
§ Or: estimate the variances of the observations and use "weighted least squares"
Transforming the response
> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18,
                     data = educ.df[-50, ])   # transform to reciprocal
> plot(tr.educ50.lm)
Better!
What power to choose?
§ How did we know to use reciprocals?
§ Think of a more general model I(educ^p) ~ percap + under18 + urban, where p is some power
§ Then estimate p from the data using a Box-Cox plot
Transforming the response (how?)
boxcoxplot(educ ~ urban + percap + under18, educ.df[-50, ])   # an "R330" function; draws the Box-Cox plot
The minimum is at about p = -1, suggesting the reciprocal transformation.
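The same idea is available in the standard MASS package: its boxcox function plots the profile log-likelihood for the power λ, so there you look for the maximum rather than a minimum. A sketch, assuming educ.df is loaded:

```
library(MASS)
# Profile log-likelihood for the power in educ^lambda (lambda = 0 means log)
boxcox(educ ~ urban + percap + under18, data = educ.df[-50, ],
       lambda = seq(-2, 1, 0.1))
# The maximising lambda (and its confidence interval) is read off the plot;
# a value near -1 points to the reciprocal transformation
```

Either plot leads to the same choice of power for these data.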
Weighted least squares
§ Tests are invalid if observations do not have constant variance
§ If the ith observation has variance v_i σ², then we can get valid tests by using "weighted least squares": minimise the sum of weighted squared residuals Σ r_i²/v_i rather than the sum of squared residuals Σ r_i²
§ Need to know the variances v_i
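Minimising Σ r_i²/v_i is the same as ordinary least squares after dividing each observation by √v_i, which is why lm's weights argument takes 1/v_i. An illustrative sketch with made-up data (all names here are hypothetical):

```
# WLS with weights 1/v is OLS on the data scaled by 1/sqrt(v)
set.seed(1)
x <- 1:20
v <- x^2                                  # variance grows with the mean
y <- 2 + 3 * x + rnorm(20, sd = sqrt(v))  # simulated funnel-shaped scatter
wls <- lm(y ~ x, weights = 1/v)
ols <- lm(I(y/sqrt(v)) ~ I(1/sqrt(v)) + I(x/sqrt(v)) - 1)  # scaled columns, no intercept
cbind(coef(wls), coef(ols))               # the two sets of coefficients agree
```

This makes clear that WLS simply down-weights the noisy observations rather than discarding them.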
Finding the weights
§ Step 1: Plot the squared residuals versus the fitted values
§ Step 2: Smooth the plot
§ Step 3: Estimate the variance of an observation by the smoothed squared residual
§ Step 4: The weight is the reciprocal of the smoothed squared residual
§ Rationale: the variance is a function of the mean
§ Use the "R330" function funnel
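The four steps can be sketched by hand; this is only an illustration of the idea, which the R330 funnel function packages up:

```
# Hand-rolled version of the variance-estimation steps
# (assumes educ50.lm has been fitted as on the earlier slides)
fit <- fitted(educ50.lm)
rsq <- residuals(educ50.lm)^2              # Step 1: squared residuals vs fitted
sm  <- lowess(fit, rsq)                    # Step 2: smooth the plot
vars <- approx(sm$x, sm$y, xout = fit)$y   # Step 3: smoothed squared residual = variance estimate
w <- 1 / vars                              # Step 4: weight = reciprocal of estimated variance
```

The vector w can then be passed to lm via the weights argument, as on the next slides.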
Doing it in R (1)
vars <- funnel(educ50.lm)   # an "R330" function
The slope is about 1.7, indicating p = 1 - 1.7 = -0.7.
> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50, ])
> summary(educ50.lm)

Note the p-values:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07
> weighted.lm <- lm(educ ~ urban + percap + under18,
                    weights = 1/vars,   # note the reciprocals!
                    data = educ.df[-50, ])
> summary(weighted.lm)

Note the changes!

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10

Conclusion: unequal variances matter! They can change results!