Chapter 23: Inference for Regression (BPS, 5th Ed.)


Linear Regression (from Chapter 5)
• Objective: To quantify the linear relationship between an explanatory variable (x) and a response variable (y).
• We can then predict the average response for all subjects with a given value of the explanatory variable.

Case Study: Crying and IQ
Karelitz, S. et al., "Relation of crying activity in early infancy to speech and intellectual development at age three years," Child Development, 35 (1964), pp. 769-777.
Researchers explored the crying of infants four to ten days old and their IQ test scores at age three to determine whether more crying was a sign of higher IQ.

Case Study: Crying and IQ (data collection)
• Data collected on 38 infants
• Snap of a rubber band on the foot caused infants to cry
  – recorded the number of peaks in the most active 20 seconds of crying (explanatory variable x)
• Measured IQ score at age three years using the Stanford-Binet IQ test (response variable y)

Case Study: Crying and IQ (data)
Table 23.1

Case Study: Crying and IQ (data analysis)
Scatterplot of y vs. x shows a moderate positive linear relationship, with no extreme outliers or potentially influential observations.

Case Study: Crying and IQ (data analysis)
• Correlation between crying and IQ is r = 0.455 (from the formula in Chapter 4)
• Least-squares regression line for predicting IQ from crying is ŷ = 91.27 + 1.493x (as in Ch. 5)
• R² = 0.207, so 21% of the variation in IQ scores is explained by crying intensity

Inference
• We now want to extend our analysis to include inferences on various components involved in the regression analysis
  – slope
  – intercept
  – correlation
  – predictions

Regression Model, Assumptions
• Conditions required for inference about regression (we have n observations on an explanatory variable x and a response variable y):
  1. For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
  2. The mean response µy has a straight-line relationship with x: µy = α + βx. The slope β and intercept α are unknown parameters.
  3. The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.

Regression Model, Assumptions
• The regression model has three parameters: α, β, and σ
• The true regression line µy = α + βx says that the mean response µy moves along a straight line as x changes (we cannot observe the true regression line; instead we observe y for various values of x)
• The observed values of y vary about their means µy according to a Normal distribution (if we take many y observations at a fixed value of x, the Normal pattern will appear for these y values)

Regression Model, Assumptions
• The standard deviation σ is the same for all values of x, meaning the Normal distributions for y have the same spread at each value of x
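The three model conditions can be made concrete with a small simulation. The sketch below is illustrative only: it assumes stand-in values for the unknown parameters α, β, and σ (the case-study estimates a = 91.27, b = 1.493, s = 17.50 are plugged in), and the function name is hypothetical.

```python
import random
import statistics

# Assumed stand-ins for the unknown parameters alpha, beta, sigma.
# The case-study estimates are used here purely for illustration.
ALPHA, BETA, SIGMA = 91.27, 1.493, 17.50

def simulate_responses(x, n_draws, rng):
    """Draw n_draws independent responses y at a fixed value of x.

    Each y is Normal with mean alpha + beta * x and standard deviation sigma.
    """
    mean_response = ALPHA + BETA * x
    return [rng.gauss(mean_response, SIGMA) for _ in range(n_draws)]

rng = random.Random(23)  # fixed seed so the sketch is repeatable
ys_at_10 = simulate_responses(10, 20_000, rng)
ys_at_20 = simulate_responses(20, 20_000, rng)

# The sample means sit near the line alpha + beta * x, and the
# sample standard deviations are about the same at both x values.
print(statistics.mean(ys_at_10), statistics.stdev(ys_at_10))
print(statistics.mean(ys_at_20), statistics.stdev(ys_at_20))
```

With many draws at each x, the means track the true regression line while the spread stays roughly constant, which is what conditions 1 to 3 describe.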

Estimating Parameters: Slope and Intercept
When using the least-squares regression line ŷ = a + bx, the slope b is an unbiased estimator of the true slope β, and the intercept a is an unbiased estimator of the true intercept α.

Estimating Parameters: Standard Deviation
• The standard deviation σ describes the variability of the response y about the true regression line
• A residual is the difference between an observed value of y and the value predicted by the least-squares regression line: residual = y − ŷ
• The standard deviation σ is estimated with a sample standard deviation of the residuals (this is a standard error since it is estimated from data)

Estimating Parameters: Standard Deviation
The regression standard error is the square root of the sum of squared residuals divided by their degrees of freedom (n − 2):
s = √( Σ(y − ŷ)² / (n − 2) )
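A minimal sketch of this calculation follows. It fits the least-squares line, forms the residuals y − ŷ, and divides the sum of squared residuals by n − 2 before taking the square root. The five data points are hypothetical toy values (Table 23.1 is not reproduced here), and the function names are illustrative.

```python
import math

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def regression_standard_error(xs, ys):
    """Square root of (sum of squared residuals) / (n - 2)."""
    a, b = least_squares(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))

# Hypothetical toy data, NOT the crying/IQ data.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

a, b = least_squares(toy_x, toy_y)
s = regression_standard_error(toy_x, toy_y)
print(a, b, s)
```

Dividing by n − 2 rather than n reflects the two parameters (slope and intercept) already estimated from the same data.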

Case Study: Crying and IQ
• Since ŷ = 91.27 + 1.493x, b = 1.493 is an unbiased estimator of the true slope β, and a = 91.27 is an unbiased estimator of the true intercept α
  – because the slope b = 1.493, we estimate that on average IQ is about 1.5 points higher for each added crying peak
• The regression standard error is s = 17.50
  – see pages 600-601 in the text for this calculation
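As a small illustration of how the fitted line is used, the sketch below evaluates ŷ = 91.27 + 1.493x at a few crying counts. The input values 10, 11, and 20 are hypothetical, chosen only to show that each added peak raises the predicted IQ by the slope.

```python
# Fitted least-squares line from the case study: y-hat = a + b * x
A_INTERCEPT = 91.27
B_SLOPE = 1.493

def predicted_iq(crying_peaks):
    """Predicted IQ at age three for a given number of crying peaks."""
    return A_INTERCEPT + B_SLOPE * crying_peaks

# Hypothetical inputs for illustration only.
for peaks in (10, 11, 20):
    print(peaks, round(predicted_iq(peaks), 2))

# One extra peak changes the prediction by exactly the slope b.
print(round(predicted_iq(11) - predicted_iq(10), 3))
```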

Hypothesis Tests for Slope (test for no linear relationship)
• The most common hypothesis to test regarding the slope is that it is zero: H0: β = 0
  – says the regression line is horizontal (the mean of y does not change with x)
  – no true linear relationship between x and y
  – the straight-line dependence on x is of no value for predicting y
• Standardize b to get a t test statistic: t = b / SE_b, where SE_b is the standard error of b

Hypothesis Tests for Slope
• The standard error of b is a multiple of the regression standard error: SE_b = s / √( Σ(x − x̄)² )
• Test statistic for H0: β = 0: t = b / SE_b
  – follows the t distribution with df = n − 2

Hypothesis Tests for Slope
• P-value [for T ~ t(n − 2) distribution]:
  Ha: β > 0: P-value = P(T ≥ t)
  Ha: β < 0: P-value = P(T ≤ t)
  Ha: β ≠ 0: P-value = 2 P(T ≥ |t|)

Case Study: Crying and IQ (hypothesis test for slope)
t = b / SE_b = 1.4929 / 0.4870 = 3.07
The P-value is small, so there is a significant linear relationship.
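The test can be sketched numerically using only the summary values on this slide (b = 1.4929, SE_b = 0.4870, n = 38). The tail-area helper below is an assumption of this sketch: it integrates the t density with Simpson's rule so that no external library is needed, whereas in practice a statistics package or Table C would supply the P-value.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for T ~ t(df), using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    tail = 0.5 - area
    return tail if t >= 0 else 1.0 - tail

b, se_b, n = 1.4929, 0.4870, 38
df = n - 2
t_stat = b / se_b

p_one_sided = t_upper_tail(t_stat, df)           # Ha: beta > 0
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)  # Ha: beta != 0
print(round(t_stat, 2), p_one_sided, p_two_sided)
```

The two-sided P-value from this sketch should fall in the 0.002 to 0.005 range that the slides later report from Table E.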

Test for Correlation
• The correlation between x and y is closely related to the slope (for both the population and the observed data)
  – in particular, the correlation is 0 exactly when the slope is 0
• Therefore, testing H0: β = 0 is equivalent to testing that there is no correlation between x and y in the population from which the data were drawn

Test for Correlation
• The t-score simplifies to: t = r √(n − 2) / √(1 − r²)
• Degrees of freedom: n − 2
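As a quick check on the equivalence of the two tests, the sketch below computes the t-score from the sample correlation using the standard form t = r √(n − 2) / √(1 − r²) and compares it with b / SE_b from the case study. The function name is illustrative.

```python
import math

def correlation_t(r, n):
    """t-score for H0: population correlation = 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t_from_r = correlation_t(0.455, 38)   # case-study r and n
t_from_slope = 1.4929 / 0.4870        # case-study b / SE_b

# Both round to 3.07; small differences come from rounding r, b, and SE_b.
print(round(t_from_r, 2), round(t_from_slope, 2))
```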

Test for Correlation
• There is also a test for correlation that does not require a regression analysis
  – Table E on page 695 of the text gives critical values and upper tail probabilities for the sample correlation r under the null hypothesis that the correlation is 0 in the population
    ◦ look up n and r in the table (if r is negative, look up its positive value), and read off the associated probability from the top margin of the table to obtain the P-value, just as is done for the t table (Table C)

Case Study: Crying and IQ (test for H0: correlation = 0)
• Correlation between crying and IQ is r = 0.455
• Sample size is n = 38
• From Table E: for Ha: correlation > 0, the P-value is between 0.001 and 0.0025 (using n = 40)
  – P-value for the two-sided test is between 0.002 and 0.005 (matches the two-sided P-value for the test on the slope)
  – the one-sided P-value would be between 0.005 and 0.01 if we were very conservative and used n = 30

Confidence Interval for Slope
• A level C confidence interval for the true slope β is b ± t* SE_b
  – t* is the critical value for the t distribution with df = n − 2 degrees of freedom that has area (1 − C)/2 to the right of it
  – recall, the standard error of b is a multiple of the regression standard error: SE_b = s / √( Σ(x − x̄)² )

Case Study: Crying and IQ (confidence interval for slope)
b = 1.4929, SE_b = 0.4870, df = n − 2 = 38 − 2 = 36
(df = 36 is not in Table C, so use the next smaller df = 30)
For a 95% C.I., (1 − C)/2 = 0.025, and t* = 2.042
So a 95% C.I. for the true slope β is:
b ± t* SE_b = 1.4929 ± 2.042(0.4870) = 1.4929 ± 0.9944 = 0.4985 to 2.4873
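The arithmetic on this slide can be reproduced in a few lines. This is a sketch using the slide's own values, with t* = 2.042 taken from Table C at the conservative df = 30; the function name is illustrative.

```python
def slope_confidence_interval(b, se_b, t_star):
    """Return (lower, upper) for the interval b +/- t_star * se_b."""
    margin = t_star * se_b
    return b - margin, b + margin

lower, upper = slope_confidence_interval(b=1.4929, se_b=0.4870, t_star=2.042)
print(round(lower, 3), round(upper, 3))
```

Because the whole interval lies above zero, it agrees with the significant result of the two-sided test of H0: β = 0 at the 5% level.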

Inference about Prediction
• Once a regression line is fit to the data, it is useful to obtain a prediction of the response for a particular value of the explanatory variable (x*); this is done by substituting the value of x* for x in the equation of the line (ŷ = a + bx) to calculate the predicted value ŷ
• We now present confidence intervals that describe how accurate this prediction is

Inference about Prediction
• There are two types of predictions
  – predicting the mean response of all subjects with a certain value x* of the explanatory variable
  – predicting the individual response for one subject with a certain value x* of the explanatory variable
• Predicted values (ŷ) are the same for each case, but the margin of error is different

Inference about Prediction
• To estimate the mean response µy, use an ordinary confidence interval for the parameter µy = α + βx*
  – µy is the mean of responses y when x = x*
  – 95% confidence interval: in repeated samples of n observations, 95% of the confidence intervals calculated (at x*) from these samples will contain the true value of µy at x*

Inference about Prediction
• To estimate an individual response y, use a prediction interval
  – estimates a single random response y rather than a parameter like µy
  – 95% prediction interval: take an observation on y for each of the n values of x in the original data, then take one more observation y at x = x*; the prediction interval from the n observations will cover the one additional y in 95% of all repetitions

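Inference about Prediction
• Level C confidence interval for the mean response µy at x = x*: ŷ ± t* SE_µ̂, where SE_µ̂ = s √( 1/n + (x* − x̄)² / Σ(x − x̄)² )
• Level C prediction interval for an individual response y at x = x*: ŷ ± t* SE_ŷ, where SE_ŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(x − x̄)² )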

Inference about Prediction
• Both the confidence interval and the prediction interval have the same form: ŷ ± t* SE
  – both t* values have df = n − 2
  – the standard errors (SE) differ for the two intervals (formulas on previous slide)
    ◦ the prediction interval is wider than the confidence interval
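The difference between the two intervals can be sketched in code. The example assumes the standard simple-regression standard errors, SE for the mean response = s √(1/n + (x* − x̄)²/Σ(x − x̄)²) and SE for an individual response = s √(1 + 1/n + (x* − x̄)²/Σ(x − x̄)²). The data are hypothetical toy values, the function name is illustrative, and t* = 3.182 is the tabled 95% critical value for df = 3.

```python
import math

def interval_standard_errors(xs, ys, x_star):
    """Return (y_hat, se_mean, se_pred) at x = x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Regression standard error: sqrt(sum of squared residuals / (n - 2))
    s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_hat = a + b * x_star
    distance_term = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(distance_term)      # for the mean response
    se_pred = s * math.sqrt(1 + distance_term)  # for one individual response
    return y_hat, se_mean, se_pred

# Hypothetical toy data, NOT the crying/IQ data.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
t_star = 3.182  # 95% critical value for df = n - 2 = 3

y_hat, se_mean, se_pred = interval_standard_errors(toy_x, toy_y, x_star=3)
conf_interval = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pred_interval = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
print(conf_interval, pred_interval)
```

Both intervals are centered at the same ŷ, but the extra 1 under the square root accounts for the variation of a single response about its mean, so the prediction interval is always the wider of the two.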

Residual Plots
x = number of beers, y = blood alcohol
Stemplot of residuals (4|1 = .041), close to Normal:
  -2 | 731
  -1 | 871
  -0 | 91
   0 | 5578
   1 | 1
   2 | 39
   3 |
   4 | 1
Residual plot: roughly linear relationship; spread is even across the entire data range ('random' scatter about zero)

Residual Plots
'x' = collection of explanatory variables, y = salary of player
Standard deviation is not constant everywhere (more variation among players with higher salaries)

Residual Plots
x = number of years, y = logarithm of salary of player
A clear curved pattern: the relationship is not linear

Checking Assumptions
• Independent observations
  – no repeated observations on the same individual
• True relationship is linear
  – look at the scatterplot to check the overall pattern
  – a plot of residuals against x magnifies any unusual pattern (should see 'random' scatter about zero)

Checking Assumptions
• Constant standard deviation σ of the response at all x values
  – scatterplot: the spread of data points about the regression line should be similar over the entire range of the data
  – easier to see with a plot of residuals against x, with a horizontal line drawn at zero (should see 'random' scatter about zero) (or plot residuals against ŷ for linear regression)

Checking Assumptions
• Response y varies Normally about the true regression line
  – residuals estimate the deviations of the response from the true regression line, so they should follow a Normal distribution
    ◦ make a histogram or stemplot of the residuals and check for clear skewness or other departures from Normality
  – numerous methods for carefully checking Normality exist (talk to a statistician!)
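As a rough numerical companion to the graphical checks, the sketch below summarizes the 16 residuals read off the stemplot on the first Residual Plots slide (beers and blood alcohol, key 4|1 = .041). The values are rounded as in the stemplot, and the moment-based skewness used here is one simple choice assumed for this sketch, not a method prescribed by the slides.

```python
import statistics

# Residuals read from the stemplot (rounded to three decimals).
residuals = [
    -0.027, -0.023, -0.021,   # stem -2
    -0.018, -0.017, -0.011,   # stem -1
    -0.009, -0.001,           # stem -0
    0.005, 0.005, 0.007, 0.008,  # stem 0
    0.011,                    # stem 1
    0.023, 0.029,             # stem 2
    0.041,                    # stem 4
]

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Least-squares residuals average to zero (up to rounding).
mean_residual = statistics.mean(residuals)
skewness = sample_skewness(residuals)
print(mean_residual, skewness)
```

A number like this is only a supplement: with 16 observations, the stemplot and residual plot remain the primary tools for judging Normality and constant spread.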