
Linear regression models

Simple Linear Regression

History
• Developed by Sir Francis Galton (1822–1911) in his article “Regression towards mediocrity in hereditary stature”

Purposes:
• To describe the linear relationship between two continuous variables, the response variable (y-axis) and a single predictor variable (x-axis)
• To determine how much of the variation in Y can be explained by the linear relationship with X and how much remains unexplained
• To predict new values of Y from new values of X

The linear regression model is: Yi = β0 + β1·Xi + εi
• Xi and Yi are paired observations (i = 1 to n)
• β0 = population intercept (the value of Yi when Xi = 0)
• β1 = population slope (measures the change in Yi per unit change in Xi)
• εi = the random or unexplained error associated with the i-th observation. The εi are assumed to be independent and distributed as N(0, σ²).
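As a minimal sketch, data can be simulated from this model in Python; the parameter values β0 = 1.0, β1 = 0.5, σ = 0.3 and the range of X are arbitrary illustrative choices, not from the slides:

    import numpy as np

    rng = np.random.default_rng(42)

    n = 50
    beta0, beta1, sigma = 1.0, 0.5, 0.3   # illustrative population parameters

    x = rng.uniform(0.0, 10.0, size=n)    # predictor values, measured without error
    eps = rng.normal(0.0, sigma, size=n)  # errors: independent, N(0, sigma^2)
    y = beta0 + beta1 * x + eps           # response generated by the linear model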

Linear relationship
[Figure: a straight line with intercept β0 (the value of Y at X = 0) and slope β1 (the rise in Y per 1.0 unit increase in X).]

Linear models approximate non-linear functions over a limited domain
[Figure: a curve with a fitted line; interpolation within the sampled range of X, extrapolation beyond it at either end.]

• For a given value of X, the sampled Y values are independent with normally distributed errors:
Yi = β0 + β1·Xi + εi,  εi ~ N(0, σ²),  E(εi) = 0,  so E(Yi) = β0 + β1·Xi
[Figure: the regression line with normal error distributions centered on E(Y1) and E(Y2) at X1 and X2.]

Fitting data to a linear model
The residual for the i-th observation is the difference between the observed and fitted value: εi = Yi − Ŷi
[Figure: an observed point Yi at Xi, the fitted value Ŷi on the line, and the residual between them.]

The residual sum of squares: SSresidual = Σ(Yi − Ŷi)²

Estimating Regression Parameters
• The “best fit” estimates of the population regression parameters (β0 and β1) are the values that minimize the residual sum of squares (SSresidual) between each observed value and the value predicted by the model.

Sum of squares: SSX = Σ(Xi − X̄)²
Sum of cross products: SSXY = Σ(Xi − X̄)(Yi − Ȳ)

Least-squares parameter estimate of the slope: b1 = SSXY / SSX = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)², where X̄ and Ȳ are the sample means of X and Y

Sample variance of X: s²X = Σ(Xi − X̄)² / (n − 1)
Sample covariance: sXY = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)
Equivalently, b1 = sXY / s²X.

Solving for the intercept: b0 = Ȳ − b1·X̄
Thus, our estimated regression equation is: Ŷi = b0 + b1·Xi
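A sketch of these least-squares estimates computed directly from the sums of squares and cross products, continuing from the simulated x and y above (variable names are illustrative):

    x_bar, y_bar = x.mean(), y.mean()

    ss_x = np.sum((x - x_bar) ** 2)               # sum of squares of X
    ss_xy = np.sum((x - x_bar) * (y - y_bar))     # sum of cross products

    b1 = ss_xy / ss_x           # least-squares slope estimate
    b0 = y_bar - b1 * x_bar     # least-squares intercept estimate

    y_hat = b0 + b1 * x                    # fitted values
    residuals = y - y_hat                  # residuals
    ss_residual = np.sum(residuals ** 2)   # residual sum of squares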

Hypothesis Tests with Regression
• The null hypothesis is that there is no linear relationship between X and Y:
H0: β1 = 0, so Yi = β0 + εi
HA: β1 ≠ 0, so Yi = β0 + β1·Xi + εi
• We can use an F-ratio (i.e., the ratio of variances) to test these hypotheses

Variance of the error of regression: s² = SSresidual / (n − 2)
NOTE: this is also referred to as the residual variance, mean squared error (MSE) or residual mean square (MSresidual)

Mean square of regression: MSregression = SSregression / 1 (one degree of freedom for a single predictor)
The F-ratio is: F = MSregression / MSresidual
This ratio follows the F-distribution with (1, n − 2) degrees of freedom
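A sketch of this F-test in Python, using the quantities computed above and scipy for the F-distribution tail probability:

    from scipy import stats

    ss_regression = np.sum((y_hat - y_bar) ** 2)

    ms_regression = ss_regression / 1       # regression mean square (1 df)
    ms_residual = ss_residual / (n - 2)     # residual mean square (MSE)

    F = ms_regression / ms_residual
    p_value = stats.f.sf(F, 1, n - 2)       # upper-tail area of F(1, n - 2)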

Variance components and Coefficient of determination
The total variation in Y partitions into explained and unexplained components: SStotal = SSregression + SSresidual

Coefficient of determination: r² = SSregression / SStotal = 1 − SSresidual / SStotal, the proportion of the variation in Y explained by the linear relationship with X
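Continuing the sketch above, r² follows in two lines:

    ss_total = np.sum((y - y_bar) ** 2)     # = ss_regression + ss_residual

    r_squared = ss_regression / ss_total    # proportion of variation explained
    # equivalently: r_squared = 1 - ss_residual / ss_total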

ANOVA table for regression

Source      Degrees of freedom   Sum of squares   Mean square            Expected mean square    F ratio
Regression  1                    Σ(Ŷi − Ȳ)²       SSregression / 1       σ² + β1²·Σ(Xi − X̄)²    MSregression / MSresidual
Residual    n − 2                Σ(Yi − Ŷi)²      SSresidual / (n − 2)   σ²
Total       n − 1                Σ(Yi − Ȳ)²

Product-moment correlation coefficient: r = SSXY / √(SSX·SSY), the square root of r² carrying the sign of the slope

Parametric Confidence Intervals
• If we assume our parameter of interest has a particular sampling distribution and we have estimated its expected value and variance, we can construct a confidence interval for a given percentile.
• Example: if we assume Y is a normal random variable with unknown mean μ and variance σ², then Z = (Ȳ − μ)/(σ/√n) is distributed as a standard normal variable. But, since we don’t know σ, we must divide by the standard error instead: t = (Ȳ − μ)/(s/√n), giving us a t-distribution with (n − 1) degrees of freedom.
• The 100(1 − α)% confidence interval for μ is then given by: Ȳ ± t(α/2, n−1)·s/√n
• IMPORTANT: this does not mean “There is a 100(1 − α)% chance that the true population mean μ occurs inside this interval.” It means that if we were to repeatedly sample the population in the same way, 100(1 − α)% of the confidence intervals would contain the true population mean μ.
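A sketch of this t-based interval for the mean of the simulated y, assuming scipy; alpha = 0.05 (a 95% interval) is an illustrative choice:

    from scipy import stats

    alpha = 0.05                                   # for a 95% interval
    se = y.std(ddof=1) / np.sqrt(n)                # standard error of the mean
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided t critical value

    ci_low = y.mean() - t_crit * se
    ci_high = y.mean() + t_crit * se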

Publication form of ANOVA table for regression

Source      Sum of Squares   df   Mean Square   F        Sig.
Regression  11.479           1    11.479        21.044   0.00035
Residual    8.182            15   0.545
Total       19.661           16
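The entries in such a table can be cross-checked from the sums of squares and degrees of freedom alone; a sketch reproducing the F and Sig. columns above (assuming scipy):

    from scipy import stats

    ss_reg, df_reg = 11.479, 1
    ss_res, df_res = 8.182, 15

    F = (ss_reg / df_reg) / (ss_res / df_res)   # ≈ 21.04, matching the F column
    p = stats.f.sf(F, df_reg, df_res)           # ≈ 0.0003, matching the Sig. column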

Variance of the estimated intercept: s²(b0) = MSresidual·[1/n + X̄² / Σ(Xi − X̄)²]

Variance of the slope estimator: s²(b1) = MSresidual / Σ(Xi − X̄)²

Variance of the fitted value (Ŷ) at a given X: s²(Ŷ) = MSresidual·[1/n + (X − X̄)² / Σ(Xi − X̄)²]

Variance of the predicted value (Ỹ) for a new observation at a given X: s²(Ỹ) = MSresidual·[1 + 1/n + (X − X̄)² / Σ(Xi − X̄)²]
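A sketch computing these four variances from the quantities estimated earlier; x_new is a hypothetical new X value chosen for illustration:

    ms_residual = ss_residual / (n - 2)
    x_new = 5.0                            # hypothetical new X value

    var_b1 = ms_residual / ss_x                                         # slope
    var_b0 = ms_residual * (1 / n + x_bar ** 2 / ss_x)                  # intercept
    var_fit = ms_residual * (1 / n + (x_new - x_bar) ** 2 / ss_x)       # fitted value
    var_pred = ms_residual * (1 + 1 / n + (x_new - x_bar) ** 2 / ss_x)  # new observation

Note that the prediction variance exceeds the fitted-value variance by exactly MSresidual, the variance of a single new error.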


Assumptions of regression
• The linear model correctly describes the functional relationship between X and Y
• The X variable is measured without error
• For a given value of X, the sampled Y values are independent with normally distributed errors
• Variances are constant along the regression line

Residual plot for species-area relationship
[Figure: residual plot for the species-area regression.]
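A sketch of such a diagnostic plot, assuming matplotlib and the residuals from the fit above (the species-area data themselves are not reproduced here):

    import matplotlib.pyplot as plt

    plt.scatter(y_hat, residuals)          # residuals against fitted values
    plt.axhline(0, linestyle="--")         # reference line at zero
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residual plot")
    plt.show()
    # A patternless band of points around zero is consistent with linearity
    # and constant variance; curvature or a funnel shape suggests violations.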