Review of Building Multiple Regression Models • Generalization of univariate linear regression models. • Each unit of data has one value of the dependent variable and values of p independent variables.
Multiple Regression Model • Yi is the value of the dependent variable for the i-th unit. • The values xi1, xi2, …, xip are the values of the independent variables for the i-th unit. • Zi is an unobservable error: Yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + Zi.
Objectives • Estimate the regression coefficients β0, β1, …, βp. • Estimate σ (crucial for tests). • Test whether the regression coefficients β1, …, βp are all simultaneously zero (note that the intercept β0 is left out). • Test whether some of the regression coefficients βq, …, βp are zero.
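The first three objectives can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's SPSS procedure: the helper names `solve` and `ols` and the small data set are made up for the example, and in practice a statistics package would do this work. It solves the normal equations for the coefficients, estimates σ from the residual sum of squares with n - p - 1 degrees of freedom, and forms the F statistic for testing that β1, …, βp are all zero.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(xs, ys):
    """Return (coefficients, sigma_hat, overall F) for y = b0 + b1*x1 + ... + bp*xp + error."""
    rows = [[1.0] + list(x) for x in xs]      # prepend the intercept column
    n, k = len(rows), len(rows[0])            # k = p + 1 coefficients
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve(xtx, xty)                    # normal equations: (X'X) b = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)
    mse = sse / (n - k)                       # estimate of sigma squared
    f_overall = ((sst - sse) / (k - 1)) / mse # tests b1 = ... = bp = 0
    return beta, mse ** 0.5, f_overall


# Made-up illustration data: two predictors, each setting observed twice.
xs = [(-1, -1), (-1, -1), (1, -1), (1, -1), (-1, 1), (-1, 1), (1, 1), (1, 1)]
ys = [1, 2, 5, 6, 3, 4, 7, 9]
beta, sigma_hat, f_overall = ols(xs, ys)
print(beta, sigma_hat, f_overall)
```

The F statistic would be compared with an F distribution on p and n - p - 1 degrees of freedom; that lookup is omitted here because it needs a statistics library.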
Assumptions for Multiple Regression • Regression function is linear. • Error terms are independent. • Constant error variance. • Distribution of errors is normal.
Context of your second project • Artificial data set, available on web site. • Each set is individual. – If you analyze the wrong data set, no credit! • Three dependent variables. – Three separate sections of your report! • Six independent variables. • 500 data points with replicated observations.
Check Scatterplots • Use a scatterplot matrix to get a quick summary look. – Graphs, Scatterplot, Matrix. • If Y vs xi is flat and patternless, then your interpretation is that the regression coefficient of xi is zero. • Two of the dependent variables are random samples.
Table of regression coefficients • Contains the OLS estimates. • The line (constant) refers to β 0, the intercept. • There is a line for each variable in the model that refers to βq, the partial regression coefficient (slope) of the q-th independent variable.
Table of regression coefficients • Five columns of numbers. • Two are labeled “unstandardized coefficients”. – B column contains the OLS estimates. – Std. Error contains the estimated standard deviation of each estimate.
Table of regression coefficients • One is the standardized coefficient. – Scale-free coefficient often used in social science studies for comparison across studies. • There is a column for t. – As usual, t = (B - 0)/(Std. Error of B). • There is a column for sig. – Interpret as the p-value for testing that the coefficient is zero.
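The B, Std. Error, and t columns can be reproduced by hand as a check on the output. The sketch below is plain Python under stated assumptions: the function names `invert` and `coefficient_table` and the data are made up for illustration, and the sig. column is left out because it needs the t distribution from a statistics library. Each standard error is the square root of the mean squared error times the matching diagonal entry of the inverse of X'X.

```python
def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def coefficient_table(xs, ys):
    """Return one (B, Std. Error, t) tuple per coefficient, intercept first."""
    rows = [[1.0] + list(x) for x in xs]
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    inv = invert(xtx)
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((y - sum(bi * v for bi, v in zip(b, r))) ** 2
              for r, y in zip(rows, ys))
    mse = sse / (n - k)
    table = []
    for i in range(k):
        se = (mse * inv[i][i]) ** 0.5        # estimated std. deviation of b[i]
        table.append((b[i], se, b[i] / se))  # t = (B - 0) / Std. Error
    return table


# Made-up illustration data: two predictors, each setting observed twice.
xs = [(-1, -1), (-1, -1), (1, -1), (1, -1), (-1, 1), (-1, 1), (1, 1), (1, 1)]
ys = [1, 2, 5, 6, 3, 4, 7, 9]
table = coefficient_table(xs, ys)
for name, (b, se, t) in zip(["(constant)", "x1", "x2"], table):
    print(f"{name:12s} B={b:8.4f}  Std.Error={se:8.4f}  t={t:8.3f}")
```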
Interpretation • There appears to be an association between an independent variable and the dependent variable if the observed significance level is small for that coefficient. • In your report, specify which independent variables are significant, that is, which appear to be associated with the dependent variable.
Refinement of Model • Rerun regression using only those variables that appear to be significant. • Usually, the database of a study has many variables that have no association with the dependent variable. • Most clients prefer that these variables not be used. – There are some technical problems with this approach that are widely ignored.
Strategy of Stepwise Regression • Let the computer do the work. • In regression box, specify stepwise. • The computer will see whether additional variables can be added or added variables deleted. • There are three basic strategies: forward selection, backward selection, and stepwise.
Using Stepwise Regression • Examine final model selected. • Note which variables are included. • Examine information for excluded variables. – Check whether there is any possibility that one of the variables left out might matter.
Checking the Model • Residual plots. • Diagnostics. • Lack of Fit test.
Residual Plots • Always plot unstandardized residuals against unstandardized predicted. • Plot unstandardized residuals against each independent variable in model. • If there is a time order to data, plot residuals in time order.
Diagnostics • Check for outliers. • Check for influential points. – Cook’s distance is useful. Deleting the point with the largest Cook’s distance causes the greatest change in the coefficients. • Box plot of residuals. • Q-Q plot of residuals.
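To make Cook's distance concrete, here is a minimal plain-Python sketch for the one-predictor special case, where the leverage of each point has a simple closed form. The function name `cooks_distances` and the data are made up for illustration; with several independent variables the leverages come from the hat matrix instead, and a statistics package would normally report the distances directly.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a one-predictor least-squares fit."""
    n = len(x)
    k = 2                                    # coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - k)
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        out.append(e * e * h / (k * mse * (1 - h) ** 2))
    return out


# Made-up data: five points on a line plus one far-out point that is off it.
x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 6, 8, 10, 5]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, d)
```

In this example the last point combines high leverage with a sizeable residual, so it has by far the largest distance.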
Lack of Fit Test • Need replicated points (same settings of independent variables with different runs determining dependent variable). • Your data has replicated points. • Design your studies so that you can do a lack of fit test.
Approximate Lack of Fit Test • Statistics, Compare Means, One-way anova. • Dependent variable is residuals from regression model that you think is correct. • Independent variable is the second column of your data set. • Click OK.
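The menu steps above amount to a one-way analysis of variance of the residuals, grouped by replicate. A minimal plain-Python sketch follows, assuming the residuals have already been saved and that each observation carries a label identifying its replicate group; the function name `one_way_anova_f`, the residual values, and the labels are all made up for illustration. Because a plain one-way ANOVA uses (number of groups - 1) numerator degrees of freedom rather than adjusting for the fitted coefficients, the result is only an approximation to an exact lack of fit test.

```python
from collections import defaultdict


def one_way_anova_f(values, groups):
    """F statistic from a one-way ANOVA of values classified by group label."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    n, g = len(values), len(by_group)
    grand_mean = sum(values) / n
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    # Within-group sum of squares: pure error from the replicated runs.
    ss_within = sum((v - sum(vs) / len(vs)) ** 2
                    for vs in by_group.values() for v in vs)
    return (ss_between / (g - 1)) / (ss_within / (n - g))


# Hypothetical residuals from a fitted model, two replicates per setting.
residuals = [-0.375, 0.625, -0.625, 0.375, -0.625, 0.375, -0.875, 1.125]
setting = ["a", "a", "b", "b", "c", "c", "d", "d"]
f_stat = one_way_anova_f(residuals, setting)
print(f_stat)
```

Here the group means of the residuals are all close to zero relative to the spread within groups, so the F statistic is small and gives no evidence against the fitted model.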
Interpretation of Approximate Lack of Fit Test • If the F statistic is near one (observed significance level large), then the model that generated the residuals “appears to be adequate.” • That is, there is no empirical reason to go on. • If the F statistic is much larger than one (small observed significance level), the model should be improved.
Theory behind Lack of Fit Test • One way analysis of variance. • Covered next class. • Happy Thanksgiving.