Multiple Regression 4 Sociology 5811 Lecture 25 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission

Announcements • Schedule: – Today: Multiple regression hypothesis tests, assumptions, and problems – Next Class: More diagnostics • Including “outliers”, which you should address for the final paper. Don’t miss class! • Reminder: Final paper deadline coming up soon! • Questions about the paper?

Review: Interaction Terms • Interaction Terms: The effect of a variable changes across groups or levels of a third variable • Example: Effect of income on happiness may be different for women and men • If men are more materialistic, each dollar has a bigger effect • The issue isn’t that men are “more” or “less” than women – Rather, the slope (coefficient) for a variable (income) differs across groups • Essentially: a different regression line (slope) for each group.

Review: Interaction Terms • Visually: Women = blue, Men = red • [Scatterplot: INCOME (0–100,000) on the X axis vs. HAPPY (0–10) on the Y axis, with an overall slope fitted to all data points] • Note: Here, the slope for men and women differs. • The effect of income on happiness (X1 on Y) varies with gender (X2). This is called an “interaction effect”.

Review: Interaction Terms • Examples of interaction: – Effect of education on income may interact with type of school attended (public vs. private) • Private schooling has a bigger effect on income – Effect of aspirations on educational attainment interacts with poverty • Aspirations matter less if you don’t have money to pay for college.

Review: Interaction Terms • Interaction effects: Differences in the relationship (slope) between two variables for each category of a third variable • Option #1: Analyze each group separately • Look for different-sized slopes in each group • Option #2: Multiply the two variables of interest (DFEMALE, INCOME) to create a new variable – Called: DFEMALE*INCOME – Add that variable to the multiple regression model, as sketched below.
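
A minimal sketch of Option #2 in Python with pandas and statsmodels (the lecture uses SPSS; the data and variable names here are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey data: happiness score, income, and a female dummy
df = pd.DataFrame({
    "happy":   [6, 4, 8, 5, 7, 3, 9, 5],
    "income":  [40_000, 25_000, 90_000, 30_000, 60_000, 20_000, 95_000, 45_000],
    "dfemale": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Create the interaction term by multiplying the two variables...
df["dfemale_x_income"] = df["dfemale"] * df["income"]

# ...then include it alongside both component variables
model = smf.ols("happy ~ dfemale + income + dfemale_x_income", data=df).fit()
print(model.params)
# The income coefficient is the slope for men (DFEMALE = 0);
# dfemale_x_income shows how much the income slope differs for women
```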

Review: Interaction Terms • Example: Interaction of Race and Education affecting Job Prestige: DBLACK*EDUC has a negative effect (nearly significant). Coefficient of -.576 indicates that the slope of education and job prestige is .576 points lower for Blacks than for non-Blacks.

Continuous Interaction Terms • Two continuous variables can also interact • Example: Effect of education and income on happiness – Perhaps highly educated people are less materialistic – As education increases, the slope between income and happiness would decrease • Simply multiply Education and Income to create the interaction term “EDUCATION*INCOME” • And add it to the model.

Interpreting Interaction Terms • How do you interpret continuous variable interactions? • Example: EDUCATION*INCOME: Coefficient = 2.0 • Answer: For each unit change in education, the slope of income vs. happiness increases by 2 – Note: the coefficient is symmetrical: For each unit change in income, the education slope increases by 2 • Dummy interactions effectively estimate 2 slopes: one for each group • Continuous interactions result in many slopes: Each value of education*income yields a different slope.
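
A small numeric illustration of this interpretation (the income “baseline” slope below is an assumed value, not from the lecture’s output):

```python
# Assumed model: HAPPY = a + b_educ*EDUC + b_inc*INCOME + b_int*(EDUC*INCOME)
b_inc = 0.5   # slope of income when EDUC = 0 (assumed for illustration)
b_int = 2.0   # interaction coefficient from the slide's example

# Implied slope of income on happiness at several education levels
for educ in [0, 8, 12, 16]:
    slope = b_inc + b_int * educ
    print(f"EDUC = {educ:2d}: slope of INCOME on HAPPY = {slope}")
# Each one-unit increase in education raises the income slope by 2.0
```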

Interpreting Interaction Terms • Interaction terms alter the interpretation of “main effect” coefficients • Including “EDUC*INCOME” changes the interpretation of EDUC and of INCOME • See Allison pp. 166-9 – Specifically, the coefficient for EDUC represents the slope of EDUC when INCOME = 0 • Likewise, INCOME shows the slope when EDUC = 0 – Thus, main effects are like “baseline” slopes • And, the interaction effect coefficient shows how the slope grows (or shrinks) for a given unit change.

Dummy Interactions • It is also possible to construct interaction terms based on two dummy variables – Instead of a “slope” interaction, dummy interactions show differences in constants • The constant (not the slope) differs across values of a third variable – Example: Effect of race on school success varies by gender • African Americans do less well in school; but the difference is much larger for black males.

Dummy Interactions • Strategy for dummy interactions is the same: Multiply both variables – Example: Multiply DBLACK, DMALE to create DBLACK*DMALE • Then, include all 3 variables in the model, as in the sketch below – Effect of DBLACK*DMALE reflects the difference in constant (level) for black males, compared to white males and black females • You would observe a negative coefficient, indicating that black males fare worse in schools than black females or white males.
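
The same multiplication strategy for two dummies, as a hedged Python sketch (hypothetical data and variable names):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical school-success data with race and gender dummies
df = pd.DataFrame({
    "success": [75, 70, 68, 55, 78, 72, 66, 52],
    "dblack":  [0, 0, 1, 1, 0, 0, 1, 1],
    "dmale":   [0, 1, 0, 1, 0, 1, 0, 1],
})

# The product equals 1 only for black males
df["dblack_x_dmale"] = df["dblack"] * df["dmale"]

# Include all 3 variables; the interaction shifts the constant, not a slope
fit = smf.ols("success ~ dblack + dmale + dblack_x_dmale", data=df).fit()
print(fit.params)  # negative dblack_x_dmale = extra gap for black males
```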

Interaction Terms: Remarks • 1. If you make an interaction you should also include the component variables in the model: – A model with “DFEMALE * INCOME” should also include DFEMALE and INCOME • There are rare exceptions. But when in doubt, include them • 2. Sometimes an interaction term is highly correlated with its components • That can cause problems (multicollinearity – which we’ll discuss more soon)

Interaction Terms: Remarks • 3. Make sure you have enough cases in each group for your interaction terms – Interaction terms involve estimating slopes for subgroups (e.g., black females vs. black males). • If there are hardly any black females in the dataset, you can have problems • 4. “Three-way” interactions are also possible! • An interaction effect that varies across categories of yet another variable – Ex: DMale*DBlack interaction may vary across class • They are mainly used in experimental research settings with large sample sizes… but they are possible.

Multiple Regression Hypothesis Tests • Hypothesis tests can be conducted independently for all slopes (b) of X variables • For X1, X2…Xk, we can test hypotheses for b1, b2…bk • Null/Alternative hypotheses are the same: • H0: bk = 0 • H1: bk ≠ 0; Or, one-tailed tests: H1: bk > 0, H1: bk < 0 • Hypothesis tests are about the slope controlling for other variables in the model • Sometimes people explicitly mention this in hypotheses • NOTE: Results with “controls” may differ from bivariate hypothesis tests!

Multiple Regression Hypothesis Tests • Formula for MV hypothesis tests: t = bk / sbk, with df = N – K – 1 • Where b is a slope, sb is a standard error, k represents the kth independent variable, and K = total number of independent variables • T-test degrees of freedom depend on N and the number of independent variables • Compare the observed t-value to the critical t; or p to alpha (a).
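
A hedged sketch of that computation in Python (the slope, standard error, N, and K values below are assumed for illustration):

```python
from scipy import stats

# t = b_k / s_bk, with df = N - K - 1
b, se = 0.576, 0.30   # slope estimate and its standard error (assumed)
N, K = 500, 4         # sample size and number of independent variables

t = b / se
df = N - K - 1
p_two_tailed = 2 * stats.t.sf(abs(t), df)
print(f"t = {t:.2f}, df = {df}, p = {p_two_tailed:.4f}")
# Reject H0: b_k = 0 if p falls below your chosen alpha
```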

Multiple Regression Estimation • Calculating b’s involves solving a set of equations to minimize squared error • Analogous to bivariate, but the math is more complex • The optimal estimator has minimum variance and is referred to as “BLUE”: • Best Linear Unbiased Estimate • BLUE multiple regression requires more assumptions than bivariate regression.

Multiple Regression Assumptions • As discussed in Knoke, p. 256 • Note: Allison refers to error (e) as disturbance (U), and uses slightly different language… but the ideas are the same! • 1.a. Linearity: The relationship between dependent and independent variables is linear • Just like bivariate regression • Points don’t all have to fall exactly on the line; but error (disturbance) must not have a pattern – Check scatterplots of X’s and error (residual) • Watch out for non-linear trends: error is systematically negative (or positive) for certain ranges of X • There are strategies to cope with non-linearity, such as including X and X-squared to model a curved relationship (sketched below).
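
A minimal sketch of the X-plus-X-squared strategy on simulated data (the values are simulated for illustration, not the lecture’s data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated curved relationship: Y rises with X but levels off
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x - 0.15 * x**2 + rng.normal(0, 1, 200)

# Include both X and X-squared to model the curve
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # const, X, X^2
# A significant X^2 coefficient signals a non-linear relationship
```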

Multiple Regression Assumptions • 1.b. And, the model is properly specified: – No extra variables are included in the model, and no important variables are omitted. This is HARD! • Correct model specification is critical • If an important variable is left out of the model, results are biased (“omitted variable bias”) – Example: If we model job prestige as a function of family wealth, but do not include education • The coefficient estimate for wealth would be biased – Use theory and previous research to decide what critical variables must be included in your model.

Multiple Regression Assumptions • 2. All variables are measured without error • Unfortunately, error is common in measures – Survey questions can be biased – People give erroneous responses (or lie) – Aggregate statistics (e.g., GDP) can be inaccurate • This assumption is often violated to some extent – We do the best we can: • Design surveys well, use the best available data • And, there are advanced methods for dealing with measurement error.

Multiple Regression Assumptions • 3. The error term (ei) has certain properties • Recall: error is a case’s deviation from the regression line – Not the same as measurement error! • After you run a regression, SPSS can tell you the error value for any or all cases (called the “residual”) • 3.a. Error is conditionally normal – For bivariate, we looked to see if Y was conditionally normal. For multivariate regression, we look to see if error is conditionally normal • Examine “residuals” (ei) for normality at different values of X variables.

Multiple Regression Assumptions • 3.b. The error term (ei) has a mean of 0 – This affects the estimate of the constant (not a huge problem) • This is not a critical assumption to test • 3.c. The error term (ei) is homoskedastic (has constant variance) • Note: This affects standard error estimates, hypothesis tests – Look at residuals, to see if they spread out with changing values of X • Or plot standardized residuals vs. standardized predicted values, as sketched below.
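
A hedged sketch of that residual plot on simulated heteroskedastic data (the lecture does this in SPSS; matplotlib and statsmodels stand in here):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data where error variance grows with X (heteroskedastic)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)   # error spread widens as x grows

fit = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = fit.get_influence().resid_studentized_internal

# Standardized residuals vs. standardized predicted values
pred = fit.fittedvalues
plt.scatter((pred - pred.mean()) / pred.std(), std_resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values")
plt.ylabel("Standardized residuals")
plt.show()
# A funnel shape (residuals fanning out) suggests heteroskedasticity
```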

Multiple Regression Assumptions • 3.d. Predictors (Xi) are uncorrelated with error – This most often happens when we leave out an important variable that is correlated with another Xi – Example: Predicting job prestige with family wealth, but not including education – Omission of education will affect the error term. Those with lots of education will have large positive errors. • Since wealth is correlated with education, it will be correlated with that error! – Result: the coefficient for family wealth will be biased (vastly overestimated).

Multiple Regression Assumptions • 4. In systems of equations, error terms of equations are uncorrelated • Knoke, p. 256 – This is not a concern for us in this class • Worry about that later!

Multiple Regression Assumptions • 5. Sample is independent, errors are random • Technically, part of 3.c. – Not only should errors not increase with X (heteroskedasticity), there should be no pattern at all! • Things that cause patterns in error (autocorrelation): – Measuring data over long periods of time (e.g., every year). Error from nearby years may be correlated. • Called: “serial correlation” (a diagnostic sketch follows below).
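
One common diagnostic for serial correlation (not shown on the slide, added here as a hedged illustration) is the Durbin-Watson statistic:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Simulated yearly series whose errors carry over from year to year
years = np.arange(50, dtype=float)
e = np.zeros(50)
for t in range(1, 50):
    e[t] = 0.8 * e[t - 1] + rng.normal()   # serially correlated error
y = 1 + 0.5 * years + e

fit = sm.OLS(y, sm.add_constant(years)).fit()
print(f"Durbin-Watson = {durbin_watson(fit.resid):.2f}")
# Values near 2 suggest no serial correlation; values well below 2
# (as here) suggest positive autocorrelation between nearby years
```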

Multiple Regression Assumptions • More things that cause patterns in error (autocorrelation): – Measuring data in families. All members are similar and will have correlated error – Measuring data in geographic space. • Example: data on 50 US states. States in a similar region have correlated error • Called “spatial autocorrelation” • There are variations of regression models to address each kind of correlated error.

Multiple Regression Assumptions • Regression assumptions and final projects: • 1. Check all your assumptions… but present results for only 1 or 2 X variables • 2. Multivariate assumption checks involve plots of e (“error” or “residual”) to test linearity, heteroskedasticity • This contrasts with bivariate regression, where you plotted X vs. Y • Don’t forget to focus on “e”! • 3. Also, you should check for outliers • To be discussed soon!

Regression: Outliers • Note: Even if regression assumptions are met, slope estimates can have problems • Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample • More formally: “influential cases” • Outliers can result from: • Errors in coding or data entry • Highly unusual cases • Or, sometimes they reflect important “real” variation • Even a few outliers can dramatically change estimates of the slope, especially if N is small.

Regression: Outliers • Outlier Example: [Scatterplot: an extreme case pulls the regression line up; a second line shows the regression with the extreme case removed from the sample]

Regression: Outliers • Strategy for identifying outliers: • 1. Look at scatterplots or regression partial plots for extreme values • Easiest. A minimum for final projects • 2. Ask SPSS to compute outlier diagnostic statistics – Examples: “Leverage”, Cook’s D, DFBETA, residuals, standardized residuals.

Regression: Outliers • SPSS Outlier strategy: Go to Regression – Save – Choose “influence” and “distance” statistics such as Cook’s Distance, DFFIT, standardized residual – Result: SPSS will create new variables with values of Cook’s D, DFFIT for each case – High values signal potential outliers – Note: This is less useful if you have a VERY large dataset, because you have to look at each case value.

Scatterplots • Example: Study time and student achievement. – X variable: Average # hours spent studying per day – Y variable: Score on reading test

Case:  1     2     3     4     5     6     7
X:     2.60  1.40  .65   4.10  .25   1.90  3.50
Y:     28    13    17    31    8     16    6

[Scatterplot of X (0–4) vs. Y (0–30)]

Outliers • Results with outlier: [regression output shown on slide]

Outlier Diagnostics • Residuals: The numerical value of the error – Error = the distance that a point falls from the line – Cases with unusually large error may be outliers – Note: residuals have many other uses! • Standardized residuals – Z-score of residuals… converts to a neutral unit – Often, standardized residuals larger than 3 are considered worthy of scrutiny • But, it isn’t the best outlier diagnostic.

Outlier Diagnostics • Cook’s D: Identifies cases that are strongly influencing the regression line – SPSS calculates a value for each case • Go to the “Save” menu, click on Cook’s D • How large of a Cook’s D is a problem? – Rule of thumb: Values greater than 4 / (N – K – 1) – Example: N = 7, K = 1: Cut-off = 4/5 = .80 – Cases with higher values should be examined.

Outlier Diagnostics • Example: Outlier/Influential Case Statistics

Hours      2.60   1.40    .65   4.10    .25   1.90   3.50
Score        28     13     17     31      8     16      6
Resid      9.32  -1.97   4.33   7.70  -3.43  -.515  -15.4
Std Resid  1.01  -.215   .473   .841  -.374  -.056  -1.68
Cook’s D   .124   .006   .070   .640   .082  .0003   .941
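
These statistics can also be reproduced outside SPSS; a hedged Python sketch using the slide’s seven cases and statsmodels’ influence tools:

```python
import numpy as np
import statsmodels.api as sm

# The slide's study-time example: hours studied (X) and test score (Y)
hours = np.array([2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50])
score = np.array([28, 13, 17, 31, 8, 16, 6])

fit = sm.OLS(score, sm.add_constant(hours)).fit()
infl = fit.get_influence()
cooks_d = infl.cooks_distance[0]             # Cook's D for each case
std_resid = infl.resid_studentized_internal  # standardized residuals

cutoff = 4 / (len(score) - 1 - 1)            # 4 / (N - K - 1) = .80 here
for i in range(len(score)):
    flag = " <-- examine" if cooks_d[i] > cutoff else ""
    print(f"Case {i+1}: resid={fit.resid[i]:7.2f}, "
          f"std resid={std_resid[i]:6.2f}, Cook's D={cooks_d[i]:.3f}{flag}")
# Case 7 (3.5 hours, score 6) exceeds the .80 cutoff, matching the table
```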

Outliers • Results with outlier removed: [regression output shown on slide]

Regression: Outliers • Question: What should you do if you find outliers? Drop outlier cases from the analysis? Or leave them in? – Obviously, you should drop cases that are incorrectly coded or erroneous – But, generally speaking, you should be cautious about throwing out cases • If you throw out enough cases, you can produce any result that you want! So, be judicious when destroying data.

Regression: Outliers • Circumstances where it can be good to drop outlier cases: • 1. Coding errors • 2. Single extreme outliers that radically change results • Your results should reflect the dataset, not one case! • 3. If there is a theoretical reason to drop cases – Example: In analysis of economic activity, communist countries may be outliers • If the study is about “capitalism”, they should be dropped.

Regression: Outliers • Circumstances when it is good to keep outliers • 1. If they form a meaningful cluster – Often suggests an important subgroup in your data • Example: Asian-Americans in a dataset on education • In such a case, consider adding a dummy variable for them – Unless, of course, the research design is not interested in that sub-group… then drop them! • 2. If there are many – Maybe they reflect a “real” pattern in your data.

Regression: Outliers • When in doubt: Present results both with and without outliers • Or present one set of results, but mention how results differ depending on how outliers were handled • For final projects: Check for outliers! • At least with scatterplots • But, a better strategy is to use partial plots and Cook’s D (or similar statistics) – In the text: Mention if there were outliers, how you handled them, and the effect it had on results.

Extra Slides

Review • Types of regression variables, and interpretation of coefficients: • 1. Normal variable coefficient: Reflects the slope of the line relating one variable to the dependent var • The effect of a 1-point change in X on Y • 2. Dummy variable: Reflects the difference in the constant for a group compared to the omitted group • Here, the effect is the difference in constant (level) of Y for different groups.

Review • 3. Interaction term: Dummy * Continuous: Indicates differences in slope for different groups • Example: DFEMALE*Education affecting income – Coefficient indicates the difference in slope for the dummy group compared to the slope of the reference group • 4. Interaction term: Dummy * Dummy: Indicates differences in the constant • Example: DFEMALE*DBLACK – Coefficient indicates the difference in constant between black females and black males (and white females).

Review • 5. Interaction term: Continuous * Continuous: Indicates differences in slope for different values of the other variable • Example: Parents’ Wealth*Education affecting income – Coefficient indicates the difference in slope for each unit change in the other continuous variable.

Log Transformations • 1. Linearity and log transformations: When should you log your variables? • There are two common reasons: – 1. To reduce extreme skewness (which often leads to non-linearity) – 2. For variables where the social meaning is clearly non-linear.

Log Transformations • Example: Country GDP per capita – Highly skewed – Also, a shift from $1,000 to $2,000 is much more socially significant than a shift from $30,000 to $31,000 • Other example: wages (interval) • Log transformations should be used judiciously • Don’t log all variables to achieve a modest improvement in linearity.
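
A tiny numeric sketch of why logging captures this (the values are illustrative, not real GDP figures):

```python
import numpy as np

# Illustrative GDP-per-capita values, including the slide's two comparisons
gdp = np.array([500, 1_000, 2_000, 30_000, 31_000])
log_gdp = np.log(gdp)

print(np.diff(log_gdp).round(3))
# Doubling from $1,000 to $2,000 is a big step on the log scale (0.693),
# while $30,000 -> $31,000 is a tiny one (0.033) -- matching the social
# significance the slide describes
```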

Multiple Regression Problems • Another common regression problem: Multicollinearity • Definition: collinear = highly correlated – Multicollinearity = inclusion of highly correlated independent variables in a single regression model • Recall: High correlation of X variables causes problems for estimation of slopes (b’s) – Recall: denominators approach zero, and coefficients may be wrong/too large.

Multiple Regression Problems • Multicollinearity symptoms: – Addition of a new variable to the model causes other variables to change wildly • Note: occasionally a major change is expected (e.g., if a key variable is added, or for continuous interaction terms) – A variable that typically has a small effect, but when paired with another variable, BOTH have big effects in opposite directions.

Multiple Regression Problems • Diagnosing multicollinearity: • 1. Look at correlations of all independent vars – Watch out for variables with correlations above .7 – Correlations of over .9 are really bad • 2. Use advanced tools: – Tolerances, VIF (Variance Inflation Factor), sketched below • 3. Watch out for symptoms mentioned previously.
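
A hedged sketch of the VIF diagnostic in Python (the lecture uses SPSS; the simulated data and the VIF-above-10 rule of thumb are assumptions, not from the slides):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)

# Simulated predictors: x2 is nearly a copy of x1 (collinear); x3 is not
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.1f}")
# A common rule of thumb: VIF above ~10 (tolerance below .1) signals
# trouble; x1 and x2 should show very large VIFs here
```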

Multiple Regression Problems • Solutions to multicollinearity – It can be difficult if a fully specified model requires several collinear variables • 1. Drop unnecessary variables • 2. If two collinear variables are really measuring the same thing, drop one or make an index – Example: Attitudes toward recycling; attitude toward pollution. Perhaps they both reflect “environmental views” • 3. Advanced techniques: e.g., Quantile regression, Ridge regression.

Entering Variables Into Regressions • Question: For final papers, how should you enter variables into a regression? • Forward, backward, stepwise, or all at once? – I recommend entering variables all at once, rather than using an automated procedure • Automated procedures are more useful for advanced models – It is often interesting to present more than one model • Example: Show how coefficients change with the addition of new variables (sketched below).
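
A hedged sketch of presenting nested models in Python (simulated data and hypothetical variable names; the point is the workflow, not the numbers):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulated stand-in for a real dataset
n = 200
df = pd.DataFrame({
    "educ":    rng.integers(8, 20, n),
    "dfemale": rng.integers(0, 2, n),
})
df["income"] = 2_000 * df["educ"] + rng.normal(0, 10_000, n)
df["prestige"] = 2 * df["educ"] + 0.0005 * df["income"] + rng.normal(0, 5, n)

# Enter all variables deliberately, presenting models side by side
m1 = smf.ols("prestige ~ educ", data=df).fit()
m2 = smf.ols("prestige ~ educ + income", data=df).fit()
m3 = smf.ols("prestige ~ educ + income + dfemale", data=df).fit()
for i, m in enumerate([m1, m2, m3], start=1):
    print(f"Model {i}: R^2 = {m.rsquared:.3f}")
    print(m.params.round(4), "\n")
# Watch how the educ coefficient changes once income is controlled
```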