Multiple Regression 1 Sociology 8811 Copyright © 2007 by Evan Schofer Do not copy or distribute without permission
Announcements • None!
The Multiple Regression Model • Regression model for K independent variables: $Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_K X_{Ki} + e_i$
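A minimal sketch of fitting this model by least squares, assuming NumPy is available. The data are hypothetical and noise-free (generated from Y = 1 + 2·X1 + 3·X2), so the recovered constant and slopes can be checked by eye.

```python
import numpy as np

# Hypothetical, noise-free data generated from Y = 1 + 2*X1 + 3*X2.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 0.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the constant (a), then X1 and X2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of (a, b1, b2).
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs
print(a, b1, b2)
```

With real data the residuals are not zero, and a statistics package would also report standard errors, t-values, and R-square.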
Multiple Regression Slopes • Let’s look more closely at the slope formulas: • What happens to b1 if X1 and X2 are totally uncorrelated? • Answer: The formula reduces to the bivariate slope formula • What if X1 and X2 are correlated with each other AND X2 is more correlated with Y than X1? • Answer: b1 gets smaller (compared to the bivariate slope)
Regression Slopes • So, if two variables (X1, X2) are correlated and both predict Y: • The X variable that is more correlated with Y will have a higher slope in multivariate regression – The slope of the less-correlated variable will shrink • Thus, the slope for each variable is adjusted for how well the other variable predicts Y – It is the slope “controlling” for other variables.
Multiple Regression Slopes • One last thing to keep in mind… • What happens to b1 if X1 and X2 are almost perfectly correlated? • Answer: The denominator approaches zero • The slope “blows up”, approaching infinity • Highly correlated independent variables can cause trouble for regression models… watch out
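For reference, the slope of X1 in a two-predictor regression is usually written as below. This is the standard textbook form and is assumed, not confirmed, to match the formula on the original slide; $s$ denotes a standard deviation and $r$ a correlation.

```latex
b_1 = \left(\frac{s_Y}{s_{X_1}}\right)
      \frac{r_{YX_1} - r_{YX_2}\, r_{X_1X_2}}{1 - r_{X_1X_2}^{2}}
```

When $r_{X_1X_2} = 0$, the fraction reduces to $r_{YX_1}$ and the expression becomes the bivariate slope $r_{YX_1}\, s_Y / s_{X_1}$. When X2 is strongly correlated with both Y and X1, the subtracted term $r_{YX_2}\, r_{X_1X_2}$ pulls the numerator down.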
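A small numeric sketch of the “blow up”, assuming the standard two-predictor slope formula (denominator 1 − r² for the correlation between X1 and X2). The correlations are hypothetical values chosen only to show the trend.

```python
def slope_b1(r_y1, r_y2, r_12, s_y=1.0, s_1=1.0):
    """Two-predictor slope for X1, from correlations and standard deviations."""
    return (s_y / s_1) * (r_y1 - r_y2 * r_12) / (1.0 - r_12 ** 2)

# Hypothetical correlations of X1 and X2 with Y, held fixed
# while the correlation between X1 and X2 moves toward 1.
r_y1, r_y2 = 0.5, 0.4
levels = (0.0, 0.9, 0.99)

denominators = {r: 1.0 - r ** 2 for r in levels}
slopes = {r: slope_b1(r_y1, r_y2, r) for r in levels}

for r in levels:
    print(r, denominators[r], slopes[r])
```

As the denominator shrinks toward zero, small changes in the correlations produce large swings in the estimated slope.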
Interpreting Results • (Over)Simplified rules for interpretation – Assumes good sample, measures, models, etc. • Multivariate regression with two variables: A, B • If the slopes of A and B are the same as in the bivariate regressions, then each has an independent effect • If A remains large and B shrinks to zero, we typically conclude that the effect of B was spurious, or operates through A • If both A and B shrink a little, each has an effect, but some overlap or mediation is occurring
Interpreting Multivariate Results • Things to watch out for: • 1. Remember: Correlation is not causation – Ability to “control” for many variables can help detect spurious relationships… but it isn’t perfect. – Be aware that other (omitted) variables may be affecting your model. Don’t over-interpret results. • 2. Reverse causality – Many sociological processes involve bi-directional causality. Regression slopes (and correlations) do not identify which variable “causes” the other. • Ex: self-esteem and test scores.
Standardized Regression Coefficients • Regression slopes reflect the units of the independent variables • Question: How do you compare how “strong” the effects of two variables are if they have totally different units? • Example: Education, family wealth, job prestige – Education measured in years, b = 2.5 – Family wealth measured on a 1-5 scale, b = .18 – Which is a “bigger” effect? Units aren’t comparable! • Answer: Create “standardized” coefficients
Standardized Regression Coefficients • Standardized Coefficients – Also called “Betas” or “Beta Weights” – Symbol: Greek b with asterisk: b* – Equivalent to Z-scoring (standardizing) all variables, independent and dependent, before doing the regression • Formula of coefficient for Xj: $b_j^{*} = b_j \, (s_{X_j} / s_Y)$ • Result: The unit is standard deviations • Betas: Indicate the effect of a 1 standard deviation change in Xj on Y, measured in standard deviations of Y
Standardized Regression Coefficients • Ex: Education, family income, and job prestige: • An increase of 1 standard deviation in Education results in a .52 standard deviation increase in job prestige • What is the interpretation of the “family income” beta? • Betas give you a sense of which variables “matter most”
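A minimal sketch, assuming NumPy, that checks the two routes to a beta against each other: rescaling the unstandardized slope by s_X / s_Y, and rerunning the regression on Z-scored variables. The data, coefficients, and the helper names `ols` and `zscore` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return [constant, slopes...] from a least-squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def zscore(v):
    """Standardize each column to mean 0 and standard deviation 1."""
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)

# Hypothetical data: two predictors on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
y = 10.0 + 2.0 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(size=200)

b = ols(X, y)[1:]                                   # unstandardized slopes
betas_formula = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
betas_zscored = ols(zscore(X), zscore(y))[1:]       # slopes on Z-scores

print(betas_formula, betas_zscored)
```

The two sets of betas agree up to floating-point error, which is why standardizing first and rescaling afterward are interchangeable.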
R-Square in Multiple Regression • Multivariate R-square is much like bivariate: $R^2 = SS_{\text{regression}} / SS_{\text{total}}$ • But, SSregression is based on the multivariate regression • The addition of new variables results in better prediction of Y, less error (e), higher R-square.
R-Square in Multiple Regression • Example: • R-square of .272 indicates that education and parents’ wealth explain 27% of the variance in job prestige • “Adjusted R-square” is a more conservative, more accurate measure in multiple regression – Generally, you should report Adjusted R-square.
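A short sketch of the usual adjusted R-square correction, 1 − (1 − R²)(N − 1)/(N − K − 1). The R-square of .272 and the two predictors come from the example above; the sample size of 1000 is a hypothetical value, since the slide does not report N.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Penalize R-square for the number of predictors in the model."""
    return 1.0 - (1.0 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# R-square and K from the example; n_cases=1000 is an assumed sample size.
adj = adjusted_r_square(0.272, n_cases=1000, n_predictors=2)
print(round(adj, 4))
```

The adjustment is small when N is large relative to K and grows as more predictors are added to a small sample.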
Dummy Variables • Question: How can we incorporate nominal variables (e.g., race, gender) into regression? • Option 1: Analyze each sub-group separately – Generates a different slope and constant for each group • Option 2: Dummy variables – “Dummy” = a dichotomous variable coded to indicate the presence or absence of something – Absence coded as zero, presence coded as 1.
Dummy Variables • Strategy: Create a separate dummy variable for each nominal category • Ex: Gender – make female & male variables – DFEMALE: coded as 1 for all women, zero for men – DMALE: coded as 1 for all men, zero for women • Next: Include all but one of the dummy variables in a multiple regression model • If two dummies, include 1; if 5 dummies, include 4.
Dummy Variables • Question: Why can’t you include DFEMALE and DMALE in the same regression model? • Answer: They are perfectly correlated (negatively): r = -1 – Result: Regression model “blows up” • For any set of nominal categories, a full set of dummies contains redundant information – DMALE and DFEMALE contain same information – Dropping one removes redundant information.
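A minimal sketch of the redundancy, assuming NumPy and a tiny hypothetical sample. With a constant in the model, adding DMALE on top of DFEMALE contributes no new column information, which is what makes the full set of dummies unusable.

```python
import numpy as np

# Hypothetical sample of five cases.
gender = ["f", "m", "f", "f", "m"]
dfemale = np.array([1.0 if g == "f" else 0.0 for g in gender])
dmale = 1.0 - dfemale
const = np.ones(len(gender))

# The two dummies are perfectly negatively correlated.
r = np.corrcoef(dfemale, dmale)[0, 1]

# Three columns, but only two independent pieces of information.
full = np.column_stack([const, dfemale, dmale])
reduced = np.column_stack([const, dfemale])
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(r, rank_full, rank_reduced)
```

Both design matrices have rank 2, so the third column in the full set cannot be given its own coefficient.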
Dummy Variables: Interpretation • Consider the following regression equation: $Y_i = a + b_1 X_i + b_2\,\text{DFEMALE}_i + e_i$ • Question: What if the case is a male? • Answer: DFEMALE is 0, so the entire term becomes zero. – Result: Males are modeled using the familiar regression model: $a + b_1 X + e$.
Dummy Variables: Interpretation • Consider the following regression equation: $Y_i = a + b_1 X_i + b_2\,\text{DFEMALE}_i + e_i$ • Question: What if the case is a female? • Answer: DFEMALE is 1, so $b_2(1)$ stays in the equation (and is added to the constant) – Result: Females are modeled using a different regression line: $(a + b_2) + b_1 X + e$ – Thus, the coefficient b2 reflects the difference in the constant for women.
Dummy Variables: Interpretation • Remember, a different constant generates a different line, either higher or lower – Variable: DFEMALE (women = 1, men = 0) – A positive coefficient (b) indicates that women are consistently higher compared to men (on the dependent variable) – A negative coefficient indicates women are lower • Example: If DFEMALE coeff = 1.2: – “Women are on average 1.2 points higher than men”.
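A minimal sketch of the shifted constant, assuming NumPy. The data are hypothetical and noise-free: happiness is built as 2 + 0.5·INCOME + 1.2·DFEMALE, with income in arbitrary units, so the fit should return exactly those values.

```python
import numpy as np

# Hypothetical, noise-free data (income in arbitrary units).
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
dfemale = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
happy = 2.0 + 0.5 * income + 1.2 * dfemale

# Constant, income slope, and DFEMALE coefficient.
X = np.column_stack([np.ones_like(income), income, dfemale])
a, b1, b2 = np.linalg.lstsq(X, happy, rcond=None)[0]

# Two parallel lines: same slope b1, constants a (men) and a + b2 (women).
men_constant = a
women_constant = a + b2
print(men_constant, women_constant, b1)
```

The gap between the two lines is b2 at every income level, which is why the dummy coefficient reads as a difference in level and not as a slope.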
Dummy Variables: Interpretation • Visually: [Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100,000); women = blue, men = red; overall slope shown for all data points] • Note: The lines for men and women have the same slope… but one is higher and the other is lower. The constant differs! • If women=1, men=0: The constant (a) reflects men only. The dummy coefficient (b) reflects the increase for women (relative to men)
Dummy Variables • What if you want to compare more than 2 groups? • Example: Race – Coded 1=white, 2=black, 3=other (like GSS) • Make 3 dummy variables: – “DWHITE” is 1 for whites, 0 for everyone else – “DBLACK” is 1 for African Americans, 0 for everyone else – “DOTHER” is 1 for “others”, 0 for everyone else • Then, include two of the three variables in the multiple regression model.
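A minimal sketch of the three-category coding, assuming NumPy. The race codes follow the 1/2/3 scheme above; the prestige scores are hypothetical. With DWHITE left out, the constant is the mean for whites and each dummy coefficient is that group’s difference from whites.

```python
import numpy as np

# Race coded 1=white, 2=black, 3=other; prestige scores are hypothetical.
race = np.array([1, 1, 1, 2, 2, 3, 3, 3])
prestige = np.array([50.0, 52.0, 48.0, 40.0, 44.0, 45.0, 47.0, 46.0])

# One dummy per category.
dwhite = (race == 1).astype(float)
dblack = (race == 2).astype(float)
dother = (race == 3).astype(float)

# Every case belongs to exactly one category.
one_category_each = np.all(dwhite + dblack + dother == 1.0)

# Include two of the three dummies; whites are the omitted category.
X = np.column_stack([np.ones(len(race)), dblack, dother])
a, b_black, b_other = np.linalg.lstsq(X, prestige, rcond=None)[0]
print(a, b_black, b_other)
```

Here the group means are 50, 42, and 46, so the fit returns a constant of 50 and coefficients of −8 and −4, each a contrast with the omitted group.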
Dummy Variables: Interpretation • Ex: Job Prestige • Negative coefficient for DBLACK indicates a lower level of job prestige compared to whites – T- and P-values indicate if difference is significant.
Dummy Variables: Interpretation • Comments: • 1. Dummy coefficients shouldn’t be called slopes – Referring to the “slope” of gender doesn’t make sense – Rather, it is the difference in the constant (or “level”) • 2. The contrast is always with the nominal category that was left out of the equation – If DFEMALE is included, the contrast is with males – If DBLACK, DOTHER are included, coefficients reflect difference in constant compared to whites.
Interaction Terms • Question: What if you suspect that a variable has a totally different slope for two different subgroups in your data? • Example: Income and Happiness – Perhaps men are more materialistic, so an extra dollar increases their happiness a lot – If women are less materialistic, each dollar has a smaller effect on happiness (compared to men) • Issue isn’t men = “more” or “less” than women – Rather, the slope of a variable (income) differs across groups
Interaction Terms • Issue isn’t men = “more” or “less” than women – Rather, the slope of a variable (income) differs across groups • Again, we want to specify a different regression line for each group – We want lines with different slopes, not parallel lines that are higher or lower.
Interaction Terms • Visually: [Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100,000); women = blue, men = red; overall slope shown for all data points] • Note: Here, the slope for men and women differs. • The effect of income on happiness (X1 on Y) varies with gender (X2). This is called an “interaction effect”
Interaction Terms • Examples of interaction: – Effect of education on income may interact with type of school attended (public vs. private) • Education may have a bigger effect on income for those who attended private school – Effect of aspirations on educational attainment interacts with poverty • Aspirations matter less if you don’t have money to pay for college • Question: Can you think of examples of two variables that might interact? • Either from your final project? Or anything else?
Interaction Terms • Interaction effects: Differences in the relationship (slope) between two variables for each category of a third variable • Option #1: Analyze each group separately • Look for different sized slope in each group • Option #2: Multiply the two variables of interest: (DFEMALE, INCOME) to create a new variable – Called: DFEMALE*INCOME – Add that variable to the multiple regression model.
Interaction Terms • Consider the following regression equation: $Y_i = a + b_1\,\text{INCOME}_i + b_2\,(\text{DFEM}_i \times \text{INC}_i) + e_i$ • Question: What if the case is male? • Answer: DFEMALE is 0, so $b_2(\text{DFEM} \times \text{INC})$ drops out of the equation – Result: Males are modeled using the ordinary regression equation: $a + b_1 X + e$.
Interaction Terms • Consider the following regression equation: $Y_i = a + b_1\,\text{INCOME}_i + b_2\,(\text{DFEM}_i \times \text{INC}_i) + e_i$ • Question: What if the case is female? • Answer: DFEMALE is 1, so $b_2(\text{DFEM} \times \text{INC})$ becomes $b_2 \times \text{INCOME}$, which is added to b1 – Result: Females are modeled using a different regression line: $a + (b_1 + b_2) X + e$ – Thus, the coefficient b2 reflects the difference in the slope of INCOME for women.
Interpreting Interaction Terms • Interpreting interaction terms: • A positive b for DFEMALE*INCOME indicates the slope for income is higher for women vs. men – A negative effect indicates the slope is lower – Size of coefficient indicates actual difference in slope • Example: DFEMALE*INCOME. Observed b’s: – Income: b = .5 – DFEMALE * INCOME: b = -.2 • Interpretation: Slope is .5 for men, .3 for women.
Interpreting Interaction Terms • Example: Interaction of Race and Education affecting Job Prestige: DBLACK*EDUC has a negative effect (nearly significant). Coefficient of -.576 indicates that the slope of education on job prestige is .576 points lower for Blacks than for non-Blacks.
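A minimal sketch of the .5 / −.2 example, assuming NumPy. The data are hypothetical and noise-free, built so that the income slope is .5 for men and .3 for women. The model here also includes DFEMALE itself, which is the usual practice when an interaction term is in the model; its true coefficient in this constructed data is zero.

```python
import numpy as np

# Hypothetical, noise-free data (income in arbitrary units).
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
dfemale = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
interaction = dfemale * income          # the DFEMALE*INCOME variable
happy = 1.0 + 0.5 * income - 0.2 * interaction

# Constant, INCOME, DFEMALE, and DFEMALE*INCOME.
X = np.column_stack([np.ones_like(income), income, dfemale, interaction])
a, b_income, b_dfemale, b_inter = np.linalg.lstsq(X, happy, rcond=None)[0]

slope_men = b_income                    # DFEMALE = 0
slope_women = b_income + b_inter        # DFEMALE = 1
print(slope_men, slope_women)
```

The interaction coefficient is the difference between the two slopes, so adding it to the INCOME coefficient gives the slope for women.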
Continuous Interaction Terms • Two continuous variables can also interact • Example: Effect of education and income on happiness – Perhaps highly educated people are less materialistic – As education increases, the slope between income and happiness would decrease • Simply multiply Education and Income to create the interaction term “EDUCATION*INCOME” • And add it to the model.
Interpreting Interaction Terms • How do you interpret continuous variable interactions? • Example: EDUCATION*INCOME: Coefficient = 2.0 • Answer: For each unit change in education, the slope of income vs. happiness increases by 2 – Note: coefficient is symmetrical: For each unit change in income, the education slope increases by 2 • Dummy interactions effectively estimate 2 slopes: one for each group • Continuous interactions result in many slopes: Each value of education yields a different slope for income (and vice versa).
Interpreting Interaction Terms • Interaction terms alter the interpretation of “main effect” coefficients • Including “EDUC*INCOME” changes the interpretation of EDUC and of INCOME • See Allison p. 166-9 – Specifically, the coefficient for EDUC represents the slope of EDUC when INCOME = 0 • Likewise, INCOME shows the slope when EDUC = 0 – Thus, main effects are like “baseline” slopes • And, the interaction effect coefficient shows how the slope grows (or shrinks) for a given unit change in the other variable.
Dummy Interactions • It is also possible to construct interaction terms based on two dummy variables – Instead of a “slope” interaction, dummy interactions show difference in constants • Constant (not slope) differs across values of a third variable – Example: Effect of race on school success varies by gender • African Americans do less well in school; but the difference is much larger for black males.
Dummy Interactions • Strategy for dummy interaction is the same: Multiply both variables – Example: Multiply DBLACK, DMALE to create DBLACK*DMALE • Then, include all 3 variables in the model – Effect of DBLACK*DMALE reflects the additional difference in constant (level) for black males, beyond what the DBLACK and DMALE coefficients alone would predict • You would observe a negative coefficient, indicating that black males fare worse in schools than black females or white males.
Interaction Terms: Remarks • 1. If you make an interaction you should also include the component variables in the model: – A model with “DFEMALE * INCOME” should also include DFEMALE and INCOME • There are rare exceptions. But when in doubt, include them • 2. Sometimes interaction terms are highly correlated with their components • That can cause problems (multicollinearity, which we’ll discuss more soon)
Interaction Terms: Remarks • 3. Make sure you have enough cases in each group for your interaction terms – Interaction terms involve estimating slopes for subgroups (e.g., black females vs. black males). • If there are hardly any black females in the dataset, you can have problems • 4. “Three-way” interactions are also possible! • An interaction effect that varies across categories of yet another variable – Ex: DMale*DBlack interaction may vary across class • They are mainly used in experimental research settings with large sample sizes… but they are possible.
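The “baseline slope” reading can be seen by writing the model out and asking how Y changes with INCOME. The equation below is a sketch using assumed coefficient labels (b1, b2, b3), which do not appear on the slide.

```latex
Y = a + b_1\,\text{EDUC} + b_2\,\text{INCOME} + b_3\,(\text{EDUC} \times \text{INCOME}) + e

\frac{\partial Y}{\partial\,\text{INCOME}} = b_2 + b_3\,\text{EDUC}
\qquad
\frac{\partial Y}{\partial\,\text{EDUC}} = b_1 + b_3\,\text{INCOME}
```

Setting EDUC = 0 leaves b2 as the slope of INCOME, and setting INCOME = 0 leaves b1 as the slope of EDUC. Each one-unit increase in the other variable shifts the slope by b3, which is why the interaction coefficient is symmetrical.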
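A minimal sketch of a dummy-by-dummy interaction, assuming NumPy. The school-success scores are hypothetical, with two cases per group and group means of 80 (white females), 78 (white males), 75 (black females), and 65 (black males).

```python
import numpy as np

# Hypothetical scores, two cases in each of the four groups.
dblack = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
dmale = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
score = np.array([79.0, 81.0, 77.0, 79.0, 74.0, 76.0, 64.0, 66.0])

# Constant, both dummies, and their product.
X = np.column_stack([np.ones_like(score), dblack, dmale, dblack * dmale])
a, b_black, b_male, b_inter = np.linalg.lstsq(X, score, rcond=None)[0]

# Predicted level for black males combines all four coefficients.
black_male_level = a + b_black + b_male + b_inter
print(a, b_black, b_male, b_inter, black_male_level)
```

The constant is the white-female mean (80), DBLACK and DMALE are −5 and −2, and the interaction is −8: black males score 8 points lower than the two separate differences would predict on their own.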