The Power of Regression
Slides: 61
The Power of Regression
• Previous research literature claim
  • Foreign-owned manufacturing plants have greater levels of strike activity than domestic plants
  • In Canada, strike rates of 25.5% versus 20.3%
• Budd's claim
  • Foreign-owned plants are larger and located in strike-prone industries
  • Need multivariate regression analysis!
The Power of Regression
Dependent Variable: Strike Incidence

                                  (1)        (2)        (3)
U.S. Corporate Parent           0.230**    0.201*     0.065
  (Canadian Parent omitted)    (0.117)    (0.119)    (0.132)
Number of Employees (1000s)      ---       0.177**    0.094**
                                          (0.019)    (0.020)
Industry Effects?                No         No         Yes
Sample Size                     2,170

* Statistically significant at the 0.10 level; ** at the 0.05 level (two-tailed tests).
Important Regression Topics
• Prediction
  • Various confidence and prediction intervals
• Diagnostics
  • Are assumptions for estimation & testing fulfilled?
• Specifications
  • Quadratic terms? Logarithmic dep. vars.?
• Additional hypothesis tests
  • Partial F tests
• Dummy dependent variables
  • Probit and logit models
Confidence Intervals
• The true population [whatever] is within the following interval (1 − α)% of the time:
  Estimate ± t(α/2) × Standard Error of the Estimate
• Just need
  • Estimate
  • Standard Error
  • Shape / distribution (including degrees of freedom)
Prediction Interval for a New Observation at xp
1. Point estimate: ŷ = b0 + b1·xp
2. Standard error: Se · sqrt(1 + 1/n + (xp − x̄)² / Sxx)
3. Shape
  • t distribution with n−k−1 d.f.
4. So the prediction interval for a new observation is ŷ ± t(α/2) × standard error
Siegel, p. 481
Prediction Interval for the Mean of Observations at xp
1. Point estimate: ŷ = b0 + b1·xp
2. Standard error: Se · sqrt(1/n + (xp − x̄)² / Sxx)
3. Shape
  • t distribution with n−k−1 d.f.
4. So the prediction interval for the mean of observations at xp is ŷ ± t(α/2) × standard error
Siegel, p. 483
Earlier Example: Hours of Study (x) and Exam Score (y)
1. Find 95% CI for Joe's exam score (studies for 20 hours)
2. Find 95% CI for mean score for those who studied for 20 hours

Regression Statistics
Multiple R        0.770
R Squared         0.594
Adj. R Squared    0.543
Standard Error   10.710
Obs.                 10

ANOVA
             df        SS        MS        F    Significance
Regression    1  1340.452  1340.452   11.686     0.009
Residual      8   917.648   114.706
Total         9  2258.100

            Coeff.   Std. Error   t stat   p value   Lower 95%   Upper 95%
Intercept   39.401     12.153      3.242    0.012      11.375      67.426
hours        2.122      0.621      3.418    0.009       0.691       3.554

x̄ = 18.80
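The two intervals above can be computed from the printed output. A minimal sketch in Python, assuming t(0.025, 8) ≈ 2.306 from a t table and recovering Sxx from the identity SSregression = b1²·Sxx (an algebraic fact for simple regression, not printed by Excel):

```python
import math

# Values from the hours-of-study regression output above (n = 10, k = 1)
b0, b1 = 39.401, 2.122      # intercept and slope
s = 10.710                  # standard error of the regression
n, xbar = 10, 18.80         # sample size and mean of x
ss_reg = 1340.452           # regression sum of squares
t_crit = 2.306              # t(0.025, 8) from a t table

# Recover Sxx from SS_regression = b1^2 * Sxx
sxx = ss_reg / b1 ** 2

xp = 20.0
y_hat = b0 + b1 * xp        # point estimate at 20 hours of study

# 1. Prediction interval for one new observation (Joe's score)
se_new = s * math.sqrt(1 + 1 / n + (xp - xbar) ** 2 / sxx)
pi = (y_hat - t_crit * se_new, y_hat + t_crit * se_new)

# 2. Confidence interval for the MEAN score at 20 hours (no "1 +", so narrower)
se_mean = s * math.sqrt(1 / n + (xp - xbar) ** 2 / sxx)
ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)

print(round(y_hat, 1))                 # about 81.8
print([round(v, 1) for v in pi])       # wide interval for one student
print([round(v, 1) for v in ci])       # narrow interval for the mean
```

Note how the interval for Joe alone is much wider than the interval for the average of all 20-hour studiers, because a single observation carries the full error variance.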
Diagnostics / Misspecification
• For estimation & testing to be valid…
  • y = b0 + b1x1 + b2x2 + … + bkxk + e makes sense
  • Errors (ei) are independent
    • of each other
    • of the independent variables
  • Homoskedasticity
    • Error variance is independent of the independent variables
    • σe² is a constant: Var(ei) does not vary with xi (i.e., no heteroskedasticity)
• Violations render our inferences invalid and misleading!
Common Problems
• Misspecification
  • Omitted variable bias
  • Nonlinear rather than linear relationship
  • Levels, logs, or percent changes?
• Data problems
  • Skewed variables and outliers
  • Multicollinearity
  • Sample selection (non-random data)
  • Missing data
• Problems with residuals (error terms)
  • Non-independent errors
  • Heteroskedasticity
Omitted Variable Bias
• Question 3 from Sample Exam B:
  wage = 9.05 + 1.39 union
        (1.65)  (0.66)
  wage = 9.56 + 1.42 union + 3.87 ability
        (1.49)  (0.56)       (1.56)
  wage = -3.03 + 0.60 union + 0.25 revenue
        (0.70)  (0.45)        (0.08)
• H. Farber thinks the average union wage differs from the average nonunion wage because unionized employers are more selective and hire individuals with higher ability.
• M. Friedman thinks the average union wage differs from the average nonunion wage because unionized employers have different levels of revenue per employee.
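Farber's story is exactly the omitted-variable mechanism: ability raises both union status and wages, so leaving it out inflates the union coefficient. A small simulation sketch with purely synthetic data (the numbers and the true effect of 1.0 are invented for illustration, not Farber's estimates):

```python
import random

random.seed(1)

def slope(x, y):
    # OLS slope from a one-variable regression of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Synthetic data: ability raises both the chance of union membership and the wage
n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]
union = [1 if a + random.gauss(0, 1) > 0 else 0 for a in ability]
wage = [9 + 1.0 * u + 2.0 * a + random.gauss(0, 1) for u, a in zip(union, ability)]

# Short regression that omits ability: the union coefficient absorbs
# part of ability's effect and is biased upward
b_short = slope(union, wage)
print(round(b_short, 2))  # well above the true union effect of 1.0
```

Controlling for ability (as in the second equation on the slide) is what pulls the union coefficient back toward its true value.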
Checking the Assumptions
• How to check the validity of the assumptions?
• Cynicism, realism, and theory
• Robustness checks
  • Check different specifications
  • But don't just choose the best one!
• Automated variable selection methods
  • e.g., stepwise regression (Siegel, p. 547)
• Misspecification and other tests
• Examine diagnostic plots
Diagnostic Plots
Increasing spread might indicate heteroskedasticity. Try transformations or weighted least squares.
Diagnostic Plots
“Tilt” from outliers might indicate skewness. Try a log transformation.
Problematic Outliers
Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)

Without 7 “outliers”:
Number of obs = 44, R-squared = 0.1718
stockrating | Coef.    Std. Err.    t      P>|t|
handicap    | -1.711     .580     -2.95    0.005
_cons       | 73.234    8.992      8.14    0.000

With the 7 “outliers”:
Number of obs = 51, R-squared = 0.0017
stockrating | Coef.    Std. Err.    t      P>|t|
handicap    |  -.173     .593     -0.29    0.771
_cons       | 55.137    9.790      5.63    0.000
Are They Really Outliers?
• The diagnostic plot is OK
• BE CAREFUL!
Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)
Diagnostic Plots
Curvature might indicate nonlinearity. Try a quadratic specification.
Diagnostic Plots
Good diagnostic plot: lacks obvious indications of other problems.
Adding a Squared (Quadratic) Term
Job performance regressed on salary (in $1,000s) (Egg Data)

Number of obs = 576, F(2, 573) = 122.42, Prob > F = 0.0000
R-squared = 0.2994, Adj R-squared = 0.2969, Root MSE = 1.0218

Source   |    SS     df     MS
Model    | 255.61     2   127.80
Residual | 598.22   573    1.044
Total    | 853.83   575    1.485

job performance | Coef.       Std. Err.    t      P>|t|
salary          |  .0980844   .0260215    3.77    0.000
salary squared  | -.000337    .0001905   -1.77    0.077
_cons           | -1.720966   .8720358   -1.97    0.049

Salary Squared = Salary² (=salary^2 in Excel)
Quadratic Regression
Job perf = -1.72 + 0.098 salary − 0.00034 salary squared
Quadratic regression (nonlinear)
Quadratic Regression
Job perf = -1.72 + 0.098 salary − 0.00034 salary squared
• The effect of salary will eventually turn negative
• But where? The maximum is at salary = −(linear coeff.) / (2 × quadratic coeff.)
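The turning point follows from setting the derivative of the fitted quadratic to zero. A quick check using the full-precision coefficients from the regression output (the rounded slide coefficients would give roughly 144 instead):

```python
# Turning point of the quadratic fit: job performance is maximized where
# d(perf)/d(salary) = b1 + 2*b2*salary = 0, i.e. salary = -b1 / (2*b2)
b1 = 0.0980844      # salary coefficient
b2 = -0.000337      # salary-squared coefficient

peak = -b1 / (2 * b2)
print(round(peak, 1))  # about 145.5, i.e. roughly $145,500 (salary is in $1,000s)
```

Beyond that salary, the fitted effect of additional salary on job performance turns negative.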
Another Specification Possibility
• If data are very skewed, can try a log specification
• Can use logs instead of levels for independent and/or dependent variables
• Note that the interpretation of the coefficients will change
• Re-familiarize yourself with Siegel, pp. 68-69
Quick Note on Logs
• a is the natural logarithm of x if 2.71828^a = x, or e^a = x
• The natural logarithm is abbreviated “ln”: ln(x) = a
• In Excel, use the ln function
• We call this the “log” but don't use Excel's “log” function (that one is base 10)!
• Usefulness: spreads out small values and narrows large values, which can reduce skewness
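Both points can be checked directly, using Python's math.log as a stand-in for Excel's ln (the earnings numbers below are invented to illustrate right skew):

```python
import math

# ln and e^a are inverses
a = math.log(20)
print(round(math.e ** a, 6))  # recovers 20.0

# Logs narrow large values and spread small ones, taming right skew:
earnings = [200, 250, 300, 400, 600, 1000, 5000]  # right-skewed, sorted
logs = [math.log(v) for v in earnings]

def mean(values):
    return sum(values) / len(values)

# A mean far above the median signals right skew; in logs the gap shrinks
print(mean(earnings) - earnings[3])          # large gap in levels
print(round(mean(logs) - logs[3], 2))        # small gap after taking logs
```

This is why the earnings regressions later in the deck switch from weekly earnings to log weekly earnings.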
Earnings Distribution
Skewed to the right.
Weekly Earnings from the March 2002 CPS, n=15,000
Residuals from Levels Regression
Skewed to the right, so use of the t distribution is suspect.
Residuals from a regression of Weekly Earnings on demographic characteristics
Log Earnings Distribution
Not perfectly symmetrical, but better.
Natural Logarithm of Weekly Earnings from the March 2002 CPS, i.e., =ln(weekly earnings)
Residuals from Log Regression
Almost symmetrical, so use of the t distribution is probably OK.
Residuals from a regression of Log Weekly Earnings on demographic characteristics
Hypothesis Tests
• We've been doing hypothesis tests for single coefficients
  • H0: βi = 0; reject if |t| > t(α/2, n−k−1)
  • HA: βi ≠ 0
• What about testing more than one coefficient at the same time?
  • e.g., want to see if an entire group of 10 dummy variables for 10 industries should be in the model
• Joint tests can be conducted using partial F tests
Partial F Tests
H0: β1 = β2 = β3 = … = βC = 0
HA: at least one βi ≠ 0
• How to test this? Consider two regressions
  • One as if H0 is true, i.e., β1 = β2 = β3 = … = βC = 0
    • This is a “restricted” (or constrained) model
  • Plus a “full” (or unconstrained) model in which the computer can estimate what it wants for each coefficient
Partial F Tests
• Statistically, need to distinguish between
  • Full regression “no better” than the restricted regression
  • versus
  • Full regression “significantly better” than the restricted regression
• To do this, look at the variance of prediction errors
  • If this declines significantly, then reject H0
• From ANOVA, we know the ratio of two variances has an F distribution
  • So use an F test
Partial F Tests
• The partial F statistic is
  F = [(SSresidual,restricted − SSresidual,full) / C] / [SSresidual,full / (n−k−1)]
• SSresidual = sum of squares residual; C = number of constraints
• The partial F statistic has C, n−k−1 degrees of freedom
• Reject H0 if F > F(α, C, n−k−1)
Coal Mining Example (Again)
Regression Statistics
R Squared         0.955
Adj. R Squared    0.949
Standard Error  108.052
Obs.                 47

ANOVA
             df            SS             MS          F    Significance
Regression    6    9975694.933    1662615.822   142.406      0.000
Residual     40     467007.875      11675.197
Total        46   10442702.809

            Coeff.    Std. Error   t stat   p value   Lower 95%   Upper 95%
Intercept  -168.510    258.819     -0.651    0.519    -691.603     354.583
hours         1.244      0.186      6.565    0.000       0.001       0.002
tons          0.048      0.403      0.119    0.906      -0.001
unemp        19.618      5.660      3.466    0.001       8.178      31.058
WWII        159.851     78.218      2.044    0.048       1.766     317.935
Act 1952     -9.839    100.045     -0.098    0.922    -212.038     192.360
Act 1969   -203.010    111.535     -1.820    0.076    -428.431      22.411
Minitab Output
Predictor     Coef      St.Dev      T       P
Constant   -168.5      258.8     -0.65    0.519
hours         1.2235     0.186    6.56    0.000
tons          0.0478     0.403    0.12    0.906
unemp        19.618      5.660    3.47    0.001
WWII        159.85      78.22     2.04    0.048
Act 1952     -9.8      100.0     -0.10    0.922
Act 1969   -203.0      111.5     -1.82    0.076

S = 108.1   R-Sq = 95.5%   R-Sq(adj) = 94.9%

Analysis of Variance
Source       DF         SS         MS        F        P
Regression    6    9975695    1662616   142.41    0.000
Error        40     467008      11675
Total        46   10442703
Is the Overall Model Significant?
H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
• Note: for testing the overall model, C = k, i.e., testing all coefficients together
• From the previous slides, we have SSresidual for the “full” (or unconstrained) model
  • SSresidual = 467,007.875
• But what about for the restricted (H0 true) regression?
  • Estimate a constant-only regression
Constant-Only Model
Regression Statistics
R Squared             0
Adj. R Squared        0
Standard Error  476.461
Obs.                 47

ANOVA
             df            SS            MS       F    Significance
Regression    0             0             0       .         .
Residual     46   10442702.809    227015.278
Total        46   10442702.809

            Coeff.   Std. Error   t stat   p value   Lower 95%   Upper 95%
Intercept  671.937     69.499      9.668    0.0000    532.042     811.830
Partial F Tests
H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
F = [(10,442,702.809 − 467,007.875) / 6] / [467,007.875 / 40] = 142.406
• Reject H0 if F > F(α, C, n−k−1) = F(0.05, 6, 40) = 2.34
• 142.406 > 2.34, so reject H0. Yes, the overall model is significant
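The arithmetic of this overall-model partial F test is just the formula from two slides back with the two SSresidual values plugged in:

```python
# Overall-model partial F from the two SS_residual values above
ss_res_restricted = 10442702.809   # constant-only model (equals SS_total)
ss_res_full = 467007.875           # full coal mining model
C, n, k = 6, 47, 6                 # constraints, observations, regressors

F = ((ss_res_restricted - ss_res_full) / C) / (ss_res_full / (n - k - 1))
print(round(F, 3))  # matches the 142.406 reported in the ANOVA table
```

Since 142.406 far exceeds the 5% critical value of 2.34, the model is jointly significant.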
Select F Distribution 5% Critical Values
[Table of 5% F critical values, with numerator degrees of freedom (1, 2, 3, 4, 5, 6, …) across the top and denominator degrees of freedom (1, 2, 3, 8, 10, 11, 12, 18, 40, 1000, …) down the side. The extraction scrambled the cell values; the entry used above is F(0.05, 6, 40) = 2.34.]
A Small Shortcut
For the constant-only model, SSresidual = SStotal = 10,442,702.809, which is already reported in the full model's ANOVA table. So to test the overall model, you don't need to run a constant-only model.
[Same coal mining regression output as above.]
An Even Better Shortcut
In fact, the ANOVA table's F test (F = 142.406, Significance 0.000) is exactly the test of the overall model being significant (recall the earlier unit on ANOVA).
[Same coal mining regression output as above.]
Testing Any Subset
The partial F test can be used to test any subset of variables. For example:
H0: βWWII = βAct1952 = βAct1969 = 0
HA: at least one βi ≠ 0
[Same coal mining regression output as above.]
Restricted Model
Restricted regression with βWWII = βAct1952 = βAct1969 = 0 (Obs. 47)

ANOVA
             df            SS             MS          F    Significance
Regression    3    9837344.760    3279114.920   232.923      0.000
Residual     43     605358.049      14078.094
Total        46   10442702.809

            Coeff.   Std. Error   t stat   p value
Intercept  147.821    166.406      0.888    0.379
hours       0.0015     0.0001     20.522    0.000
tons       -0.0008     0.0003     -2.536    0.015
unemp       7.298      4.386       1.664    0.103
Partial F Tests
H0: βWWII = βAct1952 = βAct1969 = 0
HA: at least one βi ≠ 0
F = [(605,358.049 − 467,007.875) / 3] / [467,007.875 / 40] = 3.950
• Reject H0 if F > F(α, C, n−k−1) = F(0.05, 3, 40) = 2.84
• 3.95 > 2.84, so reject H0. Yes, the subset of three coefficients is jointly significant
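The same formula handles the subset test, swapping in the restricted model's SSresidual and C = 3 constraints:

```python
# Subset partial F: drop WWII, Act 1952, and Act 1969 from the full model
ss_res_restricted = 605358.049   # restricted model (three dummies removed)
ss_res_full = 467007.875         # full model
C, n, k = 3, 47, 6

F = ((ss_res_restricted - ss_res_full) / C) / (ss_res_full / (n - k - 1))
print(round(F, 2))  # 3.95, versus the 5% critical value F(3, 40) = 2.84
```

Note the contrast with the individual t-tests: Act 1952 and Act 1969 look weak one at a time, yet the three war/legislation dummies are jointly significant.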
Blocks Regression and Two-Way ANOVA
         Treatments
Block    A    B    C
1       10    9    8
2       12    6    5
3       18   15   14
4       20   18   18
5        8    7    8

“Stack” the data using dummy variables: one column per treatment (A, B, C) plus block dummies B2–B5, e.g.:
A  B  C  B2  B3  B4  B5  Value
1  0  0   0   0   0   0    10
1  0  0   1   0   0   0    12
1  0  0   0   1   0   0    18
0  1  0   0   0   0   0     9
…
Recall Two-Way Results
ANOVA: Two-Factor Without Replication
Source of Variation      SS      df     MS        F      P-value   F crit
Blocks                312.267     4   78.067   38.711     0.000     3.84
Treatment              26.533     2   13.267    6.579     0.020     4.46
Error                  16.133     8    2.017
Total                 354.933    14
Regression and Two-Way ANOVA
Number of obs = 15, F(6, 8) = 28.00, Prob > F = 0.0001
R-squared = 0.9545, Adj R-squared = 0.9205, Root MSE = 1.4201

Source   |    SS      df     MS
Model    | 338.800     6   56.467
Residual |  16.133     8    2.017
Total    | 354.933    14   25.352

treatment | Coef.    Std. Err.    t      P>|t|   [95% Conf. Int]
b         | -2.600     .898     -2.89    0.020   -4.671    -.529
c         | -3.000     .898     -3.34    0.010   -5.071    -.929
b2        | -1.333    1.160     -1.15    0.283   -4.007    1.340
b3        |  6.667    1.160      5.75    0.000    3.993    9.340
b4        |  9.667    1.160      8.34    0.000    6.993   12.340
b5        | -1.333    1.160     -1.15    0.283   -4.007    1.340
_cons     | 10.867     .970     11.20    0.000    8.630   13.104
Regression and Two-Way ANOVA
Use these SSresidual values to do partial F tests and you will get exactly the same answers as the two-way ANOVA tests.

Regression excerpt for the full model:
Source   |    SS      df     MS
Model    | 338.800     6   56.467
Residual |  16.133     8    2.017
Total    | 354.933    14   25.352

Regression excerpt for b2 = b3 = … = 0:
Source   |    SS      df     MS
Model    |  26.533     2   13.267
Residual | 328.400    12   27.367
Total    | 354.933    14   25.352

Regression excerpt for b = c = 0:
Source   |    SS      df     MS
Model    | 312.267     4   78.067
Residual |  42.667    10    4.267
Total    | 354.933    14   25.352
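A quick check that the partial F tests built from these three excerpts reproduce the two-way ANOVA F statistics (numbers hardcoded from the output above):

```python
# Full dummy-variable model: SS_residual = 16.133 on 8 degrees of freedom
ss_res_full = 16.133
ms_error_full = ss_res_full / 8

# Blocks test: restrict b2 = b3 = b4 = b5 = 0 (C = 4 constraints)
F_blocks = ((328.400 - ss_res_full) / 4) / ms_error_full

# Treatments test: restrict b = c = 0 (C = 2 constraints)
F_treat = ((42.667 - ss_res_full) / 2) / ms_error_full

print(round(F_blocks, 2), round(F_treat, 2))  # 38.71 and 6.58, as in the ANOVA table
```

The match confirms that two-way ANOVA is just a special case of regression with dummy variables.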
Select F Distribution 5% Critical Values
[Same 5% F critical value table as before, extended with a numerator d.f. = 9 column. The relevant entries for the two-way ANOVA tests are F(0.05, 4, 8) = 3.84 and F(0.05, 2, 8) = 4.46.]
3 Seconds of Calculus
Regression Coefficients
• y = b0 + b1x (linear form)
  • A 1-unit change in x changes y by b1
• log(y) = b0 + b1x (semi-log form)
  • A 1-unit change in x changes y by b1 × 100 percent
• log(y) = b0 + b1 log(x) (double-log form)
  • A 1-percent change in x changes y by b1 percent
Log Regression Coefficients
• wage = 9.05 + 1.39 union
  • Predicted wage is $1.39 higher for unionized workers (on average)
• log(wage) = 2.20 + 0.15 union
  • Semi-elasticity: predicted wage is approximately 15% higher for unionized workers (on average)
• log(wage) = 1.61 + 0.30 log(profits)
  • Elasticity: a one percent increase in profits increases predicted wages by approximately 0.3 percent
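The "approximately 15%" reading is the small-coefficient approximation; the exact implied difference from a semi-log coefficient b is e^b − 1, which can be checked directly:

```python
import math

b = 0.15  # union coefficient from the semi-log wage regression above

# Exact implied proportional wage difference for union vs. nonunion
exact = math.exp(b) - 1
print(round(100 * exact, 1))  # 16.2 percent, close to the 15% shorthand
```

For small coefficients (roughly |b| < 0.1) the shorthand is fine; for larger ones the e^b − 1 conversion is worth doing.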
Multicollinearity
Auto repair records, weight, and engine size
Number of obs = 69, F(2, 66) = 6.84, Prob > F = 0.0020
R-squared = 0.1718, Adj R-squared = 0.1467, Root MSE = .91445

repair  | Coef.      Std. Err.    t      P>|t|
weight  | -.00017     .00038    -0.41    0.685
engine  | -.00313     .00328    -0.96    0.342
_cons   | 4.50161     .61987     7.26    0.000
Multicollinearity
• Two (or more) independent variables are so highly correlated that a multiple regression can't disentangle the unique contributions of each
  • Large standard errors and lack of statistical significance for individual coefficients
  • But joint significance
• Identifying multicollinearity
  • Some say “rule of thumb |r| > 0.70” (or 0.80)
  • But better to look at results
• OK for prediction
• Bad for assessing theory
Prediction With Multicollinearity
Prediction at the mean (weight = 3019 and engine = 197)

Model for prediction     Predicted Repair (Mean)   Lower 95% Limit   Upper 95% Limit
Multiple Regression             3.411                   3.191             3.631
Weight Only                     3.412                   3.193             3.632
Engine Only                     3.410                   3.192             3.629
Dummy Dependent Variables
• Dummy dependent variables
  • y = b0 + b1x1 + … + bkxk + e, where y is a {0, 1} indicator variable
• Examples
  • Do you intend to quit? yes/no
  • Did the worker receive training? yes/no
  • Do you think the President is doing a good job? yes/no
  • Was there a strike? yes/no
  • Did the company go bankrupt? yes/no
Linear Probability Model
• Mathematically/computationally, we can estimate the regression as usual (the monkeys won't know the difference)
• This is called a “linear probability model”
  • The right-hand side is linear
  • And is estimating probabilities: P(y=1) = b0 + b1x1 + … + bkxk
• b1 = 0.15 (for example) means that a one-unit change in x1 increases the probability that y = 1 by 0.15 (fifteen percentage points)
Linear Probability Model
• Excel won't know the difference, but perhaps it should
• Linear probability model problems
  • σe² = P(y=1) × [1 − P(y=1)]
  • But P(y=1) = b0 + b1x1 + … + bkxk
  • So σe² is not constant (heteroskedasticity)
  • Predicted probabilities are not bounded by 0, 1
  • R² is not an accurate measure of predictive ability
    • Can use a pseudo-R² measure, such as percent correctly predicted
Logit Model & Probit Model
• Solution to these problems is to use nonlinear functional forms that bound P(y=1) between 0 and 1
• Logit model (logistic regression):
  P(y=1) = e^(b0 + b1x1 + … + bkxk) / (1 + e^(b0 + b1x1 + … + bkxk))
  • Recall, ln(x) = a when e^a = x
• Probit model:
  P(y=1) = Φ(b0 + b1x1 + … + bkxk)
  • where Φ is the normal cumulative distribution function
Logit Model & Probit Model
• Nonlinear, so need a statistical package to do the calculations
• Can do individual (z-tests, not t-tests) and joint statistical testing as with other regressions
  • Also confidence intervals
• Need to convert coefficients to marginal effects for interpretation
• Should be aware of these models
  • Though in many cases, a linear probability model works just fine
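The two link functions are easy to sketch with the standard library, which makes the bounding property concrete (statistics.NormalDist supplies the normal CDF Φ):

```python
import math
from statistics import NormalDist

def logit_prob(xb):
    # Logistic CDF: P(y=1) = e^xb / (1 + e^xb), always strictly in (0, 1)
    return 1 / (1 + math.exp(-xb))

def probit_prob(xb):
    # Probit: P(y=1) = Phi(xb), the standard normal CDF
    return NormalDist().cdf(xb)

# Unlike the linear probability model, extreme index values stay bounded
for xb in (-3.0, 0.0, 3.0):
    p_logit, p_probit = logit_prob(xb), probit_prob(xb)
    assert 0 < p_logit < 1 and 0 < p_probit < 1

print(round(logit_prob(0.0), 2), round(probit_prob(0.0), 2))  # both 0.5 at xb = 0
```

Both curves are S-shaped and agree closely near the middle; they differ mainly in the tails, which is why logit and probit results are usually similar in practice.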
Example
Dep. var.: 1 if you know of the FMLA, 0 otherwise
Probit estimates: Number of obs = 1189, LR chi2(14) = 232.39, Prob > chi2 = 0.0000
Log likelihood = -707.94377, Pseudo R2 = 0.1410

FMLAknow  | Coef.    Std. Err.    z      P>|z|   [95% Conf. Int]
union     |  .238      .101      2.35    0.019     .039     .436
age       | -.002      .018     -0.13    0.897    -.038     .033
agesq     |  .135      .219      0.62    0.536    -.293     .564
nonwhite  | -.571      .098     -5.80    0.000    -.764    -.378
income    | 1.465      .393      3.73    0.000     .696    2.235
incomesq  | -5.854    2.853     -2.05    0.040   -11.45    -.262
[other controls omitted]
_cons     | -1.188     .328     -3.62    0.000   -1.831    -.545
Marginal Effects
• For numerical interpretation/prediction, need to convert coefficients to marginal effects
• Example: logit model
  • ln[P(y=1) / (1 − P(y=1))] = b0 + b1x1 + … + bkxk
  • So b1 gives the effect on ln[P/(1−P)], the log odds, not on P(y=1)
• Probit is similar
• Can re-arrange to find the effect on P(y=1)
  • Usually do this at the sample means
Marginal Effects
For numerical interpretation/prediction, need to convert coefficients to marginal effects.
Probit estimates: Number of obs = 1189, LR chi2(14) = 232.39, Prob > chi2 = 0.0000
Log likelihood = -707.94377, Pseudo R2 = 0.1410

FMLAknow  | dF/dx    Std. Err.    z      P>|z|   [95% Conf. Int]
union     |  .095      .040      2.35    0.019     .017     .173
age       | -.001      .007     -0.13    0.897    -.015     .013
agesq     |  .054      .087      0.62    0.536    -.117     .225
nonwhite  | -.222      .036     -5.80    0.000    -.293    -.151
income    |  .585      .157      3.73    0.000     .278     .891
incomesq  | -2.335    1.138     -2.05    0.040   -4.566    -.105
[other controls omitted]
But a Linear Probability Model Is OK, Too
                  Probit Coeff.   Probit Marginal   Regression
Union                 0.238           0.095            0.084
                     (0.101)         (0.040)          (0.035)
Nonwhite             -0.571          -0.222           -0.192
                     (0.098)         (0.037)          (0.033)
Income                1.465           0.585            0.442
                     (0.393)         (0.157)          (0.091)
Income Squared       -5.854          -2.335           -1.354
                     (2.853)         (1.138)          (0.316)

So regression is usually OK, but you should still be familiar with logit and probit methods.
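The coefficient-to-marginal-effect conversion behind the middle column follows the standard formulas, evaluated at some index value xb (usually the sample means). A sketch with placeholder values, not the actual FMLA model:

```python
import math
from statistics import NormalDist

def logit_marginal(b, xb):
    # Logit: dP/dx = b * P * (1 - P), with P the logistic CDF at xb
    p = 1 / (1 + math.exp(-xb))
    return b * p * (1 - p)

def probit_marginal(b, xb):
    # Probit: dP/dx = b * phi(xb), with phi the standard normal density
    return b * NormalDist().pdf(xb)

# Illustrative coefficient b = 0.5 evaluated at xb = 0 (i.e., P = 0.5)
print(round(logit_marginal(0.5, 0.0), 3))   # 0.125: the familiar "divide by 4" rule
print(round(probit_marginal(0.5, 0.0), 3))  # 0.199: b times phi(0) = 0.3989
```

Because P(1−P) and φ(xb) are largest near P = 0.5, marginal effects shrink toward zero in the tails, which the constant-slope linear probability model cannot capture.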