Logistic Regression Modeling with Dichotomous Dependent Variables
A New Type of Model…
• Dichotomous Dependent Variable:
 – Why did someone vote for Bush or Kerry?
 – Why did residents own or rent their houses?
 – Why do some people drink alcohol and others don’t?
 – What determined if a household owned a car?
Dependent Variable…
• Is binary, with a yes or a no answer
• Can be coded 1 for yes and 0 for no
• There are no other valid responses
Problem: OLS regression does not model the relationship well; its predictions are not bounded by 0 and 1
Solution: Use a Different Functional Form
• The properties we need:
 – The model should be bounded by 0 and 1
 – The model should estimate a value for the dependent variable in terms of the probability of being in one category or the other, e.g., an owner or a renter; a Bush voter or a Kerry voter
Solution, cont.
• We want to know the probability, p, that a particular case falls in the 0 or the 1 category.
• We want a model that gives good estimates of this probability, or put another way, of how likely a particular case is to be a 0 or a 1.
Solution: A Logistic Curve
The Logistic Function
• The probability that a case is a 0 or a 1 is distributed according to the logistic function.
Remember probabilities…
• Probabilities range from 0 to 1.
• Probability: the frequency of being in one category relative to the total of all categories.
 – Example: The probability that the first card dealt in a card game is the queen of hearts is 1/52 (one in 52).
• It does us no good to “predict” a value of 0.5 as in the linear regression model.
But can we manipulate probabilities to estimate the logistic function?
• Steps:
 – Convert probabilities to odds, P/(1 - P)
 – Convert odds to log odds, or logits
Manipulating probabilities to estimate the logistic function

Case    P       1-P     P/(1-P)    ln(P/(1-P))
1       0.010   0.990    0.010     -4.595
2       0.050   0.950    0.053     -2.944
3       0.100   0.900    0.111     -2.197
4       0.200   0.800    0.250     -1.386
5       0.300   0.700    0.429     -0.847
6       0.400   0.600    0.667     -0.405
7       0.500   0.500    1.000      0.000
8       0.600   0.400    1.500      0.405
9       0.700   0.300    2.333      0.847
10      0.800   0.200    4.000      1.386
11      0.900   0.100    9.000      2.197
12      0.950   0.050   19.000      2.944
13      0.990   0.010   99.000      4.595
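The table above can be reproduced with a few lines of Python (a sketch for illustration; the slides’ own output comes from a statistics package):

```python
import math

# For each probability p, compute 1 - p, the odds p / (1 - p),
# and the log odds (logit) ln(p / (1 - p)).
probs = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
         0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

rows = [(p, 1 - p, p / (1 - p), math.log(p / (1 - p))) for p in probs]
for p, q, odds, logit in rows:
    print(f"{p:5.3f}  {q:5.3f}  {odds:7.3f}  {logit:7.3f}")
```

Note the symmetry: the logit of p and the logit of 1 - p differ only in sign, and the logit of 0.5 is exactly 0.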
Logistic Function
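The logistic function is the inverse of the logit transformation shown above. A quick Python sketch (illustrative only):

```python
import math

def logistic(x):
    # Maps any real number x into the interval (0, 1).
    return 1 / (1 + math.exp(-x))

# logistic() undoes the logit: ln(p / (1 - p)) -> p
for p in (0.1, 0.5, 0.9):
    logit = math.log(p / (1 - p))
    print(round(logistic(logit), 3))   # recovers 0.1, 0.5, 0.9
```

This is why the model is bounded by 0 and 1: no matter how large or small the log odds, the logistic function returns a valid probability.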
Steps….
• Log odds = a + bx
• Odds = exp(a + bx)
• Probability = odds/(1 + odds), which is distributed according to the logistic function
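The three steps above, sketched in Python with hypothetical coefficients (a and b are made-up values for illustration, not from the model that follows):

```python
import math

a, b = -2.0, 0.5   # hypothetical intercept and slope
x = 3.0            # a value of the independent variable

log_odds = a + b * x          # step 1: log odds = a + bx
odds = math.exp(log_odds)     # step 2: exponentiate to get the odds
p = odds / (1 + odds)         # step 3: convert odds to a probability
print(round(p, 3))            # → 0.378
```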
An Example
• Determinants of Homeownership:
 – Age of the householder (and its square)
 – Building type
 – Year the house was built
 – Householder’s ethnicity
 – Occupational status scale
Calculating the Model
• Maximum likelihood estimation (not OLS)
• Estimates of the b’s, standard errors, t-ratios, and p-values for the coefficients
• Coefficients are estimates of the impact of each independent variable on the logit of the dependent variable
Logistic Regression Model

Parameter              Estimate   S.E.    t-ratio   p-value
1 CONSTANT              -6.976    1.501   -4.647    0.000
2 AGE                    0.250    0.060    4.132    0.000
3 AGESQ                 -0.002    0.001   -3.400    0.001
4 BLDGTYP2$_cottage      0.036    0.277    0.131    0.895
5 BLDGTYP2$_duplex      -1.432    0.328   -4.363    0.000
6 YEAR                   0.061    0.022    2.757    0.006
7 GERMAN                 0.706    0.264    2.677    0.007
8 POLISH                 0.777    0.422    1.841    0.066
9 OCCSCALE               0.190    0.091    2.074    0.038
Logistic Regression Model, cont.

Parameter              Odds Ratio   Upper   Lower
2 AGE                     1.284      1.445   1.140
3 AGESQ                   0.998      0.999   0.997
4 BLDGTYP2$_cottage       1.037      1.784   0.603
5 BLDGTYP2$_duplex        0.239      0.454   0.125
6 YEAR                    1.063      1.109   1.018
7 GERMAN                  2.026      3.398   1.208
8 POLISH                  2.175      4.972   0.951
9 OCCSCALE                1.209      1.446   1.011

Log likelihood of constants-only model = LL(0) = -303.864
2*[LL(N) - LL(0)] = 85.180 with 8 df, Chi-sq p-value = 0.000
McFadden’s Rho-Squared = 0.140
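The fit statistics at the bottom can be checked in a few lines (a sketch: LL(N) is recovered from the reported chi-square rather than taken from the output):

```python
ll0 = -303.864            # log likelihood, constants-only model
chi2 = 85.180             # 2 * [LL(N) - LL(0)], the model chi-square
lln = ll0 + chi2 / 2      # implied log likelihood of the fitted model
rho_sq = 1 - lln / ll0    # McFadden's rho-squared
print(round(rho_sq, 2))   # → 0.14
```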
Converting Odds Ratios to Probabilities
• The odds are P/(1 - P); an odds ratio compares the odds of two groups.
• For Germans, compared with the omitted category (Americans and other ethnicities), controlling for other variables, the odds of owning are 2.026 times greater.
• Germans are more likely to own houses than Americans.
• Can we be more specific?
Calculating the Probability of a Case
• Log odds of homeownership = -6.976 + 0.250*Age - 0.002*Agesquared + 0.036*cottage - 1.432*duplex + 0.061*Year + 0.706*German + 0.777*Polish + 0.190*occscale
• Plug in values and solve the equation.
• Exponentiate the result to get the odds.
• Convert the odds to a probability for the case.
Calculations
• Log odds of homeownership = -6.976 + 0.250*Age - 0.002*Agesquared + 0.036*cottage - 1.432*duplex + 0.061*Year + 0.706*German + 0.777*Polish + 0.190*occscale
• For a 40-year-old skilled, American-born worker living in a residence built in 1892:
• Log odds of homeownership = -6.976 + 0.250*40 - 0.002*1600 + 0.061*5 + 0.190*3
• Log odds = 0.699
Calculations, cont.
• Log odds = 0.699
• Odds = antilog (exponentiation) of 0.699 = 2.012
• Odds = P/(1 - P) = 2.012
• Solve for P: P = 2.012/3.012 = 0.67
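The same arithmetic in Python (a sketch using the coefficients reported above):

```python
import math

# 40-year-old American-born skilled worker, residence built in 1892
# (Year coded as 5 and occscale as 3, following the slides)
log_odds = (-6.976 + 0.250 * 40 - 0.002 * 40 ** 2
            + 0.061 * 5 + 0.190 * 3)
odds = math.exp(log_odds)   # the odds of owning
p = odds / (1 + odds)       # the probability of owning
print(round(log_odds, 3), round(p, 2))   # → 0.699 0.67
```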
More calculations….
• How about a 40-year-old German skilled worker in an 1892 residence?
• Log odds = -6.976 + 0.250*40 - 0.002*1600 + 0.061*5 + 0.706 + 0.190*3 = 1.405
• Note that this is just the American worker’s log odds plus the German coefficient: 0.699 + 0.706 = 1.405.
• Equivalently, on the odds scale, multiply the odds by the odds ratio for “German”: 2.012 * 2.026 = 4.076 = exp(1.405).
More calculations
• Convert the log odds to odds: exp(1.405) = 4.076.
• Odds = 4.076 = P/(1 - P).
• Solve for P: P = 0.803.
• So being German rather than American raises the probability of homeownership from 0.67 to 0.803, about 13 percentage points.
More calculations
• For a 30-year-old American worker in a residence built in 1892:
• Log odds = -6.976 + 0.250*30 - 0.002*900 + 0.061*5 + 0.190*3 = -0.401
• Odds = antilog of (-0.401) = 0.670
• Probability of ownership = 0.670/1.670 = 0.401
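The worked cases can be wrapped in one function (a sketch; variable coding follows the slides, e.g. Year = 5 for an 1892 residence and occscale = 3 for a skilled worker):

```python
import math

def p_own(age, year, occ, german=0, polish=0, cottage=0, duplex=0):
    # Probability of homeownership from the fitted logit model above.
    log_odds = (-6.976 + 0.250 * age - 0.002 * age ** 2
                + 0.036 * cottage - 1.432 * duplex
                + 0.061 * year + 0.706 * german
                + 0.777 * polish + 0.190 * occ)
    return 1 / (1 + math.exp(-log_odds))

print(round(p_own(40, 5, 3), 3))            # 40-year-old American → 0.668
print(round(p_own(40, 5, 3, german=1), 3))  # 40-year-old German   → 0.803
print(round(p_own(30, 5, 3), 3))            # 30-year-old American → 0.401
```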
Classification Table

Model Prediction Success Table

                   Predicted Response   Predicted Reference   Actual Total
Actual Response         281.647               85.353            367.000
Actual Reference         85.353               58.647            144.000
Pred. Total             367.000              144.000            511.000
Correct                   0.767                0.407
Success Ind.              0.049                0.125

Tot. Correct: 0.666
Sensitivity: 0.767        Specificity: 0.407
False Reference: 0.233    False Response: 0.593
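The summary rates follow from the four cells (a sketch assuming the cell layout above; the counts are model-expected frequencies, hence the fractional values):

```python
resp_resp, resp_ref = 281.647, 85.353   # actual response row
ref_resp, ref_ref = 85.353, 58.647      # actual reference row

total = resp_resp + resp_ref + ref_resp + ref_ref
sensitivity = resp_resp / (resp_resp + resp_ref)   # correct among actual responses
specificity = ref_ref / (ref_resp + ref_ref)       # correct among actual references
total_correct = (resp_resp + ref_ref) / total
false_reference = 1 - sensitivity
false_response = 1 - specificity
print(round(sensitivity, 3), round(specificity, 3),
      round(total_correct, 3))   # → 0.767 0.407 0.666
```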
Extending the Logic…
• Logistic regression can be extended to more than 2 categories of the dependent variable, for multi-response models
• Classification tables can be used to understand misclassified cases
• Results can be analyzed for patterns across different values of the independent variables