Logistic Regression Database Marketing Instructor: N. Kumar

Logistic Regression vs TGDA
- Two-Group Discriminant Analysis (TGDA)
  - Implicitly assumes that the Xs are multivariate normally (MVN) distributed
  - This assumption is violated if the Xs are categorical variables
- Logistic Regression does not impose any restriction on the distribution of the Xs
- Logistic Regression is the recommended approach if at least some of the Xs are categorical variables

Data

Contingency Table
Type of Stock    Large   Small   Total
Preferred           10       2      12
Not Preferred        1      11      12
Total               11      13      24

Basic Concepts
- Probability of being a preferred stock = 12/24 = 0.5
- Probability that a company’s stock is preferred given that the company is large = 10/11 = 0.909
- Probability that a company’s stock is preferred given that the company is small = 2/13 = 0.154

Concepts … contd.
- Odds of a preferred stock = 12/12 = 1
- Odds of a preferred stock given that the company is large = 10/1 = 10
- Odds of a preferred stock given that the company is small = 2/11 = 0.182
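
The probabilities and odds on the two slides above can be recomputed directly from the contingency table. The following is a minimal sketch; the dictionary layout and the function names are illustrative choices, not part of the original slides.

```python
# Counts from the contingency table: (stock type, company size) -> count
counts = {
    ("preferred", "large"): 10,
    ("preferred", "small"): 2,
    ("not_preferred", "large"): 1,
    ("not_preferred", "small"): 11,
}

def prob_preferred(size=None):
    """P(preferred), optionally conditional on company size."""
    sizes = ["large", "small"] if size is None else [size]
    preferred = sum(counts[("preferred", s)] for s in sizes)
    total = preferred + sum(counts[("not_preferred", s)] for s in sizes)
    return preferred / total

def odds_preferred(size=None):
    """Odds of preferred = count(preferred) / count(not preferred)."""
    sizes = ["large", "small"] if size is None else [size]
    preferred = sum(counts[("preferred", s)] for s in sizes)
    not_preferred = sum(counts[("not_preferred", s)] for s in sizes)
    return preferred / not_preferred

for size in (None, "large", "small"):
    label = "overall" if size is None else size
    print(f"{label:8s} prob={prob_preferred(size):.3f} odds={odds_preferred(size):.3f}")
```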

Odds and Probability
- Odds(Event) = Prob(Event)/(1 - Prob(Event))
- Prob(Event) = Odds(Event)/(1 + Odds(Event))
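
The two conversions above are inverses of each other. A small sketch, with function names chosen here for illustration, shows the round trip using the large-company figures (probability 10/11, odds 10).

```python
def odds_from_prob(p):
    """Odds(Event) = Prob(Event) / (1 - Prob(Event)); requires 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Prob(Event) = Odds(Event) / (1 + Odds(Event)); requires odds >= 0."""
    return odds / (1.0 + odds)

# Round trip using the large-company figures: p = 10/11, odds = 10
p_large = 10 / 11
print(f"odds from p = 10/11: {odds_from_prob(p_large):.3f}")
print(f"p from odds = 10:    {prob_from_odds(10):.3f}")
```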

Logistic Regression
- Take the natural log of the odds:
  - ln(odds(Preferred|Large)) = ln(10) = 2.303
  - ln(odds(Preferred|Small)) = ln(0.182) = -1.704
- Combining these relationships, with Size = 1 for a large company and Size = 0 for a small company:
  - ln(odds(Preferred|Size)) = -1.704 + 4.007*Size
- Log of the odds is a linear function of size
- The coefficient of size can be interpreted like the coefficient in regression analysis
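
The intercept and slope above can be recovered from the two conditional odds. This sketch assumes the coding Size = 1 for large and Size = 0 for small, which is what the combined equation implies. It uses the exact odds 2/11 rather than the rounded 0.182, so the intercept comes out near -1.705 instead of -1.704.

```python
import math

# Odds of a preferred stock by company size, from the contingency table
odds_large = 10 / 1
odds_small = 2 / 11

# With Size coded 1 = large, 0 = small (an assumed coding), the intercept is
# the log-odds for small companies and the slope is the change in log-odds
# when moving from small to large.
intercept = math.log(odds_small)
slope = math.log(odds_large) - math.log(odds_small)

def log_odds(size):
    """ln(odds(Preferred | Size)) as a linear function of the 0/1 size dummy."""
    return intercept + slope * size

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"ln(odds | large) = {log_odds(1):.3f}")
print(f"ln(odds | small) = {log_odds(0):.3f}")
```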

Interpretation
- Positive sign: ln(odds) is increasing in the size of the company, i.e., a large company is more likely to have a preferred stock vis-à-vis a small company
- Magnitude of the coefficient gives a measure of how much more likely

General Model
- ln(odds) = β0 + β1*X1 + β2*X2 + … + βk*Xk
- Recall:
  - Odds = p/(1 - p)
  - ln(p/(1 - p)) = β0 + β1*X1 + β2*X2 + … + βk*Xk    (2)
  - p = 1/(1 + e^(-(β0 + β1*X1 + β2*X2 + … + βk*Xk)))    (1)
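
Solving the log-odds equation for p gives the logistic function. The sketch below is one way to evaluate it; the helper names `logistic` and `predict_prob` are illustrative, and the coefficients in the check are the one-variable values from the stock example.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability p = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_prob(betas, xs):
    """p for intercept betas[0] and slopes betas[1:], given covariates xs."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return logistic(z)

# Check against the one-variable stock example (coefficients from the slides)
print(f"large company: {predict_prob([-1.704, 4.007], [1]):.3f}")
print(f"small company: {predict_prob([-1.704, 4.007], [0]):.3f}")
```

The two printed probabilities should match the conditional probabilities read off the contingency table (10/11 and 2/13) to three decimals.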

Logistic Function

Estimation
- Coefficients in the linear regression model are estimated by minimizing the sum of squared errors
- Since p is non-linear in the parameters, we need a non-linear estimation technique
  - Maximum-Likelihood Approach
  - Non-Linear Least Squares

Maximum Likelihood Approach
- Conditional on the parameter β, write out the probability of observing the data
- Write this probability out for each observation
- Multiply the probabilities of the individual observations to get the joint probability of observing the data conditional on β
- Find the β that maximizes the conditional probability of realizing this data
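
The steps above can be sketched for the stock example. This is a minimal illustration, not the procedure of any particular statistics package: it rebuilds the 24 observations from the contingency table (assuming Size = 1 for large, 0 for small), works with the log of the joint probability, and maximizes it with Newton-Raphson, which is one of several possible optimizers.

```python
import math

# The 24 observations implied by the contingency table:
# (size, preferred) with size 1 = large, 0 = small; preferred 1 = yes, 0 = no
data = [(1, 1)] * 10 + [(1, 0)] * 1 + [(0, 1)] * 2 + [(0, 0)] * 11

def prob(b0, b1, x):
    """P(preferred = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1):
    """Log of the product of the per-observation probabilities."""
    total = 0.0
    for x, y in data:
        p = prob(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit(iterations=25):
    """Newton-Raphson ascent on the log-likelihood for (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient g and negative Hessian h of the log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system h * step = g and take the Newton step
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit()
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, -2 Log L = {-2 * log_likelihood(b0, b1):.3f}")
```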

Logistic Regression
- Logistic Regression with one categorical explanatory variable reduces to an analysis of the contingency table

Interpretation of Results
- Look at the -2 Log L statistic
  - Intercept only: 33.271
  - Intercept and Covariates: 17.864
  - Difference: 15.407 with 1 DF (p = 0.0001)
- The difference is significant: adding the size variable substantially improves the fit over the intercept-only model
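
The -2 Log L figures above can be reproduced from the contingency table. The sketch relies on the fact that, with a single two-level covariate, the fitted probabilities equal the observed proportions in each group. The chi-square tail probability uses the identity P(X > x) = erfc(sqrt(x/2)), which holds for 1 DF only; the function name is illustrative.

```python
import math

# Cell counts from the contingency table
n_pref_large, n_not_large = 10, 1
n_pref_small, n_not_small = 2, 11
n_pref = n_pref_large + n_pref_small
n_total = n_pref + n_not_large + n_not_small

def neg2_log_l(groups):
    """-2 Log L when each group is fitted with its observed event proportion.

    groups: list of (events, non_events) tuples.
    """
    log_l = 0.0
    for events, non_events in groups:
        p = events / (events + non_events)
        log_l += events * math.log(p) + non_events * math.log(1.0 - p)
    return -2.0 * log_l

# Intercept only: one common probability for all 24 stocks
intercept_only = neg2_log_l([(n_pref, n_total - n_pref)])
# Intercept and size: a separate probability for large and small companies
with_size = neg2_log_l([(n_pref_large, n_not_large), (n_pref_small, n_not_small)])

lr_stat = intercept_only - with_size
# Upper-tail probability of a chi-square with 1 DF: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"Intercept only:           {intercept_only:.3f}")
print(f"Intercept and covariates: {with_size:.3f}")
print(f"Difference: {lr_stat:.3f}, p = {p_value:.4f}")
```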

Do the Variables Have a Significant Impact?
- Like testing whether the coefficients in the regression model are different from zero
- Look at the output from Analysis of Maximum Likelihood Estimates
- Loosely, the column Pr > Chi-Square gives the probability of obtaining an estimate at least as extreme as the one in the Parameter Estimate column if the true coefficient were zero; if this value is < 0.05, the estimate is considered significant

Other things to Look for
- Akaike’s Information Criterion (AIC) and Schwarz’s Criterion (SC): these are like adjusted R², so there is a penalty for having additional covariates
- The larger the difference between the second and third columns of the output (Intercept Only vs. Intercept and Covariates), the better the model fit
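
The penalty for additional covariates can be made concrete. The sketch below assumes the standard definitions AIC = -2 Log L + 2k and SC = -2 Log L + k*ln(n), with k parameters and n observations; the slides do not state the formulas, and the resulting AIC and SC values are computed here rather than taken from the original output.

```python
import math

def aic(neg2_log_l, n_params):
    """Akaike's Information Criterion: -2 Log L + 2 * (number of parameters)."""
    return neg2_log_l + 2 * n_params

def sc(neg2_log_l, n_params, n_obs):
    """Schwarz's Criterion: -2 Log L + (number of parameters) * ln(n_obs)."""
    return neg2_log_l + n_params * math.log(n_obs)

n_obs = 24
# -2 Log L values as reported for the stock example
models = {
    "Intercept only": (33.271, 1),            # 1 parameter: intercept
    "Intercept and covariates": (17.864, 2),  # 2 parameters: intercept, size
}

for name, (neg2ll, k) in models.items():
    print(f"{name:26s} AIC={aic(neg2ll, k):.3f} SC={sc(neg2ll, k, n_obs):.3f}")
```

Both criteria are smaller for the model with the size covariate, so the improvement in fit outweighs the penalty for the extra parameter.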

Interpretation of the Parameter Estimates
- ln(p/(1 - p)) = -1.705 + 4.007*Size
- p/(1 - p) = e^(-1.705) * e^(4.007*Size)
- For a unit increase in size, the odds of being a favored stock are multiplied by e^4.007 = 54.982
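
Exponentiating the fitted equation shows why the coefficient acts as a multiplier on the odds. A short sketch, assuming Size is coded 0 for small and 1 for large:

```python
import math

b0, b1 = -1.705, 4.007  # estimates from the slide

def odds(size):
    """Odds of a favored stock: e^(b0) * e^(b1 * size)."""
    return math.exp(b0) * math.exp(b1 * size)

odds_ratio = odds(1) / odds(0)
print(f"odds(small) = {odds(0):.3f}")
print(f"odds(large) = {odds(1):.3f}")
print(f"odds ratio  = {odds_ratio:.3f}")
```

The ratio is close to the value obtained directly from the contingency table, 10 / (2/11) = 55; the small gap comes from rounding the coefficient to three decimals.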

Predicted Probabilities and Observed Responses
- The response variable (success) classifies an observation as an event or a no-event
- A concordant pair is a pair of one event and one no-event in which the event has a higher PHAT (predicted probability) than the no-event
- The higher the percentage of concordant pairs, the better
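
The concordance percentage can be computed by enumerating every event/no-event pair. The sketch below uses the stock example with PHAT equal to 10/11 for large and 2/13 for small companies. The discordant and tied categories are the usual companions of the concordant count and are not named on the slide; statistical packages may also group PHAT values before comparing, so reported percentages can differ slightly.

```python
# Predicted probabilities (PHAT) from the fitted model: 10/11 for large
# companies, 2/13 for small ones. Each entry is (phat, is_event).
phat_large, phat_small = 10 / 11, 2 / 13
observations = (
    [(phat_large, True)] * 10     # large, preferred
    + [(phat_large, False)] * 1   # large, not preferred
    + [(phat_small, True)] * 2    # small, preferred
    + [(phat_small, False)] * 11  # small, not preferred
)

def pair_percentages(obs):
    """Percent concordant, discordant, and tied over all event/no-event pairs."""
    events = [p for p, is_event in obs if is_event]
    non_events = [p for p, is_event in obs if not is_event]
    concordant = discordant = tied = 0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1
            elif pe < pn:
                discordant += 1
            else:
                tied += 1
    total = len(events) * len(non_events)
    return tuple(100.0 * c / total for c in (concordant, discordant, tied))

concordant, discordant, tied = pair_percentages(observations)
print(f"concordant={concordant:.1f}% discordant={discordant:.1f}% tied={tied:.1f}%")
```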

Classification
- For a set of new observations where you have information on size alone
- You can use the model to predict the probability that success = 1, i.e., the stock is favored
- If PHAT > 0.5, success = 1; else success = 2
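
The classification rule can be sketched as follows. The coefficients are the estimates from the stock example; the list of new observations is hypothetical and only illustrates the mechanics, with Size assumed to be coded 1 for large and 0 for small.

```python
import math

B0, B1 = -1.705, 4.007  # estimates from the fitted model in the slides

def classify(size, cutoff=0.5):
    """Return (phat, success) with success = 1 if phat > cutoff, else 2."""
    phat = 1.0 / (1.0 + math.exp(-(B0 + B1 * size)))
    return phat, (1 if phat > cutoff else 2)

# Hypothetical new observations: 1 = large company, 0 = small company
new_sizes = [1, 0, 0, 1]
for size in new_sizes:
    phat, success = classify(size)
    print(f"size={size} PHAT={phat:.3f} success={success}")
```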

Logistic Regression with multiple independent variables
- Independent variables are a mixture of continuous and categorical variables

Data

General Model
- ln(odds) = β0 + β1*Size + β2*FP
- ln(p/(1 - p)) = β0 + β1*Size + β2*FP
- p = 1/(1 + e^(-(β0 + β1*Size + β2*FP)))

Estimation & Interpretation of the Results
- Identical to the case with one categorical variable

Summary
- Logistic Regression or Discriminant Analysis
  - Techniques differ in underlying assumptions about the distribution of the explanatory (independent) variables
- Use logistic regression if you have a mix of categorical and continuous variables