Logistic Regression

  • Slides: 28

When?
  • Just like multiple regression, but used when the dependent variable is dichotomous.
  • E.g. improved or not improved; successful or not successful.

Why?
  • Logistic regression can be used for classification purposes (it includes χ²).
  • It gives the probability of an effect (outcome) and evaluates the risk (odds).

Why not perform a discriminant analysis?
  • Probability of success outside [0, 1]
  • Normality

Why not perform a multiple regression?
  • Probability of success outside [0, 1]
  • Homoscedasticity
  • Normality
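The "probability of success outside [0, 1]" objection to multiple regression can be seen in a small sketch. The ages and outcomes below are invented for illustration (they are not the data behind these slides), and `fit_line` and `linear_prediction` are hypothetical helper names. An ordinary least-squares line fitted to a 0/1 outcome gives predictions below 0 and above 1 at the extremes.

```python
# Hypothetical data, invented for illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]  # binary DV coded 0/1


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(ages, disease)


def linear_prediction(age):
    """Fitted value of the straight line; not constrained to [0, 1]."""
    return intercept + slope * age


for age in (20, 50, 80):
    print(age, round(linear_prediction(age), 3))
```

With these made-up numbers the fitted "probability" is negative at age 20 and greater than 1 at age 80, which is the problem the logistic curve avoids.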

Example
  • Suppose we want to predict whether someone has coronary disease (the dependent variable, DV) using age in years (the independent variable, IV).
  • It is customary to code a binary DV as either 0 or 1.

The logistic curve
[Figure: the logistic curve, with its linear part and its nonlinear part labeled.]

The logistic curve

$$\hat{p} = \frac{e^{a + bX}}{1 + e^{a + bX}}$$

where $\hat{p}$ is the probability of a 1, $e$ is the base of the natural logarithm (about 2.718), and $a$ (the constant) and $b$ (the regression weight) are the parameters of the model. $b$ adjusts how quickly the probability changes when $X$ increases by a single unit. Because the relationship between $X$ and $\hat{p}$ is nonlinear, $b$ does not have a straightforward interpretation in this model, contrary to ordinary linear regression.
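As a minimal sketch of the logistic curve, the function below evaluates the probability of a 1 for a given predictor value. The names `logistic`, `a` (constant) and `b` (weight), and the parameter values, are assumptions chosen only for illustration; they are not estimates from the slides.

```python
import math


def logistic(x, a, b):
    """Probability of a 1 at predictor value x, for constant a and weight b."""
    z = a + b * x
    # Use the algebraically equivalent form that avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical parameter values, for illustration only.
a, b = -5.0, 0.1

for age in (50, 60, 80, 90):
    print(age, round(logistic(age, a, b), 3))
```

The same 10-unit step in X changes the probability by different amounts depending on where it starts (50 to 60 versus 80 to 90 here), which is why `b` cannot be read as a constant change in probability.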

Where did it come from?
  • Suppose we only know a person's age and we want to predict whether that person has coronary disease or not. We can talk about the probability of having the disease, or we can talk about the odds of having the disease. Let's say that the probability of not having the disease at a given age is 0.95. Then the odds of not having the disease are

$$\text{odds} = \frac{p}{1 - p} = \frac{0.95}{0.05} = 19$$

  • The odds of having the disease would be 0.05/0.95, or 1/19, or 0.0526. This asymmetry is unappealing, because the odds of having the disease should be the opposite of the odds of not having the disease.

Where did it come from?
  • We can take care of this asymmetry by using the natural logarithm, ln. The natural log of 19 is 2.9444 (ln(0.95/0.05) = 2.9444). The natural log of 1/19 is -2.9444 (ln(0.05/0.95) = -2.9444), so the log odds of having coronary disease are exactly the opposite of the log odds of not having it.
  • The logistic model is linear in the log odds: $\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = a + bX$, where $\hat{p}$ is the probability of a 1, $a$ is the constant and $b$ is the regression weight.

In terms of odds:

$$\frac{\hat{p}}{1 - \hat{p}} = e^{a + bX}$$

Solving for $\hat{p}$, in terms of probability:

$$\hat{p} = \frac{e^{a + bX}}{1 + e^{a + bX}}$$
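The odds and log-odds arithmetic in this example can be checked with a short sketch. The function names `odds`, `log_odds` and `probability` are hypothetical helpers introduced here; the 0.95 probability is the one used in the slides.

```python
import math


def odds(p):
    """Odds corresponding to a probability p, with 0 < p < 1."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))


def probability(logit):
    """Invert the log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


p_healthy = 0.95
p_disease = 1.0 - p_healthy

# Odds are asymmetric around 1; log odds are symmetric around 0.
print(round(odds(p_healthy), 4), round(odds(p_disease), 4))
print(round(log_odds(p_healthy), 4), round(log_odds(p_disease), 4))

# Going back from log odds recovers the original probability.
print(round(probability(log_odds(p_healthy)), 4))
```

Taking the logarithm turns the reciprocal pair 19 and 1/19 into values of equal size and opposite sign, which is the symmetry the derivation relies on.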

Finding the regression weights
  • In multiple regression, we wanted to minimize the residual sum of squares.
  • With the logistic curve, there is no closed-form solution that will produce least squares estimates of the parameters. We will use maximum (log) likelihood instead. A likelihood is a conditional probability: $P(Y \mid X)$, the probability of $Y$ given $X$. The idea is to choose the regression weights that give the maximum (log) likelihood of the data under the logistic curve.

Maximum likelihood:

$$L = \prod_{i=1}^{n} \hat{p}_i^{\,y_i} \, (1 - \hat{p}_i)^{1 - y_i}$$

Maximum log likelihood:

$$LL = \sum_{i=1}^{n} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln (1 - \hat{p}_i) \right]$$

where $y_i$ is the observed 0/1 outcome and $\hat{p}_i$ is the probability of a 1 given by the logistic curve for case $i$.

Finding the regression weights
  • The maximum of the log likelihood can then be found numerically using an optimization algorithm.
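One way to carry out this numerical maximization is Newton-Raphson, sketched below for a single predictor. The slides do not say which optimizer was used, so the choice of Newton-Raphson is an assumption. The data are invented for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical data, invented for illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def predicted_probability(a, b, x):
    """Logistic curve: probability of a 1 at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def log_likelihood(a, b, xs, ys):
    """Sum over cases of y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(a, b, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, max_iterations=50, tolerance=1e-10):
    """Maximize the log likelihood over (a, b) with Newton-Raphson steps."""
    a, b = 0.0, 0.0
    for _ in range(max_iterations):
        g_a = g_b = 0.0            # gradient of the log likelihood
        i_aa = i_ab = i_bb = 0.0   # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = predicted_probability(a, b, x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            i_aa += w
            i_ab += w * x
            i_bb += w * x * x
        det = i_aa * i_bb - i_ab * i_ab
        # Newton step: solve the 2x2 system (information) * step = gradient.
        step_a = (i_bb * g_a - i_ab * g_b) / det
        step_b = (i_aa * g_b - i_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tolerance:
            break
    return a, b


a_hat, b_hat = fit_logistic(ages, disease)
print(round(a_hat, 4), round(b_hat, 4))
print(round(log_likelihood(a_hat, b_hat, ages, disease), 4))
```

This sketch assumes the outcome is not perfectly separated by the predictor; with perfect separation the likelihood has no finite maximum and the iterations would not settle.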

Hypothesis testing
  • The idea is to compare the full model with the model containing only the constant, using chi-square:

$$\chi^2 = 2 \, (LL_{\text{full}} - LL_{\text{constant}})$$

    The degrees of freedom equal the number of predictors added. There is only 1 predictor here, so df = 1.
  • The resulting chi-square indicates that age can reliably distinguish people who have coronary disease from those who do not.
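The comparison of the full model with the constant-only model can be sketched as a likelihood-ratio test with 1 degree of freedom. The log-likelihood values below are hypothetical placeholders, not the values from the slides, and `likelihood_ratio_test` is a name introduced here.

```python
import math


def likelihood_ratio_test(ll_constant, ll_full):
    """Chi-square and p-value (1 df) for the full vs constant-only model.

    Assumes the models are nested, so ll_full >= ll_constant.
    """
    chi_square = 2.0 * (ll_full - ll_constant)
    # For 1 df, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical log likelihoods, for illustration only.
ll_constant = -68.3
ll_full = -58.0

chi_square, p_value = likelihood_ratio_test(ll_constant, ll_full)
print(round(chi_square, 2), p_value)
```

The `erfc` shortcut only applies to a single added predictor; with more predictors the chi-square distribution with the matching degrees of freedom would be needed.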

Hypothesis testing
  • We can use the same idea to build a regression model, comparing models with and without a given predictor.
  • Also, the Wald statistic can be used (Z test):

$$z = \frac{b}{SE(b)}$$

  • Sometimes a chi-square is used:

$$W = \left( \frac{b}{SE(b)} \right)^2$$

  • The standard errors come from the Fisher information matrix: $SE(b)$ is the square root of the corresponding diagonal element of its inverse.
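As a sketch of how the Wald statistic follows from the Fisher information matrix, the function below builds the 2x2 information matrix for a one-predictor logistic model, inverts it, and divides each estimate by its standard error. The ages and the estimates `a` and `b` are hypothetical, and `wald_tests` is a name introduced here.

```python
import math


def wald_tests(xs, a, b):
    """Wald z statistics for the constant a and weight b of a one-predictor model."""
    i_aa = i_ab = i_bb = 0.0
    for x in xs:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        w = p * (1.0 - p)          # each case contributes p(1 - p) to the information
        i_aa += w
        i_ab += w * x
        i_bb += w * x * x
    det = i_aa * i_bb - i_ab * i_ab
    # Diagonal of the inverse information matrix = variances of the estimates.
    se_a = math.sqrt(i_bb / det)
    se_b = math.sqrt(i_aa / det)
    results = {}
    for name, estimate, se in (("constant", a, se_a), ("weight", b, se_b)):
        z = estimate / se
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        results[name] = {"estimate": estimate, "se": se, "z": z, "chi_square": z * z, "p": p_value}
    return results


# Hypothetical ages and parameter estimates, for illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
for name, row in wald_tests(ages, a=-5.3, b=0.11).items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

The squared z in `chi_square` is the chi-square form of the same test, with 1 degree of freedom.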

Hypothesis testing
  • Also, the Wald statistic can be used, computed separately for the constant and for the IV (age, the predictor of coronary disease).

Explained variability
  • There are three popular measures that approximate the variance interpretation found in linear regression (R²).
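The three measures are not named in the surviving slide text. As an illustration only, the sketch below computes three commonly reported pseudo-R² measures (McFadden, Cox and Snell, Nagelkerke) from the log likelihoods of the constant-only and full models; whether these are the three the slide intended is an assumption. The log-likelihood values and sample size are hypothetical.

```python
import math


def pseudo_r_squared(ll_constant, ll_full, n):
    """Three common pseudo-R-squared measures from model log likelihoods."""
    mcfadden = 1.0 - ll_full / ll_constant
    cox_snell = 1.0 - math.exp(2.0 * (ll_constant - ll_full) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_constant / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}


# Hypothetical values, for illustration only.
measures = pseudo_r_squared(ll_constant=-68.3, ll_full=-58.0, n=100)
for name, value in measures.items():
    print(name, round(value, 3))
```

None of these is a true proportion of explained variance; they only approximate the R² interpretation, as the slide says.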

Odds Ratio (OR)
  • The odds ratio is the multiplicative increase (or decrease) in the odds of being in one outcome category when the value of the predictor increases by one unit: $OR = e^{b}$.
  • If the odds are the same across groups, then OR = 1. If the OR is greater than 1, there is an increased probability of being classified into the category. If the OR is smaller than 1, there is a decreased probability of being classified into the given category.
  • Here OR = 1.12. Thus, at each of my birthdays my odds of having coronary disease are multiplied by 1.12. In other words, each year my odds of developing coronary disease increase by 12 percent.
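Because the odds ratio is multiplicative, a change of several units compounds rather than adds. The sketch below assumes exp(b) is about 1.11731, which is how the unrounded value in the slides' 5-year calculation reads (1.12 after rounding); `odds_ratio` is a name introduced here.

```python
import math


def odds_ratio(b, units=1):
    """Multiplicative change in the odds for a `units` increase in the predictor."""
    return math.exp(b * units)


# Weight implied by exp(b) = 1.11731 (assumed reading of the slide value).
b = math.log(1.11731)

one_year = odds_ratio(b)
five_years = odds_ratio(b, units=5)

print(round(one_year, 2))                   # odds ratio per year of age
print(round(five_years, 2))                 # odds ratio for a 5-year difference
print(round((five_years - 1.0) * 100.0))    # percent increase in the odds over 5 years
```

Note that the percentage applies to the odds, not to the probability itself; the change in probability depends on the starting point on the logistic curve.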

Odds Ratio (OR)
  • For a 5-year age difference, say, the odds are multiplied by $\exp(b)^5 = 1.11731^5 = 1.74$, a 74% increase.

Classification table
  • Cutoff = 0.5

  Model             Total correct (%)
  Constant only     57
  All predictors    74
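A classification table with a 0.5 cutoff can be sketched as follows. The outcomes and fitted probabilities are invented for illustration and do not reproduce the percentages reported in the slides; `classification_table` is a hypothetical helper.

```python
def classification_table(probabilities, outcomes, cutoff=0.5):
    """Cross-tabulate observed vs predicted categories; return table and % correct."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}  # keys: (observed, predicted)
    for p, y in zip(probabilities, outcomes):
        predicted = 1 if p >= cutoff else 0
        table[(y, predicted)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, 100.0 * correct / len(outcomes)


# Hypothetical outcomes and fitted probabilities, for illustration only.
outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
full_model = [0.08, 0.13, 0.21, 0.32, 0.45, 0.55, 0.68, 0.79, 0.87, 0.92]

# A constant-only model gives every case the same probability: the mean of the DV.
base_rate = sum(outcomes) / len(outcomes)
constant_only = [base_rate] * len(outcomes)

for label, probs in (("Constant only", constant_only), ("All predictors", full_model)):
    table, percent_correct = classification_table(probs, outcomes)
    print(label, table, percent_correct)
```

The constant-only row is the baseline: any useful predictor should classify more cases correctly than assigning everyone to the same category.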

Prediction
  • If I am 50 years old ($x' = 50$), what is my probability of having coronary disease? Substitute $x'$ into the logistic curve, using the estimated constant $a$ and regression weight $b$:

$$\hat{p} = \frac{e^{a + bx'}}{1 + e^{a + bx'}}$$

Confidence intervals
  • CI = 0.95

Confidence bands
  • CI = 0.95

Recoding a continuous variable into a dichotomous variable
  • Cutoff at 55
  • Contingency table
  • Regression weights
  • Wald test
  • Explained variability
  • Classification table: total correct percentage = 57, then total correct percentage = 72

Recoding a continuous variable into a dichotomous variable
  • Cutoff at 55
  • Odds ratio: if I am 55 years old or older, my odds of having coronary disease are about 8 times higher.
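With a dichotomous predictor, the odds ratio can be read directly off the 2x2 contingency table, and its natural log equals the logistic regression weight. The counts below are hypothetical, chosen only so that the arithmetic is easy to follow; the slides' own table did not survive. The function and argument names are introduced here.

```python
import math


def odds_ratio_2x2(disease_older, healthy_older, disease_younger, healthy_younger):
    """Odds of disease in the older group divided by odds in the younger group."""
    odds_older = disease_older / healthy_older
    odds_younger = disease_younger / healthy_younger
    return odds_older / odds_younger


# Hypothetical counts (55 and up vs under 55), for illustration only.
ratio = odds_ratio_2x2(disease_older=24, healthy_older=12,
                       disease_younger=8, healthy_younger=32)

print(ratio)                        # odds ratio from the table
print(round(math.log(ratio), 4))    # the corresponding regression weight b = ln(OR)
```

An odds ratio of 8 means the odds, not the probability, are 8 times higher in the older group.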

Recoding a continuous variable into a dichotomous variable
  • Cutoff at 55
  • Confidence intervals: the CI (0.95) is asymmetric. It suggests that the odds of coronary disease are 2.9 to 22.9 times higher if I am 55 years old or older.
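The asymmetry arises because the interval is built symmetrically around the regression weight on the log-odds scale and then exponentiated. In the sketch below, the weight and standard error are hypothetical values back-calculated to roughly match the reported interval of 2.9 to 22.9; they are not taken from the slides. `odds_ratio_ci` is a name introduced here.

```python
import math


def odds_ratio_ci(b, se, z=1.96):
    """Odds ratio with a confidence interval; z = 1.96 gives a 0.95 interval."""
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return math.exp(b), lower, upper


# Hypothetical weight and standard error (assumed, back-calculated values).
b, se = 2.098, 0.527

ratio, lower, upper = odds_ratio_ci(b, se)
print(round(ratio, 2), round(lower, 2), round(upper, 2))

# Symmetric on the log scale, asymmetric on the odds-ratio scale.
print(round(math.log(ratio) - math.log(lower), 4), round(math.log(upper) - math.log(ratio), 4))
print(round(ratio - lower, 2), round(upper - ratio, 2))
```

The distance from the odds ratio to the upper limit is much larger than the distance to the lower limit, even though both limits are the same number of standard errors away from `b`.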