Logistic regression: Who survived Titanic?

  • Slides: 27

The sinking of Titanic
Titanic sank on the night of April 14th 1912 with 2228 souls on board. 705 survived.
A dataset covering 1309 of the passengers has survived. Who survived?

The data

pclass  survived  name                                             sex     age     sibsp  parch
1       1         Allen, Miss. Elisabeth Walton                    female  29      0      0
1       1         Allison, Master. Hudson Trevor                   male    0.9167  1      2
1       0         Allison, Miss. Helen Loraine                     female  2       1      2
1       0         Allison, Mr. Hudson Joshua Creighton             male    30      1      2
1       0         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25      1      2
1       1         Anderson, Mr. Harry                              male    48      0      0
1       1         Andrews, Miss. Kornelia Theodosia                female  63      1      0
1       0         Andrews, Mr. Thomas Jr                           male    39      0      0
1       1         Appleton, Mrs. Edward Dale (Charlotte Lamson)    female  53      2      0

Sibsp is the number of siblings and/or spouses accompanying.
Parch is the number of parents and/or children accompanying.
Some values are missing.
Can we predict who will survive Titanic II?

Analyzing the data in a (too) simple manner
• Associations between factors without considering interactions
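A one-factor-at-a-time look at the data could be sketched as follows. This is only an illustration on the nine sample rows shown on the data slide, not on the full 1309-passenger dataset, and the helper name `survival_rate_by` is made up for this example.

```python
from collections import defaultdict

# The nine sample rows from the data slide: (pclass, survived, sex).
ROWS = [
    (1, 1, "female"),  # Allen, Miss. Elisabeth Walton
    (1, 1, "male"),    # Allison, Master. Hudson Trevor
    (1, 0, "female"),  # Allison, Miss. Helen Loraine
    (1, 0, "male"),    # Allison, Mr. Hudson Joshua Creighton
    (1, 0, "female"),  # Allison, Mrs. Hudson J C
    (1, 1, "male"),    # Anderson, Mr. Harry
    (1, 1, "female"),  # Andrews, Miss. Kornelia Theodosia
    (1, 0, "male"),    # Andrews, Mr. Thomas Jr
    (1, 1, "female"),  # Appleton, Mrs. Edward Dale
]


def survival_rate_by(rows, key_index):
    """Share of survivors per level of one factor, ignoring all other factors."""
    counts = defaultdict(lambda: [0, 0])  # level -> [survivors, total]
    for row in rows:
        level = row[key_index]
        counts[level][0] += row[1]
        counts[level][1] += 1
    return {level: s / n for level, (s, n) in counts.items()}


# Marginal survival rate by sex (column index 2).
rates = survival_rate_by(ROWS, key_index=2)
```

Each factor is summarized on its own here, which is exactly why this kind of analysis cannot show how the factors interact.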

Could we use multiple linear regression to predict survival?

Multiple linear regression                             Logistic regression
Response variable is defined between -inf and +inf     Response variable is defined between 0 and 1
Normally distributed                                   Bernoulli distributed

Logit transformation is modeled linearly
The logistic function
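The formulas on this slide did not survive extraction. The standard definitions are given here as a reminder; the symbols p (probability of survival), x_1 ... x_k (explanatory variables) and beta_0 ... beta_k (fitted parameters) are notation assumed for this note and may differ from the original slide.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k = z,
\qquad
p = \frac{1}{1 + e^{-z}}
```

The left-hand expression is the quantity that is modeled linearly; the right-hand one is the logistic function that maps z back to a probability between 0 and 1.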

The sigmoidal curve
• The intercept basically just shifts the curve along the input variable
• Large regression coefficient → risk factor strongly influences the probability
• Positive regression coefficient → risk factor increases the probability
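The effect of the intercept and the regression coefficient on the curve can be checked numerically. This is a minimal sketch for a model with a single input variable; the function names and the parameter values are illustrative choices, not taken from the slides.

```python
import math


def logistic(x, intercept, coef):
    """Probability from a one-variable logistic model: p = 1 / (1 + exp(-z))."""
    z = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-z))


def midpoint(intercept, coef):
    """Input value where p = 0.5, i.e. where z = 0."""
    return -intercept / coef


# Changing the intercept moves the p = 0.5 point; the shape stays the same.
mid_a = midpoint(0.0, 1.0)
mid_b = midpoint(-2.0, 1.0)

# A larger coefficient makes the same step in x change p more.
step_small = logistic(1.0, 0.0, 0.5) - logistic(0.0, 0.0, 0.5)
step_large = logistic(1.0, 0.0, 3.0) - logistic(0.0, 0.0, 3.0)

# A positive coefficient gives a rising curve, a negative one a falling curve.
rising = logistic(1.0, 0.0, 1.0) > logistic(0.0, 0.0, 1.0)
falling = logistic(1.0, 0.0, -1.0) < logistic(0.0, 0.0, -1.0)
```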

Logistic regression of the Titanic data
1. Summary of data
2. Coding of the dependent variable
3. Coding of the categorical explanatory variable:
   First class: 1
   Second class: 2
   Third class: reference
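Coding a categorical variable against a reference class amounts to one 0/1 indicator per non-reference level. A minimal sketch, assuming pclass takes the values 1, 2 and 3; the function name `code_pclass` is hypothetical and this is not how SPSS implements it internally.

```python
def code_pclass(pclass):
    """Indicator coding of ticket class with third class as the reference.

    Returns (first_class, second_class); third class is coded (0, 0).
    """
    if pclass not in (1, 2, 3):
        raise ValueError(f"unexpected pclass: {pclass!r}")
    return (int(pclass == 1), int(pclass == 2))
```

With this coding, each fitted parameter compares one class with third class, which is why the odds ratios are read relative to the reference class.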

Logistic regression of the Titanic data
• A fit of the null model, basically just the intercept. Usually not interesting.
• The total probability of survival is 500/1309 = 0.382. The cutoff is 0.5, so all passengers are classified as non-survivors.
• Basically tests if the null model is sufficient. It almost certainly is not.
• Shows that survival is related to pclass (which is not in the null model).
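The null model can be reproduced by hand from the two counts on the slide. The sketch below only uses those counts; the variable names are illustrative.

```python
import math

survivors, total = 500, 1309  # counts quoted on the slide

p_survive = survivors / total                       # about 0.382
intercept = math.log(p_survive / (1 - p_survive))   # log-odds of survival

# With a 0.5 cutoff, the null model predicts the same class for everyone.
predicted_class = int(p_survive >= 0.5)             # 0 = did not survive

# Share of passengers the null model classifies correctly.
if predicted_class == 0:
    accuracy = (total - survivors) / total
else:
    accuracy = survivors / total
```

The intercept is negative because fewer than half of the passengers in the dataset survived, and the accuracy is simply the share of non-survivors.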

Logistic regression of the Titanic data
1. Omnibus test: Uses the likelihood ratio (LR) to describe whether adding the pclass variable makes the model better. It did! But only better than the null model, so no surprise.
2. Model summary: Other measures of the goodness of fit.
3. Classification table: By including pclass, 67.7% of the passengers were correctly categorized.
4. Variables in the equation: The first line repeats that pclass has a significant effect on survival. B is the fitted logistic parameter. Exp(B) is the odds ratio, so the odds of survival are 4.7 (3.6-6.3) times the odds for passengers in third class (the reference class).
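The relation between B, Exp(B) and its interval can be sketched as below. The values of B and its standard error are back-calculated from the reported 4.7 (3.6-6.3) purely for illustration; they are not read from the SPSS output, and the 95% level (z = 1.96) is an assumption since the slide does not state it.

```python
import math


def odds_ratio_with_ci(b, se, z_crit=1.96):
    """Exp(B) with a Wald interval exp(B - z*SE) .. exp(B + z*SE)."""
    return (
        math.exp(b),
        math.exp(b - z_crit * se),
        math.exp(b + z_crit * se),
    )


# Illustrative inputs, back-calculated from the reported odds ratio and interval.
b = math.log(4.7)
se = (math.log(6.3) - math.log(3.6)) / (2 * 1.96)

odds_ratio, lower, upper = odds_ratio_with_ci(b, se)
```

The interval is symmetric around B on the log scale, which is why it is not symmetric around Exp(B).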

Logistic regression of the Titanic data, now adding family relations
1. '3 or more' is set as the reference group by SPSS

Logistic regression of the Titanic data, now adding family relations
1. The model correctly classifies 79.1% of the passengers

Logistic regression of the Titanic data, now adding family relations
1. Basically all factors seem to affect the probability of survival.

How was it with age?
• Linear associations are easy to model, because the factor enters the predictive value directly.
• But the association does not really look linear. Maybe a third-order polynomial?
• Three new age factors are calculated: the first, second, and third order of the age divided by the standard deviation.
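The three age factors could be computed along these lines. The slide only says the age is divided by the standard deviation; centering at 30 and the value 14.41 are inferred from the worked prediction example on a later slide, so treat both as assumptions. The function name is hypothetical.

```python
def age_terms(age, mean_age, sd_age):
    """First-, second- and third-order terms of the scaled age."""
    z = (age - mean_age) / sd_age
    return z, z ** 2, z ** 3


# Assumed centering (30) and standard deviation (14.41), see the note above.
linear, quadratic, cubic = age_terms(25, mean_age=30.0, sd_age=14.41)
```

Scaling before taking powers keeps the three factors on comparable magnitudes, which tends to make the fitted parameters easier to compare.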

How was it with age?
• The third-order age factor did not add significantly to the model.
• By adding the third-order polynomial the model correctly categorizes 79.4% of the passengers, vs 79.1% before.
• Par.Child is no longer a significant factor and can be omitted from the model.

Using the model to predict survival
• Omitting the second- and third-order age and Par.Child factors
• What is the probability that a 25-year-old woman holding a second-class ticket and accompanied only by her husband would survive Titanic?
z = -3.929 - 0.589*(-5)/14.41 + 1.718 + 2.552 + 0.926 = 1.4714
p = 1/(1 + exp(-z)) ≈ 0.81
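The arithmetic of this prediction can be checked with a few lines. The numbers are exactly those in the slide's expression for z. Reading -3.929 as the constant and (-5)/14.41 as the scaled age of a 25-year-old is an interpretation, and the slide does not say which of the last three terms belongs to sex, ticket class or sibling/spouse count.

```python
import math

# Terms of the linear predictor as written on the slide.
terms = [
    -3.929,                 # presumably the constant
    -0.589 * (-5) / 14.41,  # age term, read as coefficient * (25 - 30) / 14.41
    1.718,                  # these three terms cover sex, ticket class and
    2.552,                  # sibling/spouse count; the slide does not label
    0.926,                  # which value belongs to which factor
]

z = sum(terms)
p_survival = 1.0 / (1.0 + math.exp(-z))  # logistic function applied to z
```

With the 0.5 cutoff used earlier, this passenger would be classified as a survivor.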

Analysing interaction of selected factors
pclass * sex, age * sex, pclass * Siblings/Parents
But the model does not converge…

Analysing interaction of selected factors
Collapsing the sibling/spouse number eradicated their mutual interaction

Is it realistic that Leonardo survives and the chick dies?