Advanced Models and Methods in Behavioral Research
• Chris Snijders
• c.c.p.snijders@gmail.com
• 3 ects
• http://www.chrissnijders.com/ammbr (= study guide)
• literature: Field book + separate course material
• laptop exam (+ assignments)
To do: Studyweb! Enroll in 0a611
Advanced Methods and Models in Behavioral Research – 2011/2012
The methods package
• MMBR (6 ects)
  – Blumberg: general: research question, reliability, validity, etc.
  – Field: SPSS: factor analysis, multiple regression, ANCOVA, sample size, etc.
• AMMBR (3 ects)
  – Field (in part): logistic regression
  – literature via website: conjoint analysis, multi-level regression
Models and methods: topics
• t-test, Cronbach's alpha, etc.
• multiple regression, analysis of (co)variance and factor analysis
• logistic regression
• conjoint analysis / repeated measures
  – Stata next to SPSS
  – “Finding new questions”
  – Practice data collection (a bit)
In the background: “now you should be able to do it on your own”
Methods in brief (1)
• Logistic regression: target Y, predictors Xi. Y is a binary variable (0/1).
  – Why not just multiple regression?
  – Interpretation is more difficult
  – Goodness of fit is non-standard
  – ...
Methods in brief (2)
• Conjoint analysis
Underlying assumption: for each user, the "utility" of a product can be written as
$U(x_1, x_2, \ldots, x_n) = c_0 + c_1 x_1 + \ldots + c_n x_n$
Example offer:
- 10 Euro p/m
- 2 years fixed
- free phone
- ...
How attractive is this offer to you?
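To make the linear utility assumption concrete, here is a minimal sketch that scores the example offer. The intercept and the coefficients (part-worths) are hypothetical values chosen for illustration; in a real conjoint study they would be estimated from respondents' ratings.

```python
# HYPOTHETICAL part-worths (c_0 ... c_n); none of these values come from the slides.
intercept = 5.0
part_worths = {"price_per_month": -0.20, "years_fixed": -0.75, "free_phone": 1.50}


def utility(offer):
    """Linear utility U(x) = c_0 + sum_i c_i * x_i for one offer."""
    return intercept + sum(part_worths[name] * value for name, value in offer.items())


# The offer from the slide: 10 Euro per month, 2 years fixed, free phone.
offer = {"price_per_month": 10, "years_fixed": 2, "free_phone": 1}
print(utility(offer))
```

Comparing the utility of several offers computed this way shows which attribute changes matter most to a given user.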
Conjoint analysis as an “in between method”
Between
• Which phone do you like and why?
• What would your favorite phone be?
And:
• Let’s keep track of what people buy.
Coming up with new ideas (3)
“More research is necessary”
But on what?
YOU: come up with sensible new ideas, given previous research
Stata next to SPSS
• It’s just better, I think (faster, better written, more possibilities, better programmable …)
• Multi-level regression is much easier than in SPSS
• It’s good to be exposed to more than just a single statistics package (your knowledge should not be based on “where to click” arguments)
• More stable
• Supports OSX as well … (anybody?)
But …
• Output less “polished”
• It takes some extra work to get you started
• The Logistic Regression chapter in the Field book uses SPSS (but still readable for the larger part)
• (and it’s not campus software, but subfaculty software)
• Installation …
Make sure to
• enroll in studyweb (0a611)
• read the Field chapter on logistic regression
Logistic Regression Analysis That is: your Y variable is 0/1: now what? Credit where credit is due: slides adapted from Gerrit Rooks
The main points
1. Why do we have to know and sometimes use logistic regression?
2. What is the underlying model? What is maximum likelihood estimation?
3. Logistics of logistic regression analysis
   1. Estimate coefficients
   2. Assess model fit
   3. Interpret coefficients
   4. Check residuals
4. An SPSS example
Suppose we have 100 observations with information about an individual's age and whether or not this individual had some kind of heart disease (CHD)
ID    age   CHD
1     20    0
2     23    0
3     24    0
4     25    1
…
98    64    0
99    65    1
100   69    1
A graphic representation of the data
[Figure: CHD plotted against Age]
Let’s just try regression analysis
pr(CHD|age) = -0.54 + 0.0218107*Age
... linear regression is not a suitable model for probabilities
pr(CHD|age) = -0.54 + 0.0218107*Age
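A quick check of why the straight line is unsuitable: plugging ages into the fitted equation from the slide produces "probabilities" below 0 and above 1. The ages 20 and 69 are the youngest and oldest in the data table; age 75 lies outside the data and is only included to show what happens on extrapolation.

```python
def linear_probability(age):
    """Fitted line from the slide: pr(CHD|age) = -0.54 + 0.0218107 * age."""
    return -0.54 + 0.0218107 * age


for age in (20, 45, 69, 75):
    print(age, round(linear_probability(age), 3))
```

Already at age 20, inside the observed range, the predicted value is negative.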
In this graph for 8 age groups, I plotted the probability of having a heart disease (proportion)
A nonlinear model is probably better here
Something like this
This is the logistic regression model
• Predicted probabilities are always between 0 and 1
• The linear part (b0 + b1 X) is similar to classic regression analysis
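The formula image is not preserved in this transcript, so the sketch below uses the standard textbook form of the logistic model, P(Y = 1) = 1 / (1 + e^-(b0 + b1 X)). It shows that any value of the linear part, however large or small, is mapped to a probability strictly between 0 and 1.

```python
import math


def logistic(z):
    """Map a real-valued linear predictor z = b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-10, -2, 0, 2, 10):
    print(z, round(logistic(z), 4))
```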
Side note: this is similar to MMBR …
Suppose Y is a percentage (so between 0 and 1). Then consider
$Y = \frac{1}{1 + e^{-(b_0 + b_1 X)}}$
… which will ensure that the estimated Y will vary between 0 and 1, and after some rearranging this is the same as
$\ln\left(\frac{Y}{1 - Y}\right) = b_0 + b_1 X$
… (continued)
And one “solution” might be:
- Change all Y values that are 0 to 0.001
- Change all Y values that are 1 to 0.999
Now run regression on log(Y/(1-Y)) …
… but that doesn’t work so well …
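One way to see why this "solution" is unsatisfactory, sketched with a small made-up outcome vector: after the replacement, a 0/1 variable still takes only two distinct transformed values, and those values are dictated entirely by the arbitrary replacement constant (called `eps` here, a name introduced for this example).

```python
import math


def clipped_logit(y, eps=0.001):
    """Replace 0 by eps and 1 by 1 - eps, then return log(Y / (1 - Y))."""
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))


outcomes = [0, 0, 1, 0, 1, 1]  # made-up binary outcomes
for eps in (0.001, 0.0001):
    distinct = sorted({round(clipped_logit(y, eps), 3) for y in outcomes})
    print(eps, distinct)
```

Changing 0.001 to 0.0001 shifts the transformed values noticeably, so any regression on them would depend on a choice the data say nothing about.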
Logistics of logistic regression
1. How do we estimate the coefficients?
2. How do we assess model fit?
3. How do we interpret coefficients?
4. How do we check regression assumptions?
Kinds of estimation in regression
• Ordinary Least Squares (we fit a line through a cloud of dots)
• Maximum likelihood (we find the parameter values under which our data are most likely)
We never bothered to consider maximum likelihood in standard multiple regression, because you can show that there (with normally distributed errors) both methods lead to exactly the same estimator. OLS does not work well in logistic regression, but maximum likelihood estimation does …
Maximum likelihood estimation
• The method of maximum likelihood yields values for the unknown parameters (here b0 and b1) that maximize the probability of obtaining the observed set of data.
Maximum likelihood estimation
• First we have to construct the likelihood function (probability of obtaining the observed set of data).
Likelihood = pr(obs_1) * pr(obs_2) * pr(obs_3) * … * pr(obs_n)
Assuming that observations are independent
Log-likelihood
• For technical reasons the likelihood is transformed into the log-likelihood (then you just maximize the sum of the logged probabilities)
LL = ln[pr(obs_1)] + ln[pr(obs_2)] + ln[pr(obs_3)] + … + ln[pr(obs_n)]
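The two formulas can be sketched in a few lines. The data are only the seven rows visible on the data slide (not the full 100 observations), and the trial values for b0 and b1 are hypothetical. The last part illustrates one technical reason for taking logs: a product of many probabilities underflows to zero in floating point, while the sum of logs stays well behaved.

```python
import math

ages = [20, 23, 24, 25, 64, 65, 69]  # the seven rows visible on the data slide
chd = [0, 0, 0, 1, 0, 1, 1]


def prob_of_observation(age, y, b0, b1):
    """Probability of the observed outcome y under the logistic model."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))
    return p if y == 1 else 1.0 - p


def likelihood(b0, b1):
    result = 1.0
    for age, y in zip(ages, chd):
        result *= prob_of_observation(age, y, b0, b1)
    return result


def log_likelihood(b0, b1):
    return sum(math.log(prob_of_observation(age, y, b0, b1))
               for age, y in zip(ages, chd))


b0, b1 = -2.0, 0.04  # HYPOTHETICAL trial values
print(likelihood(b0, b1), log_likelihood(b0, b1))

# 2000 observations with probability 0.5 each: the product underflows,
# the sum of logs does not.
tiny = 1.0
for _ in range(2000):
    tiny *= 0.5
print(tiny, 2000 * math.log(0.5))
```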
Note: optimizing log-likelihoods is difficult
• It’s iterative (“searching the landscape”)
• It might not converge
• It might converge to the wrong answer
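What "iterative" means can be sketched with Newton-Raphson, one common search method; the slides do not say which algorithm SPSS uses, so this is an illustration rather than a description of SPSS. The data are again only the seven rows visible on the data slide, so the estimates will not match the SPSS results for the full sample.

```python
import math

ages = [20, 23, 24, 25, 64, 65, 69]  # seven visible rows, not the full data set
chd = [0, 0, 0, 1, 0, 1, 1]


def prob(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(b0, b1, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def newton_raphson(xs, ys, max_iter=25, tol=1e-8):
    """Start at b0 = b1 = 0 and step uphill until the steps become negligible."""
    b0, b1 = 0.0, 0.0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # 2x2 linear system solved by hand
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1, iteration


b0_hat, b1_hat, n_iter = newton_raphson(ages, chd)
print(b0_hat, b1_hat, n_iter)
print(log_likelihood(0.0, 0.0, ages, chd), log_likelihood(b0_hat, b1_hat, ages, chd))
```

With perfectly separated data (all young cases 0, all old cases 1) the steps never become small, which is one way the search can fail to converge.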
Estimation of coefficients: SPSS Results
This function fits very well; other values of b0 and b1 give worse results
Illustration 1: suppose we chose 0.05 X instead of 0.11 X
Illustration 2: suppose we chose 0.40 X instead of 0.11 X
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
  – Between model comparisons
  – Pseudo R² (similar to multiple regression)
  – Predictive accuracy
• Interpret coefficients
• Check regression assumptions
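Model fit: comparisons between models
The log-likelihood ratio test statistic can be used to test the fit of a model:
$\chi^2 = 2\,[LL(\text{full model}) - LL(\text{reduced model})]$
The test statistic has a chi-square distribution, with degrees of freedom equal to the difference in the number of parameters between the two models.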
Between model comparisons: likelihood ratio test (full model vs. reduced model)
The model including only an intercept is often called the empty model. SPSS uses this model as the default reduced model.
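A minimal sketch of the likelihood ratio test for the simplest case, where the full model has exactly one parameter more than the reduced model (df = 1). The two log-likelihood values are hypothetical and are not taken from the SPSS output. For df = 1 the chi-square tail probability can be written with the complementary error function, so no statistics library is needed.

```python
import math


def likelihood_ratio_test(ll_full, ll_reduced):
    """Return (chi-square statistic, p-value) for ONE extra parameter (df = 1)."""
    chi_square = 2.0 * (ll_full - ll_reduced)
    # For df = 1: P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# HYPOTHETICAL log-likelihoods of an empty model and a model with one predictor.
ll_empty = -69.0
ll_with_age = -55.0
chi_square, p_value = likelihood_ratio_test(ll_with_age, ll_empty)
print(chi_square, p_value)
```

A small p-value means the full model fits significantly better than the reduced model.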
Between model comparison: SPSS output
This is the test statistic, and its associated significance
Overall model fit: pseudo R²
Pseudo R² is computed from the log-likelihood of the model that you want to test and the log-likelihood of the model before any predictors were entered.
Just like in multiple regression, pseudo R² ranges from 0.0 to 1.0
– Cox and Snell
  • cannot theoretically reach 1
– Nagelkerke
  • adjusted so that it can reach 1
NOTE: R² in logistic regression tends to be (even) smaller than in multiple regression
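The two measures can be sketched using their standard definitions (the formula image itself is not preserved here). The log-likelihood values are hypothetical; n = 100 matches the number of observations in the CHD example.

```python
import math


def cox_snell(ll_model, ll_empty, n):
    """Cox and Snell R^2 = 1 - exp((2 / n) * (LL_empty - LL_model))."""
    return 1.0 - math.exp((2.0 / n) * (ll_empty - ll_model))


def nagelkerke(ll_model, ll_empty, n):
    """Cox and Snell R^2 divided by its maximum attainable value."""
    maximum = 1.0 - math.exp((2.0 / n) * ll_empty)
    return cox_snell(ll_model, ll_empty, n) / maximum


# HYPOTHETICAL log-likelihoods, n = 100 observations.
ll_empty, ll_model, n = -69.0, -55.0, 100
print(round(cox_snell(ll_model, ll_empty, n), 3),
      round(nagelkerke(ll_model, ll_empty, n), 3))

# A perfectly fitting model has LL = 0: Cox and Snell stays below 1,
# Nagelkerke reaches 1.
print(cox_snell(0.0, ll_empty, n), nagelkerke(0.0, ll_empty, n))
```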
Overall model fit: Classification table
We correctly predict 74% of our observations
Overall model fit: Classification table
14 cases had a CHD while according to our model this shouldn't have happened
Overall model fit: Classification table
12 cases didn’t have a CHD while according to our model this should have happened
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
  – Direction
  – Significance
  – Magnitude
• Check regression assumptions
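Interpreting coefficients: direction
We can rewrite our model as follows:
$\ln\left(\frac{P(Y=1)}{1 - P(Y=1)}\right) = b_0 + b_1 X$ (the logit)
$\frac{P(Y=1)}{1 - P(Y=1)} = e^{b_0 + b_1 X}$ (the odds)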
Interpreting coefficients: direction
• original b reflects changes in logit: b > 0 implies a positive relationship
• exponentiated b reflects the changes in odds: exp(b) > 1 implies a positive relationship
Interpreting coefficients: magnitude
• The slope coefficient (b) is interpreted as the rate of change in the "log odds" as X changes … not very useful.
• exp(b) is the effect of the independent variable on the odds, more useful for calculating the size of an effect
Magnitude of association: Percentage change in odds
Probability   Odds
25%           0.33
50%           1
75%           3
Magnitude of association
• For the age variable:
  – Percentage change in odds = (exponentiated coefficient – 1) * 100 = 12%, or “the odds times 1.117”
  – A one unit increase in age will result in a 12% increase in the odds that the person will have a CHD
  – So if a soccer player is one year older, the odds that (s)he will have CHD are 12% higher
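The arithmetic behind the odds table and the 12% figure can be checked in a few lines. The odds ratio 1.117 is the value quoted on the slide; the coefficient b is backed out from it here rather than read from the SPSS output.

```python
import math


def odds(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1.0 - p)


for p in (0.25, 0.50, 0.75):
    print(p, round(odds(p), 2))

odds_ratio = 1.117                 # "the odds times 1.117" from the slide
b = math.log(odds_ratio)           # implied coefficient on the logit scale
percent_change = (math.exp(b) - 1.0) * 100.0
print(round(b, 3), round(percent_change, 1))
```

The unrounded percentage change is 11.7%, which the slide reports as 12%.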
Another way to get an idea of the size of effects: Calculating predicted probabilities
For somebody of 20 years old, the predicted probability is .04
For somebody of 70 years old, the predicted probability is .91
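A sketch of how such predicted probabilities are obtained. The slope 0.11 is the value mentioned on the slides; the intercept is not given in this transcript, so it is backed out here from the reported probability of .04 at age 20. That is an assumption made for this example, and the age-70 result then serves as a consistency check against the reported .91.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


b1 = 0.11                        # slope mentioned on the slides (".11 X")
b0 = logit(0.04) - b1 * 20       # ASSUMPTION: intercept implied by p(age 20) = .04

for age in (20, 45, 70):
    print(age, round(logistic(b0 + b1 * age), 2))
```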
But this gets more complicated when you have more than a single X-variable (see blackboard)
Conclusion: if you consider the effect of a variable on the predicted probability, the size of the effect of X1 depends on the value of X2!
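The conclusion can be illustrated with a model that has two predictors. All three coefficients below are hypothetical. The same one-unit increase in X1 changes the predicted probability by a different amount depending on where X2 puts the case on the S-curve.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL coefficients for a model with two predictors X1 and X2.
b0, b1, b2 = -3.0, 0.5, 1.0


def effect_of_x1(x2):
    """Change in predicted probability when X1 goes from 2 to 3, holding X2 fixed."""
    return logistic(b0 + b1 * 3 + b2 * x2) - logistic(b0 + b1 * 2 + b2 * x2)


for x2 in (0, 2, 6):
    print(x2, round(effect_of_x1(x2), 3))
```

The effect is largest when the predicted probability is near 0.5 and shrinks towards both ends, even though b1 itself never changes.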
Testing significance of coefficients
t = estimate / standard error of estimate (t-distribution)
• In linear regression analysis this statistic is used to test significance
• In logistic regression something similar exists
• However, when b is large, the standard error tends to become inflated, so the statistic is underestimated (Type II errors are more likely)
Note: This is not the Wald statistic SPSS presents!!!
Interpreting coefficients: significance
• SPSS presents: [formula not preserved]
• While Andy Field thinks SPSS presents this (at least in the 2nd edition of the book): [formula not preserved]
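The two formulas are not preserved in this transcript. On the assumption that the Wald statistic in the SPSS output is the squared ratio (b / SE)², evaluated against a chi-square distribution with 1 degree of freedom, while the unsquared ratio b / SE is evaluated against a normal distribution, the sketch below shows that both versions lead to the same p-value. The standard error is a hypothetical value.

```python
import math

b = 0.11      # slope mentioned on the slides (".11 X")
se = 0.024    # HYPOTHETICAL standard error, for illustration only

z = b / se            # ratio of estimate to standard error
wald = z ** 2         # squared ratio (assumed to be what the output labels "Wald")

p_from_z = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided normal p-value
p_from_wald = math.erfc(math.sqrt(wald / 2.0))    # chi-square tail, df = 1
print(round(z, 2), round(wald, 2), p_from_z, p_from_wald)
```

So the two versions differ in the number reported, not in the conclusion about significance.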
Logistic regression
• Y = 0/1
• Multiple regression (or ANCOVA) is not right
• You consider either the odds or the log(odds)
• It is estimated through “maximum likelihood”
• Interpretation is a bit more complicated than normal
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
• Check regression assumptions
Checking assumptions
• Influential data points & Residuals
  – Follow Samantha's tips
• Hosmer & Lemeshow
  – Divides sample in subgroups
  – Checks whether there are differences between observed and predicted between subgroups
  – Test should not be significant, if so: indication of lack of fit
Hosmer & Lemeshow
• Test divides sample in subgroups, checks whether difference between observed and predicted is about equal in these groups
• Test should not be significant (indicating no difference)
Examining residuals in logistic regression
1. Isolate points for which the model fits poorly
2. Isolate influential data points
Residual statistics
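Cook's distance
$D_i = \frac{\sum_j (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \cdot MSE}$
• $\hat{Y}_j$: prediction for j from all observations
• $\hat{Y}_{j(i)}$: prediction for j from the observations excluding observation i
• $p$: number of parameters
• $MSE$: mean square error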
Illustration with SPSS
• Penalty kicks data, variables:
  – Scored: outcome variable
    • 0 = penalty missed, and 1 = penalty scored
  – Pswq: degree to which a player worries
  – Previous: percentage of penalties scored by a particular player in their career
SPSS OUTPUT Logistic Regression
Tells you something about the number of observations and missing values
Block 0: Beginning Block
• This table is based on the empty model, i.e. only the constant in the model
• These variables will be entered in the model later on
Block 1: Method = Enter
• Block is useful to check significance of individual coefficients, see Field
• New model: this is the test statistic
• After dividing by -2
• Note: Nagelkerke is larger than Cox
Block 1: Method = Enter (Continued)
• Predictive accuracy has improved (was 53%)
• Columns: estimates; standard error of the estimates; significance based on Wald statistic; change in odds
How is the classification table constructed?
# cases not predicted correctly
How is the classification table constructed?
pswq   previous   scored   Predict. prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
How is the classification table constructed?
pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1
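The construction can be sketched with the four example rows above. A cutoff of 0.5 is assumed here; it reproduces the "predicted" column in the table. Each case is then counted in the cell given by its observed and its predicted outcome.

```python
# The four example rows: (pswq, previous, scored, predicted probability)
rows = [(18, 56, 1, 0.68), (17, 35, 1, 0.41), (20, 45, 0, 0.40), (10, 42, 0, 0.85)]


def classify(probability, cutoff=0.5):
    """Predict 1 when the predicted probability reaches the cutoff, else 0."""
    return 1 if probability >= cutoff else 0


# Cross-tabulate observed outcome against predicted outcome.
table = {(observed, predicted): 0 for observed in (0, 1) for predicted in (0, 1)}
for _, _, scored, probability in rows:
    table[(scored, classify(probability))] += 1

correct = table[(0, 0)] + table[(1, 1)]
accuracy = correct / len(rows)
print(table, accuracy)
```

For these four rows, two cases land on the diagonal (correctly predicted) and two off the diagonal.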