Logistic Regression and Generalized Linear Models Blood Screening
Logistic Regression and Generalized Linear Models: Blood Screening, Women’s Role in Society, and Colonic Polyps
Blood Screening • ESR measurements (erythrocyte sedimentation rate) – Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase • Looking for an association between the probability of an ESR reading > 20 mm/hr and the levels of the two plasma proteins. • Less than 20 mm/hr indicates a healthy individual
Multiple Regression Doesn’t Work • Not Normally distributed – The response variable is binary. – In fact, the distribution is Binomial. The can be seen by looking at how the error term relates to the probability. If y = 1 then the error term is 1 - P(y=1). We will assume for our Random Variables Y that
Logistic Regression, a Generalized Linear Model • Modelling the expected value of the response requires a transformation – This chapter is about the logit transformation of a model, which is of the form • If the response variable is 1, the log odds of the response is the logit function of a probability. • Fixing all explanatory variables but xj, exp(Bj) represents the odds that the response variable is 1 when xj increases by 1.
R commands • Plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial()) – Fits the model, in this example specifying the logistic function is implied because of the binomial() parameter is already passed. • Layout(matrix(1: 2, ncol = 2)) – Lets you choose the layout of your graph screen. • Cdplot(ESR ~ fibrinogen, data = plasma) • Cdplot(ESR ~ globulin, data = plasma) – cdplot “plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x” (help(“cdplot”))
Interpretation • The area of the dark region is the probability that ESR < 20 mm/hr, and this decreases as the protein levels increase • There is not much shape to the globulin density function.
R Commands • Confint(plasma_glm_1, parm = “fibrinogen”) – Gives a confidence interval – Output: • Waiting for profiling to be done. . . • 2. 5 % 97. 5 % • 0. 3389465 3. 9988602 • Exp(coef(plasma_glm_1)[“fibrinogen”]) – This is necessary to do the reverse transformation to get the model coefficients – Output: • fibrinogen • 6. 215715
R Commands • Summary(plasma_glm_1) – Output: • Deviance Residuals: • Min 1 Q Median 3 Q Max • -0. 9298 -0. 5399 -0. 4382 -0. 3356 2. 4794 • • • Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6. 8451 2. 7703 -2. 471 0. 0135 * fibrinogen 1. 8271 0. 9009 2. 028 0. 0425 * --Signif. codes: 0 ‘***’ 0. 001 ‘**’ 0. 01 ‘*’ 0. 05 ‘. ’ 0. 1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30. 885 on 31 degrees of freedom • Residual deviance: 24. 840 on 30 degrees of freedom
R Commands • Exp(confint(plasma_glm_1, parm = “fibrinogen”)) – Output: • 2. 5 % 97. 5 % • 1. 403468 54. 535954 • Plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial()) – Use logistic regression for both variables fibrinogen and globulin • Summary(Plasma_glm_2)
R Commands • Summary(Plasma_glm_2) – Output • Deviance Residuals: • Min 1 Q Median 3 Q Max • -0. 9683 -0. 6122 -0. 3458 -0. 2116 2. 2636 • • Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -12. 7921 5. 7963 -2. 207 0. 0273 * fibrinogen 1. 9104 0. 9710 1. 967 0. 0491 * globulin 0. 1558 0. 1195 1. 303 0. 1925 --Signif. codes: 0 ‘***’ 0. 001 ‘**’ 0. 01 ‘*’ 0. 05 ‘. ’ 0. 1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30. 885 on 31 degrees of freedom • Residual deviance: 22. 971 on 29 degrees of freedom
Interpretation • large confidence interval is because there are not many observations in all, and even fewer ESR > 20 mm/hr • The globulin coefficient is basically zero
R Commands • Anova(plasma_glm_1, plasma_glm_2, test = “Chisq”) – Compares the two models one of just fibrinogen, the other of both fibrinogen and globulin – Why Chisquared test? • ANOVA assumes normal distribution for each data set. Why is this OK? – Output • Analysis of Deviance Table • Model 1: ESR ~ fibrinogen • Model 2: ESR ~ fibrinogen + globulin • Resid. Df Resid. Dev Df Deviance P(>|Chi|) • 1 30 24. 8404 • 2 29 22. 9711 1 1. 8692 0. 1716 • Comparing the residual deviance between the two models, we see that the difference is not significant. We must surmise that there is no association between globulin and ESR level.
R Commands • Prob <- predict(plasma_glm_1, type = “response” – The model of just fibrinogen useful in creating a neat looking bubble plot. First we must take the predicted values from the first model and use them to determine the size of the bubbles • Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = “. ”) – Plots the second model • Symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE) – Uses the values of the first model to create different bubble sizes.
Bubbles size is Probability • The plot shows how the probability of having an ESR > 20 mm/hr increases as fibrinogen increases
Generalized Linear Model • Unifies the logistic regression, Analysis of Variance and multiple regression techniques. • GLMs have three essential parts – An error distribution- the distribution of the response variable. • Normal for Analysis of Variance and Multiple Regression • Binomial for Logistic Regression
Main parts of GLM (continued) • A link function which links the explanatory variables to the expected value of the response. – Logit function for logistic regression – Identity function for ANOVA and multiple regression • A variance function which shows the dependency of the response variable variability on the mean
Measure of Fit • The deviance shows how well the model fits the data • Comparing two model’s deviances – Use a likelihood ratio test – Compare using Chi-square distribution
Women’s Role in Society • Response to “Women should take care of running their homes and leave the running the country up to men”. • Factors: Education, Sex • Response: Agree or Disagree – data is presented as categories with counts for each education, sex combination
R Commands • Womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family= binomial()) – This uses the cbind function to change data from two responses to one response that is a matrix of agree, disagree counts. • Summary(womensrole_glm_1) – We can see that education is a significant factor to the response.
Summary Output • • • Deviance Residuals: Min 1 Q Median 3 Q Max -2. 72544 -0. 86302 -0. 06525 0. 84340 3. 13315 • • Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2. 50937 0. 18389 13. 646 <2 e-16 *** Why is there a P-value for the intercept? sex. Female -0. 01145 0. 08415 -0. 136 0. 892 education -0. 27062 0. 01541 -17. 560 <2 e-16 *** --Signif. codes: 0 ‘***’ 0. 001 ‘**’ 0. 01 ‘*’ 0. 05 ‘. ’ 0. 1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • • Null deviance: 451. 722 on 40 degrees of freedom Residual deviance: 64. 007 on 38 degrees of freedom
Declaring a function in R • Myplot <- function(role. fitted) { – Declares the function myplot and passes it an object “role. fitted” to be used in the function • f <- womensrole$sex == “Female” – Stores everything in the data with sex = femail in f • plot(womensrole$education, role. fitted, type = "n", ylab = "probability of agreeing", xlab = "Education", ylim = c(0, 1)) – Plots education against the role. fitted object which will be some predicted values from our GLM defined as • Role. fitted 1 <- predict(womensrole_glm_1, type = “response”)
Myplot function (cont) • lines(womensrole$education[!f], role. fitted[!f], lty = 1) • lines(womensrole$education[f], role. fitted[f], lty = 2) – These graph the lines, one for males and one for femails. Lty indicates the kind of line (in this case solid or dotted) • lgtxt <- c("Fitted (Males)", "Fitted (Females)") • legend("topright", lgtxt, lty = 1: 2, bty = "n") – These add a legend for each line.
Myplot Function (cont) • y<-womensrole$agree/ (womensrole$agree + womensrole$disagree) – A basic calculation of the proportion of women that agree • text(womensrole$education, y, ifelse(f, "\VE", "\MA"), family = "Hershey. Serif", cex = 1. 25) – Plots y and not y using male/female symbols
Interpretation • The two fitted lines indicate that sex does not change a probability of agreeing vs education • The symbols of unfitted observations may indicate an interaction between sex and education
My. Plot Sex: Education • By running the same analysis, e. g. – Womensrole_glm_2 <glm(cbind(agree, disagree) ~ sex*education, data = womensrole, family = binomial()) – Summary(womensrole_glm_2) • This summary shows a significant p-value for the sex: education interaction – Role. fitted 2 <predict(womensrole_glm_1, type = “response”) – Myplot(role. fitted 2) • The plot shows that less education is associated with agreement that women belong in the home.
The Deviance Residual • One of the many methods for checking the adequacy of the model fit – The deviance residual is the square root of the part of each observation that contributes to the deviance • Res <- residuals( womensrole_glm_2, type = “deviance”) – Pulls the residuals for many other models as well • Plot(predict(womensrole_glm_2), res, xlab = “fitted values”, ylab = “Residuals”, lim = max(abs(res))*c(-1, 1)) – No visible pattern: fit appears ok • Abline(h = 0, lty = 2) – Adds a dotted line at height 0
Drug Treatment Testing Famlilial Andenomatous Polyposis (FAP) • Counts of colonic polyps after 12 months of treatment – Don’t want to be the guy that did that • Placebo controlled (Binary factor) • Age
GLM Analysis • Multiple Regression won’t work. – Count data strictly positive – Normality is not probable • Poisson Regression- GLM with a log link function – Ensures a Poisson Distribution – Ensures positive fitted amounts • R command: – Polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())
Model Summary • • Call: glm(formula = number ~ treat + age, family = poisson(), data = polyps) • • • Deviance Residuals: Min 1 Q Median 3 Q Max -4. 2212 -3. 0536 -0. 1802 1. 4459 5. 8301 • • • Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 4. 529024 0. 146872 30. 84 < 2 e-16 *** • treatdrug -1. 359083 0. 117643 -11. 55 < 2 e-16 *** • age -0. 038830 0. 005955 -6. 52 7. 02 e-11 *** • • --Signif. codes: 0 ‘***’ 0. 001 ‘**’ 0. 01 ‘*’ 0. 05 ‘. ’ 0. 1 ‘ ’ 1 • (Dispersion parameter for poisson family taken to be 1) • • • Null deviance: 378. 66 on 19 degrees of freedom Residual deviance: 179. 54 on 17 degrees of freedom AIC: 273. 88 • Number of Fisher Scoring iterations: 5
Model Problem: Overdispersion • In previous models the variance can be seen as dependent solely on the mean – E. G. Binomial, Poisson • In practice, this doesn’t always work. – Sometimes the raw data points are not independent, there is some correlation, in this case, possible “clustering” • Compare residual deviance and degrees of freedom to determine – These should be basically equal
Model Solution: Quasi-Likelihood • This procedure estimates the other factors that might contribute to the variance • R Command – Polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson()) – Summary(polyps_glm_2) • The coefficients are still significant, but less so
Homework • Please run through the commands for the myplot function (pg 100) and send me the command script. – Please send me what you thought of the presentation, give me a grade and add any constructive criticism. • zweihanderdawg@gmail. com
- Slides: 32