Solved – Interpreting dummy variables in glm

categorical-data, generalized-linear-model, r

I'm trying to understand the output of glm when a categorical variable has more than 2 categories.

I'm analysing whether age affects death. Age is a categorical variable with 4 categories.

I use the following code in R:

mydata <- read.delim("Data.txt", header = TRUE)                     # read the tab-delimited data
mydata$Agecod <- factor(mydata$Agecod)                              # treat the age code as a factor
mylogit <- glm(Death ~ Agecod, data = mydata, family = "binomial")  # logistic regression of Death on Age
summary(mylogit)

I obtain the following output:

Call:
glm(formula = Death ~ Agecod, family = "binomial", data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4006  -0.8047  -0.8047   1.2435   2.0963  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.5108     0.7303   0.699   0.4843  
Agecod2      -0.6650     0.7715  -0.862   0.3887  
Agecod3      -1.4722     0.7658  -1.922   0.0546 .
Agecod4      -2.5903     1.0468  -2.474   0.0133 *

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 237.32  on 184  degrees of freedom
Residual deviance: 223.73  on 181  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 231.73

Number of Fisher Scoring iterations: 4

Since I have p-values for Agecod2, Agecod3 and Agecod4, and only Agecod4 has a significant p-value, my questions are:

  1. Is Age really associated with death?
  2. Is only the 4th age category associated with death?
  3. What happens with the first category, since I don't have a p-value for it?

Update:

Since Antoni Parellada says “It seems as though you have proven that old age is a good predictor of death” and Gung points out “You cannot tell from your output if Age is associated with death”, I’m still confused.

I understand that “(Intercept)” represents Agecod1 and is the “reference level”. According to Gung, “The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation.”
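To make that concrete for myself: the coefficients can be converted back to fitted probabilities with plogis(), the inverse of the logit link (a small sketch, assuming the mylogit fit above):

plogis(coef(mylogit)["(Intercept)"])                     # P(Death) for Agecod1, about 0.625
plogis(sum(coef(mylogit)[c("(Intercept)", "Agecod4")]))  # P(Death) for Agecod4, about 0.111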

My question now is:

Since the Agecod4 p-value (0.0133) indicates a significant difference from Agecod1 (the reference level), doesn’t that mean that age is associated with death?

I have also tried to perform a nested test with the following command:

anova(mylogit, test="LRT")

This gives:

       Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                     184     237.32            
Agecod  3   13.583       181     223.73 0.003531 *

Does it mean that Age is definitively associated with death?

Update2:

I have solved my problem using binary logistic regression in SPSS. The output is the same as that of “mylogit”, but with SPSS I obtain a global p-value for the overall variable Agecod, which is 0.008.

I don’t know if it is possible to obtain this “global p-value” with R, but since I know that I can use SPSS, it is not a big problem for me.
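That said, drop1() seems to give such an overall p-value directly in R; a minimal sketch, assuming the mylogit fit above:

drop1(mylogit, test = "LRT")   # one likelihood-ratio p-value for the whole Agecod factor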

Best Answer

(Taking these out of order.)

  1. The first category, Agecod1, is represented by the intercept. That is called the "reference level" of your factor variable. The Estimate for (Intercept) is the log-odds of the response for that level (i.e., the mean on the scale of the linear predictor). The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation. They probably don't answer the question you actually have; they may best be ignored.
  1. It makes no sense to say that "only the 4th age category is associated with death". No such thing is logically possible.
  1. You cannot tell from your summary output whether Age is associated with death. You need to fit a nested model that does not have Age, but is otherwise identical. Then you can perform a nested model test (anova(nested_model, mylogit, test="LRT")); see the sketch below.
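A minimal sketch of that comparison, assuming the mydata and mylogit objects from the question; model.frame() keeps exactly the rows mylogit used, so the observation deleted due to missingness is handled consistently:

used <- model.frame(mylogit)                                       # rows actually used by mylogit
nested_model <- glm(Death ~ 1, data = used, family = "binomial")   # same data, Age removed
anova(nested_model, mylogit, test = "LRT")                         # overall test for Age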

Updated to respond to additional information.

The anova() you ran tests Age as a whole. The p-value is listed as 0.003531, so that is significant unless your chosen alpha is less than that (which would be very unusual). Since your model is a logistic regression, there are several ways that such a test can be run, and therefore several p-values are possible. It is possible that SPSS is using a different method, and that both p-values are valid according to their own assumptions. To understand the different possibilities, it may help you to read my answer here: Why do my p-values differ between logistic regression output, chi-squared test, and the confidence interval for the OR?
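For instance, here is a sketch of the same overall test computed two ways; it assumes the car package is installed and uses the mylogit fit from the question:

library(car)
Anova(mylogit, test.statistic = "LR")    # likelihood-ratio version; matches anova(mylogit, test = "LRT")
Anova(mylogit, test.statistic = "Wald")  # Wald version; typically what SPSS's logistic output reports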
