Solved – Interpreting glm model output, assessing quality of fit

generalized-linear-model, goodness-of-fit, interpretation, r, regression

I have some background in stochastic processes (especially the analysis of nonstationary signals), but besides being a beginner in R, I have never worked with regression models before.
I have some trouble understanding the output of the function summary() in R when it is applied to the result of a glm model fitted to my data. Suppose I used the following command to fit a generalized linear model to my data:

glm_model <- glm(Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)

Then I use summary(glm_model) to obtain the following:

Call: 
glm(formula = Output ~ (Input1*Input2) + Input3 + Input4, data = mydata)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-7.4583  -0.8985   0.1628   1.0670   6.0673  
Coefficients:

                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        8.522e+00  6.553e-02 130.041  < 2e-16 ***
Input1            -3.819e-04  3.021e-05 -12.642  < 2e-16 ***
Input2            -2.557e-04  2.518e-05 -10.156  < 2e-16 ***    
Input3            -3.202e-02  1.102e-02  -2.906  0.00367 **     
Input4            -1.268e-01  7.608e-02  -1.666  0.09570 .      
Input1:Input2      1.525e-08  2.521e-09   6.051 1.53e-09 ***    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 2.487504)
    Null deviance: 18544  on 5959  degrees of freedom
Residual deviance: 14811  on 5954  degrees of freedom
  (1708 observations deleted due to missingness)
AIC: 22353
Number of Fisher Scoring iterations: 2

From an estimation theory perspective, I understand that "Estimate" and "Std. Error" are the estimates and the standard errors of the unknown parameters (beta1, beta2, …) of my model. However, there are some things I do not understand:

  1. How can I assess how good my fit is from the output of summary()? The standard errors of the parameter estimators alone cannot tell us the goodness-of-fit. I would expect to have access to the sampling distribution of a given parameter estimator, so as to know the % of estimates within ±1 std, ±0.5 std, or any ±x·std, for example. Another option would be knowing the theoretical distribution of the parameter estimator, so as to calculate its Cramér–Rao lower bound and compare it with the reported std.

  2. What does the t value (or Pr(>|t|)) have to do with the goodness-of-fit? Since I am not familiar with regression models, I do not know the connection between the Student t distribution and the estimation of the model parameters. What does it mean? Is the parameter estimator of the glm model distributed according to the Student t pdf (like the sample mean for small samples from an unknown population)? What conclusions should I draw from Pr(>|t|)?

  3. Is there a more general way of assessing the goodness-of-fit, like a measure of the variability of the data my model can capture, perhaps with a table of critical values for such a measure at a given significance level?

  4. When fitting a glm model, do we need to specify a significance level? If yes, why is such information not provided by the summary function?

  5. The summary function outputs some measures based on information theory, like AIC: 22353. Can we define an optimal reference value for AIC? What is a good AIC value? My intuition is that we cannot, as with other information-theoretic measures (mutual information, entropy, …).

Best Answer

First, as you have little prior experience with regression models, I would suggest that you obtain two freely available references. An Introduction to Statistical Learning covers linear regression and some examples of generalized linear models in a usefully broad context. Practical Regression and Anova using R, by Faraway, is more specifically focused on some of the questions you have.

Second, the glm model you presented seems to be equivalent to a standard linear regression model as usually analyzed by lm in R. The output of summary from an lm result might be more useful if your problem is a standard linear regression. glm is used for models that generalize linear regression techniques to "Output" or response variables that, for example, are classifications or counts rather than continuous real numbers. The glm summary may omit some types of lm summary values that are not properly provided by these generalized models, but it does provide the AIC value that is appropriate for models fit by the maximum-likelihood approach that glm uses.
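To see this equivalence concretely, here is a sketch (assuming your data frame `mydata` with the column names you showed) of refitting the same formula with lm; for a gaussian glm with the default identity link the coefficient estimates are identical, but the lm summary additionally reports R-squared and an overall F-test:

```r
# Same model via lm(); coefficients match the gaussian glm fit,
# but summary() also gives R-squared and an F-statistic.
lm_model <- lm(Output ~ Input1 * Input2 + Input3 + Input4, data = mydata)
summary(lm_model)
```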

Third, you need to be aware of an important distinction between different meanings of "goodness-of-fit." One meaning, captured readily from the output from lm, is how well the model fits the particular sample of data that you have. Depending on your application, however, you might be more interested in how well the model will generalize to new data samples. For that latter interest you will have to combine regressions with techniques like bootstrapping or cross-validation.
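A minimal hand-rolled cross-validation sketch, assuming `mydata` contains the columns used in your formula, might look like this (packages such as caret or rsample automate the same idea):

```r
# 5-fold cross-validation sketch: estimate out-of-sample RMSE.
# na.omit() mirrors the row-dropping glm did for missing values.
set.seed(1)
dat <- na.omit(mydata[, c("Output", "Input1", "Input2", "Input3", "Input4")])
folds <- sample(rep(1:5, length.out = nrow(dat)))
cv_rmse <- sapply(1:5, function(k) {
  fit <- lm(Output ~ Input1 * Input2 + Input3 + Input4,
            data = dat[folds != k, ])
  pred <- predict(fit, newdata = dat[folds == k, ])
  sqrt(mean((dat$Output[folds == k] - pred)^2))
})
mean(cv_rmse)  # average prediction error on held-out folds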

Fourth, as you have listed your predictor variables as "Input" and your outcome variable as "Output," you might be analyzing time-series variables. In that case more specialized techniques may be required to take into account issues like trends and autocorrelations. See this Cross Validated page as one place to start.

Now for your questions:

  1. The summary of an lm model includes an "Adjusted R-squared" value that is a simple summary of overall goodness of fit; it's essentially a measure of the fraction of overall variance that the model accounts for, with a correction for the number of variables that the model fits. That, however, is insufficient for testing the validity of a linear regression. For that you need to evaluate whether residual errors are relatively independent of fitted values, whether particular data points are unduly affecting the results, and so on. A plot of an lm model is a good way to start. The Faraway reference noted above goes into some detail. (See below for confidence intervals.)
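The diagnostic plots mentioned above take one line once you have an lm fit (sketch, assuming the same formula and data as in your question):

```r
# Four standard diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
lm_model <- lm(Output ~ Input1 * Input2 + Input3 + Input4, data = mydata)
par(mfrow = c(2, 2))
plot(lm_model)
```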

  2. The estimated regression coefficients, under the usual assumptions of linear regression, follow a Student t distribution. The probabilities listed in the summary specify how frequently a coefficient of that magnitude would be found by chance, if the coefficient were truly 0 with that standard error of estimation. The standard errors can be used to set up confidence intervals for the coefficients (question 1), as the Faraway reference demonstrates.
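R will compute those confidence intervals for you directly from the fitted object:

```r
# 95% confidence intervals for the coefficients, built from the
# estimates and their standard errors.
confint(glm_model, level = 0.95)
```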

  3. This is essentially covered in the answer to (1) above. I caution you to pay less attention to general measures of goodness-of-fit and more attention to the more detailed tests noted above that document whether the linear model is even a reasonable fit to begin with.

  4. In standard frequentist statistical testing, threshold p-values are pre-specified and those cases that pass the threshold are deemed "significant." If you had pre-specified p < 0.05, then all coefficients except that of Input4, including the Input1:Input2 interaction term, would be considered significantly different from 0, based on the t-tests noted in the answer to (2).
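If you want to apply such a threshold programmatically rather than reading the stars, you can extract the p-values from the coefficient table (sketch, using your fitted `glm_model`):

```r
# Names of coefficients passing a pre-specified 0.05 threshold.
p_vals <- coef(summary(glm_model))[, "Pr(>|t|)"]
names(p_vals)[p_vals < 0.05]
```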

  5. AIC values are useful for comparing among different models of the same data. The Wikipedia page explains it well, and the Faraway reference also explains it in the context of choosing among linear regression models. AIC is a measure of the likelihood (in a technical sense) of the model, corrected for the number of parameter values fit by the model. As for any such measure, what's a "good" AIC depends heavily on the subject matter; what might be spectacularly good for a clinical study would be terrible for particle physics. Some software reports AIC values without constant terms that can be ignored in model comparisons where only differences in AIC matter. Thus I would suggest that you not trust AIC values reported by a particular statistical package to be "true" AIC values unless you know the package very well.
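As a sketch of that comparative use of AIC (assuming the same `mydata`), you might compare your full model against one without the interaction; note that both fits must use exactly the same rows for the comparison to be valid:

```r
# AIC comparison of nested models of the same data;
# lower AIC indicates the better-supported model.
full    <- glm(Output ~ Input1 * Input2 + Input3 + Input4, data = mydata)
reduced <- glm(Output ~ Input1 + Input2 + Input3 + Input4, data = mydata)
AIC(full, reduced)
```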
