Significance vs. goodness-of-fit in regression

Tags: goodness of fit, multiple regression, p-value, regression, statistical significance

Assume that I am interested in analyzing the following linear regression model:
$$
Y = \beta_0 +\beta_1 x_1 +\beta_2 x_2+e
$$

What is the difference between testing each coefficient $\beta_i$ separately (via its p-value) and performing a goodness-of-fit test for the model as a whole?

In particular:

  1. Is it true that the p-value reported for each coefficient (for example, by MATLAB's glmfit function) corresponds to the null hypothesis that the coefficient is actually zero?

  2. Is it possible that a model resulting in a really good fit will have high p-values for all the coefficients? Is it possible that a model with low p-values for all the coefficients will result in a poor fit?

Best Answer

  1. Yes, the p-values that come with standard regression output test whether the associated beta (slope coefficient) is $0$. (It is possible to get p-values for tests against other values, but you have to know how to set that up; it isn't what software does by default, and it isn't very common. A sketch of one way to do it by hand in R appears after the examples below.)
  2. Yes, you can have high p-values for individual coefficients with a good fit and low p-values with a poor fit. The reason for this is straightforward: goodness of fit is a different question than whether the slope of the $X,\ Y$ relationship is $0$ in the population. Generally, when running a regression, we are trying to determine a fitted line that traces the conditional means of $Y$ at different values of $X$. (It is also possible to wonder about other aspects of a model, but that is the most basic and common feature.) Thus, a goodness of fit assessment is whether the model's fitted conditional means actually match the data's conditional means. The answer to this latter question can be either yes or no independently of whether the best estimate of the slope is $0$.

    Consider the following examples, which are coded in R. (I don't have access to MATLAB, but the code here is intended to be as close to pseudocode as I can make it.)

    ##### high p-value, good fit
    set.seed(6462)                  # this makes the example exactly reproducible
    x1 = runif(100, min=-5, max=5)  # the x-variables are uniformly distributed
    x2 = runif(100, min=-5, max=5)  #  between -5 and 5
    e  = rnorm(100, mean=0, sd=1)   # these are the errors
    y  = 0 + 0*x1 + 0*x2 + e        # the true intercept & slopes are 0
    
    m1 = lm(y~x1+x2)
    summary(m1)
    # ...
    # Coefficients:
    #               Estimate Std. Error t value Pr(>|t|)
    # (Intercept) -0.1257881  0.0992355  -1.268    0.208     # these p-values are
    # x1           0.0009124  0.0307466   0.030    0.976     # high & non-significant
    # x2          -0.0243975  0.0316458  -0.771    0.443
    # 
    # Residual standard error: 0.9884 on 97 degrees of freedom
    # Multiple R-squared:  0.006149,  Adjusted R-squared:  -0.01434 
    # F-statistic: 0.3001 on 2 and 97 DF,  p-value: 0.7415   # the whole model is ns
    

    [Figure: y plotted against x1 for the first model, with the fitted line, the true data generating process, and a LOWESS line]

    ##### low p-values, poor fit
    # the true intercept & slopes are not 0, but the relationships are curvilinear
    y2 = 5 + 0.65*x1 + -0.17*x1^2 + 0.65*x2 + -0.17*x2^2 + e  
    
    m2 = lm(y2~x1+x2)
    summary(m2)
    # ...
    # Coefficients:
    #             Estimate Std. Error t value Pr(>|t|)    
    # (Intercept)  1.42633    0.21650   6.588 2.31e-09 ***  # very low p-values
    # x1           0.64189    0.06708   9.569 1.14e-15 ***
    # x2           0.58869    0.06904   8.527 2.01e-13 ***
    # ...
    # 
    # Residual standard error: 2.156 on 97 degrees of freedom
    # Multiple R-squared:  0.6152,  Adjusted R-squared:  0.6073 
    # F-statistic: 77.54 on 2 and 97 DF,  p-value: < 2.2e-16
    

    [Figure: y2 plotted against x1 for the second model, with the fitted line, the true data generating process, and a LOWESS line]

    These examples show a model with high / non-significant p-values but a good fit for the predicted means (because the true slopes are $0$), and a model with very low / highly significant p-values but a poor fit for the predicted means (because, although the slopes within the region spanned by the data are far from $0$, the true relationships are curved rather than straight). The p-values are easy to see and understand in the output.

    To see the quality of the models' fits to the conditional means, I plotted the true data generating process (in this case I have it, because the data are simulated, but in general you won't). In a more typical case, you would just check whether the predicted means do a reasonable job of tracing the observed conditional means in your dataset; here I did that by plotting LOWESS lines. (The plots only display x1 and collapse over x2, but I could make analogous plots with x2, or fancier plots with both x1 and x2, and they would show the same thing.) Sketches of such a plot, of a formal lack-of-fit check, and of a test against a non-zero null value follow below.
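
    The exact plotting code is not shown here; a minimal sketch of that kind of check for the second example, fitting y2 on x1 alone rather than collapsing the full model over x2, and assuming the simulated objects above (x1, y2) are still in the workspace, might look like this:

    ##### sketch: visually comparing the fitted line to the conditional means
    plot(x1, y2, pch=16, col="gray60")               # the raw data
    abline(lm(y2~x1), lwd=2)                         # straight-line fit in x1 alone
    lines(lowess(x1, y2), col="red", lwd=2)          # LOWESS estimate of the conditional means
    curve(5 + 0.65*x - 0.17*x^2, add=TRUE, lty=2)    # x1 part of the true DGP (differs from
                                                     #  E[y2|x1] by a constant from the x2 terms)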
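
    The question also asks about a goodness-of-fit test as such. One simple check of the lack of fit in the second example (not the only possible one) is to compare the straight-line model to a model that adds the omitted quadratic terms; again this assumes m2 and the simulated variables are still in the workspace:

    ##### sketch: a formal lack-of-fit comparison via nested models
    m2.quad = lm(y2 ~ x1 + I(x1^2) + x2 + I(x2^2))  # add the curvature the straight-line model omits
    anova(m2, m2.quad)                              # a significant F-statistic indicates the linear
                                                    #  model misses the conditional means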
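
    Finally, for tests against a value other than $0$ (point 1 above), one way to do it by hand is to shift the t-statistic by the hypothesized value; here is a minimal sketch using m2 and a made-up null value of $0.5$ for the x1 slope:

    ##### sketch: testing a slope against a non-zero null value (H0: beta1 = 0.5)
    b.null = 0.5                                    # hypothesized value, chosen only for illustration
    est    = coef(summary(m2))["x1", "Estimate"]
    se     = coef(summary(m2))["x1", "Std. Error"]
    t.val  = (est - b.null) / se                    # usual t-statistic, shifted by the null value
    2 * pt(-abs(t.val), df=df.residual(m2))         # two-sided p-value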
