Solved – How to interpret model diagnostics when doing linear regression in R

Tags: diagnostic, r, regression

I ran lm() on my data, with models built by fitting an individual lm() for each characteristic and then combining the characteristics with the top $R^2$ values, screened by $p$-value. For instance, the first few characteristics are taken outright, and the rest are included only if they have $p<.005$. My characteristics contain some duplication: for instance, in test P I have a characteristic and its normalized variant. My $p$-values are all very small, but my diagnostic plots do not look right for tests R and T. (I am referring to this blog post: Evaluating Linear Regression Model in R.)

In test P (and T) there is one outlier according to Cook's distance. How do I find and eliminate that observation?

According to this tutorial on Using R for Linear Regression,

The plot in the upper left shows the residual errors plotted versus
their fitted values. The residuals should be randomly distributed
around the horizontal line representing a residual error of zero; that
is, there should not be a distinct trend in the distribution of
points.

Test P looks OK in the residuals-vs-fitted plot, but tests R and T show a grouping. What does that mean, and how do I account for it?

The plot in the lower left is a standard Q-Q plot, which should
suggest that the residual errors are normally distributed. The
scale-location plot in the upper right shows the square root of the
standardized residuals (sort of a square root of relative error) as a
function of the fitted values. Again, there should be no obvious
trend in this plot.

Again, test P looks OK in the standard Q-Q plot, but tests R and T show a grouping. What does that mean, and how do I account for it?

Also, what are the coefficients in the output? I notice it lists the characteristics and a $p$-value, but I don't understand what they mean.

And finally, how do I make predictions using the model I created?

Test P
F-statistic: 2.684 on 280 and 2221 DF, p-value: < 2.2e-16
[diagnostic plots for test P]

Test R
F-statistic: 3.691 on 258 and 2243 DF, p-value: < 2.2e-16
[diagnostic plots for test R]

Test T
F-statistic: 4.029 on 268 and 2233 DF, p-value: < 2.2e-16
[diagnostic plots for test T]

Edit: after running gls(), my test P diagnostics look like this: [updated diagnostic plots]

Best Answer

This is a long and rambling question, so you are getting a long and rambling answer. Apologies. Using the example from the ?lm help page,

# plant weights for a control and a treatment group (the example from ?lm)
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels=c("Ctl","Trt"))  # factor: 10 Ctl, then 10 Trt
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)  # regress weight on group membership
summary(lm.D9)
## output:
Call:
lm(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.07308,    Adjusted R-squared: 0.02158 
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249 

I don't entirely understand your confusion on the "coefficients." The table simply presents the OLS estimate $\hat{\beta}$, the standard error of that estimate $SE(\hat{\beta})$, the $t$ value $\hat{\beta}/SE(\hat{\beta})$ (how many standard errors the estimate lies from 0), and the probability of observing a $t$ value that far from 0 if the true coefficient were 0. Forgive me for the basic statistics review; I can't tell if this is what you are asking for.
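If it helps, you can pull that table apart and reproduce the $t$ value and $p$-value by hand. A quick sketch using the lm.D9 fit above (nothing here is assumed beyond that object):

cf <- coef(summary(lm.D9))          # the coefficient table as a plain matrix
b  <- cf["groupTrt", "Estimate"]    # the OLS estimate of beta
se <- cf["groupTrt", "Std. Error"]  # its standard error
t  <- b / se                        # t value: how many SEs beta is from 0
p  <- 2 * pt(-abs(t), df.residual(lm.D9))  # two-sided p-value
c(t = t, p = p)  # matches the -1.191 and 0.249 in the summary() output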

Proper OLS regression modeling (which is what the lm() command runs) requires several assumptions, and these diagnostic plots are designed to check them.

The "Residuals vs Fitted" and "Scale-Location" charts are essentially the same, and show if there is a trend to the residuals. OLS models require that the residuals be "identically and independently distributed," that their distribution does not change substantially for different values of $x$. None of your charts is really satisfactory on this regard. If this assumption is not met, your $\beta$ estimates will still be good, but your $t$-statistics, and corresponding $p$-values, are garbage.

Another assumption is that the errors are approximately normally distributed, which is what the Q-Q plot allows you to see. Again, none of your plots really satisfies me in this regard. The consequences of this assumption not being met are the same as above ($\beta$'s good, $t$'s worthless).
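To draw these diagnostics for any fit, call plot() on the lm object; here is what that looks like for lm.D9 from above:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(lm.D9)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(lm.D9, which = 2)  # or pick a single plot, e.g. just the Normal Q-Q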

The "outliers" principle is actually not an assumption of OLS regression. But if you have outliers in certain locations, your $\beta$ parameters will be unduly influenced by them. In this case, both your $\beta$ and $t$ measurements are useless. You can remove an influential observation from a data frame by identifying its row number and issuing the command

data <- data[-offending.row,]

where offending.row is the number of the row you want to eliminate. The R diagnostic plots label the row numbers of potential outliers.
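Since you also asked how to find that observation in the first place: cooks.distance() returns one influence value per row, and a common (if rough) rule of thumb flags points above $4/n$. A sketch on lm.D9; substitute your own data frame name:

cd <- cooks.distance(lm.D9)                   # one Cook's distance per observation
offending.rows <- which(cd > 4 / length(cd))  # rough rule-of-thumb cutoff
offending.rows                                # candidate row numbers to inspect
# data <- data[-offending.rows, ]             # remove them, if justified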

I don't know what kind of data you have, but you should be very careful about eliminating observations you don't like. You should instead ask yourself how that observation came to be. If it is due to measurement error, by all means discard it. If not, is the observation part of the system you are trying to model? If so, you should keep it in and account for it in other ways.

I have two suggestions for your analysis. First, try GLS estimation. This method weights your observations to correct for heteroskedasticity and, to some degree, for outliers and non-normality. The R command for this is gls(), from the nlme package.
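A minimal sketch on the lm.D9 data, assuming (purely for illustration) that the residual variance differs between the two groups:

library(nlme)
gls.D9 <- gls(weight ~ group,
              weights = varIdent(form = ~ 1 | group))  # separate variance per group
summary(gls.D9)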

But it seems from your plots that your data are restricted in some way. In particular, test P looks like a variable that is either 1 or 0, or restricted to that range. For such a variable, you may want to look at binary logit or probit models, available with the command

glm(model, family=binomial(link="logit"))
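Concretely, with hypothetical names (a 0/1 response y01 and predictors x1 and x2 in a data frame df), that would look something like:

fit <- glm(y01 ~ x1 + x2, data = df,
           family = binomial(link = "logit"))
summary(fit)
predict(fit, type = "response")  # fitted probabilities on the 0-1 scale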

If your data are censored at 0 but not on the upper end, a tobit model is what you want; tobit() from the AER package looks like the right thing (I've never run a tobit model, I've only looked at the theory).
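For completeness, a sketch of what that would look like, with the same hypothetical df and a response y censored at 0:

library(AER)
fit.tobit <- tobit(y ~ x1 + x2, left = 0, data = df)  # censored below at 0
summary(fit.tobit)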

Finally, predictions are made with the predict() function. If you want to perturb your predictions afterwards (to create a distribution of possible predictions), the best way I know of is to add a random draw from the residual distribution. Using the example above,

#base prediction
pred.values <- predict(lm.D9)
# residual standard error (sigma), used as the sd of the perturbations
SER <- summary(lm.D9)$sigma
#perturbations
pert <- rnorm(length(pred.values), mean=0, sd=SER)
SIMULATION.VALUES <- pred.values + pert

You can get multiple alternate simulations by repeating the last two steps.
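And to answer your prediction question directly: to predict for new observations rather than the fitted values, pass a newdata data frame to predict(); interval = "prediction" also gives prediction intervals without the manual perturbation above. With lm.D9:

new.obs <- data.frame(group = factor(c("Ctl", "Trt"), levels = c("Ctl", "Trt")))
predict(lm.D9, newdata = new.obs)                           # point predictions
predict(lm.D9, newdata = new.obs, interval = "prediction")  # with 95% prediction intervals

Good luck.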
