You might check out David Freedman's paper, "A Note on Screening Regression Equations." (ungated)
Using completely uncorrelated data in a simulation, he shows that, if there are many predictors relative to the number of observations, a standard screening procedure will produce a final regression containing more significant predictors than would be expected by chance, along with a highly significant F statistic. The final model suggests it is effective at predicting the outcome, but this success is spurious. He also illustrates these results using asymptotic calculations. Suggested solutions include screening on a sample and assessing the model on the full data set, and using at least an order of magnitude more observations than predictors.
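To make this concrete, here is a minimal R sketch of the kind of simulation Freedman describes (the settings -- 100 observations, 50 pure-noise predictors, screening at p < 0.25 -- are my recollection of his design and are illustrative only):

```r
## All predictors and the outcome are independent noise, so any
## "significance" in the screened model is an artifact of the screening.
set.seed(1)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

## First pass: fit the full model and keep predictors with p < 0.25
fit_full <- lm(y ~ X)
pvals    <- summary(fit_full)$coefficients[-1, 4]  # drop the intercept row
keep     <- which(pvals < 0.25)

## Second pass: refit using only the screened predictors
fit_screen <- lm(y ~ X[, keep])
summary(fit_screen)  # typically several "significant" t-tests and a
                     # respectable overall F statistic, all spurious
```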
You posed two questions, so I will simply comment on them in order:
Question 1: High degrees of freedom:
Such high degrees of freedom are normal with `pool.compare`. The function implements the procedure by Meng & Rubin (1992), in which the denominator degrees of freedom for the test statistic $D_m$ are derived under the assumption that the complete-data degrees of freedom are infinite (see also Rubin, 1987).
Thus, the procedure estimates the degrees of freedom to be smaller than in the hypothetical complete data (i.e., smaller than infinity), which often results in relatively large denominator degrees of freedom in MI. This can be inappropriate, especially in smaller samples.
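To see the scale of these degrees of freedom concretely, here is a hedged sketch using the `nhanes2` data shipped with `mice` (this assumes mice 2.x, where `pool.compare` exists, and the result component names `Dm`/`df2` as I remember them; inspect the result with `str()` if yours differ):

```r
## Illustrative only: compare two nested models fitted to multiply
## imputed data and look at the denominator df of the pooled Wald test.
library(mice)
imp  <- mice(nhanes2, m = 20, printFlag = FALSE, seed = 1)
fit1 <- with(imp, lm(chl ~ age + bmi + hyp))  # full model
fit0 <- with(imp, lm(chl ~ age))              # restricted model
res  <- pool.compare(fit1, fit0, method = "wald")
res$Dm      # pooled test statistic
res$df2     # denominator df -- often strikingly large
```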
Question 2: Correction formula of Barnard & Rubin:
The correction formula in Barnard & Rubin (1999) addresses the aforementioned problem, but for tests of scalar estimands (e.g., a single regression coefficient), not for multiparameter tests (as performed by `pool.compare`).
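For reference, the adjusted degrees of freedom in Barnard & Rubin (1999), as I read the paper, are

$$\tilde{\nu} = \left(\frac{1}{\nu_m} + \frac{1}{\hat{\nu}_{\mathrm{obs}}}\right)^{-1}, \qquad \hat{\nu}_{\mathrm{obs}} = \frac{\nu_{\mathrm{com}}+1}{\nu_{\mathrm{com}}+3}\,\nu_{\mathrm{com}}\,(1-\hat{\gamma}_m),$$

where $\nu_m$ is Rubin's (1987) classical degrees of freedom, $\nu_{\mathrm{com}}$ the complete-data degrees of freedom, and $\hat{\gamma}_m$ the estimated fraction of missing information, all for a single scalar estimand.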
Therefore, this correction formula is not the way to go here. Luckily, there is also a correction formula available for multiparameter tests. That formula was proposed by Reiter (2007) and was originally developed for the procedure by Li, Raghunathan, and Rubin (1991).
However, these two procedures are asymptotically identical in many cases, and the expression for the degrees of freedom is the same in $D_1$ and $D_3$. Therefore, I would suggest you apply Reiter's correction formula to the results of `pool.compare`. The formula is not much more difficult to apply than that of Barnard & Rubin, and it is also implemented in a couple of R packages.
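If you prefer not to do the arithmetic by hand: to the best of my knowledge, `mice` >= 3.0 replaces `pool.compare` with `D1()`, a wrapper around `mitml::testModels` that applies Reiter's correction when you pass the complete-data degrees of freedom (argument names per my reading of the docs; please verify against your versions). A hedged sketch:

```r
## Hedged sketch (mice >= 3.0): D1() compares two nested sets of pooled
## models and, when dfcom is supplied, uses Reiter's (2007) small-sample
## denominator df rather than the asymptotic value.
library(mice)
imp  <- mice(nhanes2, m = 20, printFlag = FALSE, seed = 1)
fit1 <- with(imp, lm(chl ~ age + bmi + hyp))
fit0 <- with(imp, lm(chl ~ age))
## nhanes2 has n = 25; the full model uses 5 parameters, so dfcom = 20
D1(fit1, fit0, dfcom = 20)
```

If you are on an older `mice`, `mitml::testModels(..., method = "D1", df.com = ...)` exposes the same correction directly.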
You can find some very readable applications of Reiter's correction formula in the article by van Ginkel and Kronenberg (2014), who apply the procedure of Li et al. (1991) with Reiter's correction to ANOVA (recall that Meng & Rubin, 1992, and Li et al., 1991, can be regarded as interchangeable in this case).
Edit:
However, it is quite possible that you will not see a big difference: the outcome of your hypothesis test will likely remain the same.
Do you disagree with @FrankHarrel's answer that parsimony comes with some ugly scientific trade-offs anyway?
I love the link provided in @MikeWiezbicki's comment to Doug Bates' rationale. If someone disagrees with your analysis, they can do it their way, and this is a fun way to start a scientific discussion about your base assumptions. A p-value does not make your conclusion an "absolute truth".
If the decision of whether or not to include a parameter in your model comes down to splitting hairs over what are, for scientifically meaningful samples, relatively small discrepancies in the df -- and you are not dealing with $n<p$ problems that justify more nuanced inference anyway -- then you have a parameter so close to meeting your cutoffs that you should be transparent either way: just include it, or analyze the model with and without it, but definitely discuss your decision transparently in the final analysis.