Solved – R: Setting an F-statistic to determine variables for a multiple linear regression model

f-test, multiple-regression, p-value, r

I am trying to understand the steps behind the linear regression process. I already have a linear model like:

lmodel1 <- lm(y~x1+x2+x3, data=dataset)

for which R calculates several different things (coefficients, intercept, residuals, F-statistic, and p-value, among others).

At this point, I am mostly interested in F-statistic and p-value.
So far, I have concluded the following:

The process is iterative and begins by taking every variable into consideration. To achieve an optimal model for y, some x variables have to be "taken out". This comes as a result of calculating the F-statistic, which quantifies the relationship between each xi and the dependent variable y.
When the F value is smaller than the p-value(?), that variable is removed.
The next step of the process is to compare the F-statistic of an xi independent variable with an F-to-enter and an F-to-remove, in order to see whether the removed variable will be re-inserted into the equation. (?)

Now, please do correct me if I am wrong regarding the steps described above.
Is that what happens under lm()'s hood? Are those the right variables?

R-wise speaking, how can these values be shown, inserted, or calculated in a multiple linear regression model?

How is ANOVA related to the above?

I am afraid R's summary and help take too much for granted.

Best Answer

The lm function calculates the coefficients, but it does not calculate F-statistics or p-values; you need to run another function (summary, anova, etc.) on the result to see p-values. What the p-values mean depends on which function calculated them and how it was called. You seem to have run at least some of these functions, based on your question, but it is not clear which ones or how they were run.

You first need to decide what question or questions you are trying to answer. Then based on those questions you can decide on which functions to run on your regression and which tests to examine (sometimes (often) additional tests are included which should just be ignored).

Also, the F-statistic (or t-statistic) is a step toward a p-value; we don't compare p-values to F-statistics, we compute p-values from F-statistics.
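As a sketch of that last point, here is how the overall p-value reported by summary can be recomputed from the F-statistic and its degrees of freedom using pf. The dataset is simulated purely for illustration; your own data and formula would go in its place.

```r
# Simulated data, purely illustrative (y depends only on x1)
set.seed(1)
dataset <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dataset$y <- 1 + 2 * dataset$x1 + rnorm(50)

lmodel1 <- lm(y ~ x1 + x2 + x3, data = dataset)

# summary stores the F value with its numerator and denominator df
fs <- summary(lmodel1)$fstatistic

# upper-tail area of the F distribution = the overall p-value
p <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
```

The value of p computed this way matches the "p-value" printed on the last line of summary(lmodel1).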

When you run summary on a linear regression, it computes an overall F-test (reported near the bottom) that compares the full model to an overall mean. This answers the question of whether any subset (including the whole set) of the potential predictors is significantly related to the response. The summary function also does an adjusted t-test for each predictor, testing whether that predictor adds significantly to the prediction above and beyond the effect of all the other predictors in the model.
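Those two pieces of summary output can be pulled out directly (again with simulated data, for illustration only):

```r
# Simulated data, purely illustrative
set.seed(1)
dataset <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dataset$y <- 1 + 2 * dataset$x1 + rnorm(50)

lmodel1 <- lm(y ~ x1 + x2 + x3, data = dataset)
s <- summary(lmodel1)

s$fstatistic  # the overall F-test: F value, numerator df, denominator df
coef(s)       # per-predictor t-tests: Estimate, Std. Error, t value, Pr(>|t|)
```

The Pr(>|t|) column in coef(s) holds the adjusted t-test p-values described above, one row per coefficient (including the intercept).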

The anova function, when given a single regression object, does a set of sequential tests. Reading from top to bottom, the first test is whether the first predictor is significant by itself; the second is whether the second predictor is significant above and beyond the first (but ignoring the others); the third is whether the third predictor adds significantly above and beyond the first two; etc. These tests are really only meaningful if you put the predictors in a specific order to begin with, chosen based on the tests of interest. This functionality is mainly left over from the days when a single analysis took hours or even days. Now we can just fit a new model with a couple of keystrokes and a few seconds, so the tests of interest don't need to be as planned out ahead of time.
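The order dependence is easy to see by fitting the same model twice with the predictors reversed (simulated data, for illustration only):

```r
# Simulated data, purely illustrative
set.seed(1)
dataset <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dataset$y <- 1 + 2 * dataset$x1 + rnorm(50)

a1 <- anova(lm(y ~ x1 + x2 + x3, data = dataset))
a1  # row "x1": x1 alone; row "x2": x2 after x1; row "x3": x3 after x1 and x2

a2 <- anova(lm(y ~ x3 + x2 + x1, data = dataset))
a2  # same model, but every sequential test is now different
```

Each row's F-statistic and p-value change with the order, because each test is conditional on the terms that came before it.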

If the anova function is given 2 or more (nested) regression models, then it does a full-versus-reduced-model F-test, where the null hypothesis is that the simpler model fits just as well as the fuller model and the alternative is that the full model adds information beyond what is in the simpler model (if given more than 2 models, it compares 1 vs. 2, then 2 vs. 3, etc.). This simultaneously tests all the terms that are in the full model but not in the reduced one.
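A minimal sketch of that full-versus-reduced comparison, again on simulated data: the reduced model keeps only x1, so the single F-test below asks whether x2 and x3 together add anything beyond x1.

```r
# Simulated data, purely illustrative
set.seed(1)
dataset <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dataset$y <- 1 + 2 * dataset$x1 + rnorm(50)

reduced <- lm(y ~ x1, data = dataset)
full    <- lm(y ~ x1 + x2 + x3, data = dataset)

cmp <- anova(reduced, full)
cmp  # one F-test on 2 df: H0 = x2 and x3 add nothing beyond x1
```

Note the Df column for the comparison row is 2, matching the two terms that differ between the models.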

I have not figured out what question stepwise regression answers, just that it does not answer any of the questions that I am interested in. The consensus is moving away from doing automated stepwise regression.
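For completeness, if you do want to see automated selection in R, the built-in step function is the usual route; note that it selects by AIC rather than by the F-to-enter/F-to-remove thresholds of classical stepwise procedures. A minimal sketch on simulated data (illustration only, not an endorsement):

```r
# Simulated data, purely illustrative (only x1 truly matters)
set.seed(1)
dataset <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dataset$y <- 1 + 2 * dataset$x1 + rnorm(50)

full   <- lm(y ~ x1 + x2 + x3, data = dataset)
chosen <- step(full, direction = "backward", trace = 0)  # AIC-based elimination

formula(chosen)  # the formula step settled on
```

All the caveats in the answer above apply: which model step "chooses" depends on the data at hand, and the resulting p-values do not account for the selection.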