This is not a problem specific to R. R uses a conventional display of coefficients.
When you read such regression output (in a paper, textbook, or from statistical software), you need to know which variables are "continuous" and which are "categorical":

- The "continuous" ones are explicitly numeric and their numeric values were used as-is in the regression fitting.
- The "categorical" variables can be of any type, including those that are numeric! What makes them categorical is that the software treated them as "factors": that is, each distinct value that is found is considered an indicator of something distinct.
Most software will treat non-numerical values (such as strings) as factors. Most software can be persuaded to treat numerical values as factors, too. For example, a postal service code (ZIP code in the US) looks like a number but really is just a code for a set of mailboxes; it would make no sense to add, subtract, and multiply ZIP codes by other numbers! (This flexibility is the source of a common mistake: if you are not careful, or unwitting, your software may treat a variable you consider to be categorical as continuous, or vice-versa. Be careful!)
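To make that concrete, here is a minimal R sketch (the `zip` and `y` vectors are invented purely for illustration) contrasting the two treatments:

```r
zip <- c(10001, 10001, 94105, 94105, 60601)  # hypothetical ZIP codes
y   <- c(1.2, 0.9, 3.4, 3.1, 2.0)            # hypothetical response

coef(lm(y ~ zip))          # zip treated as a continuous number: one slope
coef(lm(y ~ factor(zip)))  # zip treated as a factor: one indicator per distinct code
```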
Nevertheless, categorical variables have to be represented in some way as numbers in order to apply the fitting algorithms. There are many ways to encode them. The codes are created using "dummy variables." Find out more about dummy variable encoding by searching on this site; the details don't matter here.
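If you do want a peek, R's `model.matrix` shows the dummy coding a fit will actually use; a minimal sketch with a made-up three-level factor:

```r
f <- factor(c("a", "b", "c", "b"))
model.matrix(~ f)  # default treatment coding: "a" is the base level,
                   # columns fb and fc are 0/1 dummy variables
```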
In the question we are told that `h` and `f` are categorical ("discrete") variables. By default, `log(d)` and `a` are continuous. That's all we need to know. The model is
$$\begin{aligned}
y ={}& \color{red}{-0.679695} \\
&+ \color{RoyalBlue}{1.791294}\ \log(d) \\
&+ 0.870735 && \text{if } h=h_1 \\
&- 0.447570 && \text{if } h=h_2 \\
&+ \color{green}{0.542033} && \text{if } h=h_3 \\
&+ \color{orange}{0.037362}\ a \\
&- 0.588362 && \text{if } f=f_1 \\
&+ \color{purple}{0.816825} && \text{if } f=f_2 \\
&+ 0.534440 && \text{if } f=f_3 \\
&- 0.085658\ a && \text{if } h=h_1 \\
&- 0.034970\ a && \text{if } h=h_2 \\
&- \color{brown}{0.040637}\ a && \text{if } h=h_3
\end{aligned}$$
The rules applied here are:

- The "intercept" term, if it appears, is an additive constant (first line).
- Continuous variables are multiplied by their coefficients, even in "interactions" like the `h1:a`, `h2:a`, and `h3:a` terms. (This answers the original question.)
- The coefficient of a categorical variable (or factor) is included only for cases where that value of the factor appears.
For example, suppose that $\log(d)=2$, $h=h_3$, $a=-1$, and $f=f_2$. The fitted value in this model is
$$\hat{y} = \color{red}{-0.6797} + \color{RoyalBlue}{1.7913}\times (2) + \color{green}{0.5420} + \color{orange}{0.0374}\times (-1) + \color{purple}{0.8168} -\color{brown}{0.0406}\times (-1).$$
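If you want to verify the arithmetic, here is a short R sketch reproducing the calculation directly from the quoted coefficients (the names in `b` are my own labels, not R output):

```r
# Coefficients quoted from the model above
b <- c(intercept = -0.679695, log_d = 1.791294, h3 = 0.542033,
       a = 0.037362, f2 = 0.816825, h3_a = -0.040637)

log_d <- 2; a <- -1  # the example case, with h = h3 and f = f2
y_hat <- b["intercept"] + b["log_d"] * log_d + b["h3"] +
         b["a"] * a + b["f2"] + b["h3_a"] * a
unname(y_hat)  # about 4.265
```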
Notice how most of the model coefficients simply do not appear in the calculation: because `h` can take on exactly one of the three values $h_1$, $h_2$, $h_3$, only one of the three coefficients $(0.870735, -0.447570, 0.542033)$ applies to `h` and only one of the three coefficients $(-0.085658, -0.034970, -0.040637)$ multiplies `a` in the `h:a` interaction; similarly, only one coefficient applies to `f` in any particular case.
For the original factor predictors, it is arguable whether insignificant levels should be "merged". Note, however, that your approach of simply dropping the insignificant levels is incorrect.

For a significant factor predictor (one with at least one significant level), which levels appear significant depends on which level is chosen as the base level, because each level's estimate is the difference between that level and the base level.
For example, suppose a significant factor has four levels A, B, C, D.

If we choose level A as the base level, we get a result like the one below (only level D is significant):

    B .
    C .
    D ****

However, when we choose level D as the base level, we find that all the other levels are significant:

    A ****
    B ****
    C ****

This is because levels A, B, and C are similar to one another, while level D differs from all of them.
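A small simulated R example of this base-level effect (the data and the effect size are invented for illustration):

```r
set.seed(1)
g <- factor(rep(c("A", "B", "C", "D"), each = 50))
y <- ifelse(g == "D", 3, 0) + rnorm(200)  # A, B, C share a mean; D differs

summary(lm(y ~ g))                      # base level A: only gD looks significant
summary(lm(y ~ relevel(g, ref = "D")))  # base level D: A, B, and C all significant
```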
As a result, simply dropping insignificant levels does not make sense. Many researchers think we should include all the levels as long as at least one of them is significant; the authors of R take this approach, and it is simple.

Researchers of another school think we can "merge" the insignificant levels to reduce the number of parameters. But this idea requires a more sophisticated procedure that tests the potential combinations of merged levels step by step.

For the example above, we could first try merging AB, AC, and BC, obtain three new models, and pick the best one (say AB, giving levels AB, C, D). We could then try merging AB with C and test that, because we should drop only one parameter at each step and test it.

For the significant levels, we should also try merging them, for the reason given above.

So if we follow this school, the workload increases a lot, because we have to try all the combinations of level pairs step by step, for both significant and insignificant levels.
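A sketch of a single merge-and-test step in R, under the same invented A/B/C/D setup as above:

```r
set.seed(1)
g <- factor(rep(c("A", "B", "C", "D"), each = 50))  # same invented data as above
y <- ifelse(g == "D", 3, 0) + rnorm(200)

g_merged <- g
levels(g_merged)[levels(g_merged) %in% c("A", "B")] <- "AB"  # collapse A and B

# F test: does keeping A and B as separate levels improve the fit?
anova(lm(y ~ g_merged), lm(y ~ g))
```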
For the original continuous variables, we may split them into factors/groups, but this approach also needs sophisticated testing. We would first treat the numeric variable as a one-level factor, then try splitting it into a two-level factor at every candidate split point and choose the best point, then try splitting one of the resulting levels into two again. This idea is similar to CART (classification and regression trees), which also splits numeric variables into discrete groups/nodes in order to model non-linear effects.
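A minimal sketch of one such split in R, with an invented predictor and a single candidate split point at 0 (a full search would compare many split points):

```r
set.seed(2)
x <- runif(200, -1, 1)         # invented continuous predictor
y <- (x > 0) * 2 + rnorm(200)  # a step effect, purely for illustration

# Discretize at the candidate split point and fit the grouped model
x_grp <- cut(x, breaks = c(-Inf, 0, Inf), labels = c("low", "high"))
summary(lm(y ~ x_grp))
```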
Alternatively, we can use splines and similar tools to model non-linear effects, which may be easier in some cases than splitting into factors.
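For comparison, the same kind of non-linear effect handled with a natural spline from the `splines` package that ships with R; the data here are again invented:

```r
library(splines)

set.seed(3)
x <- runif(200, -1, 1)
y <- sin(2 * x) + rnorm(200, sd = 0.3)  # invented smooth non-linear effect

summary(lm(y ~ ns(x, df = 3)))  # natural cubic spline with 3 degrees of freedom
```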
Best Answer
If you are putting in a predictor variable with multiple levels, you either put in the variable or you don't; you can't pick and choose levels. You might want to restructure the levels of your predictor variable to decrease their number (if that makes sense in the context of your analysis). However, I'm not sure whether collapsing levels because you see they are not significant would cause some type of statistical invalidity.
Also, just a note: you say small $p$-values are insignificant. I assume you meant that small $p$-values are significant, i.e., a $p$-value of .0001 is significant and therefore you reject the null (assuming an $\alpha$ level greater than .0001).