Solved – Why would one do a Wald test after linear regression

regression, wald test

I see that the Wald test can be performed on the model obtained from linear regression, as shown here. I understand that it indicates whether a predictor is adding significant value in explaining the outcome variable. However, I am not clear on why the Wald test needs to be done, since this information is already available from the linear regression itself. Any clarification will be appreciated.

Edit: I checked with some linear regressions, and the Wald test for the hypothesis (predictor_name = 0) gives almost exactly the same p-value as the fitted model's .summary() output (statsmodels package). So where is the need to do a Wald test?
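For example, here is a minimal sketch of the kind of check I did (simulated data; the variable names are just illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=200)

res = smf.ols("y ~ x1 + x2", data=df).fit()
print(res.pvalues["x1"])                    # t-test p-value from .summary()
print(res.wald_test("x1 = 0", use_f=True))  # single-constraint Wald (F) test: same p-value
```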

My main aim is: to determine "which predictors are independently affecting the dependent variable".

Best Answer

You have to be very careful about what you mean when you say that your aim is to determine "which predictors are independently affecting the dependent variable."

Multiple regression adjusts for the values of other predictors when evaluating the association of a predictor with outcome. If you find that predictor $x_1$ isn't significantly associated with outcome when considered individually in a simple regression but it is when other predictors are considered along with it in a multiple regression, do you consider $x_1$ to be "independently affecting the dependent variable"?

This becomes more of an issue when the model includes interactions. With an interaction you can't really talk about whether $x_1$ is "independently affecting the dependent variable," because the model already implies that the association between $x_1$ and outcome depends on the values of the predictors that it interacts with.

In both of those cases, $x_1$ can be closely associated with outcome even if it isn't doing so independently of other predictors. I don't think that in either case you'd want to just ignore $x_1$.

With that warning, let's consider the usual coefficient values and tests reported by statistical software, and what Wald tests add.

The usual output from a multiple regression model contains estimates of the coefficients for each predictor and interaction term individually, along with the associated standard errors and statistical significance tests based on the ratio of each coefficient to its standard error. In ordinary least squares the test is a t-test, appropriate for situations with normally distributed errors in which you are estimating both the mean values and the standard errors from the data. In generalized linear models like logistic regression the t-test isn't valid, so a normal approximation is used instead; the statistical test is then a z-test.

Two things to note. First, as the number of cases gets large, the distinction between the t-test and z-test becomes less and less important and the two tests will provide essentially the same result. Second, a z-test on a single coefficient, as in the usual output from regression software for a generalized linear model, is functionally the same as a Wald test. So with generalized linear models you can even say that the Wald test is the default test on the individual coefficients.
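To see that equivalence concretely, here is a sketch with simulated data, using the statsmodels interface the question mentions (I'm assuming its documented `wald_test` method): squaring the z statistic reported for a logistic regression coefficient reproduces the one-degree-of-freedom chi-square Wald statistic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=500)})
prob = 1 / (1 + np.exp(-(0.2 + 0.8 * df["x1"])))  # true logistic model
df["y"] = rng.binomial(1, prob)

res = smf.logit("y ~ x1", data=df).fit(disp=0)
z = res.tvalues["x1"]                        # reported as a z statistic for Logit
print(z ** 2)                                # equals the chi-square statistic below
print(res.wald_test("x1 = 0", use_f=False))  # 1-df chi-square Wald test on x1
```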

Wald tests are useful when you need to consider the association of multiple predictors with outcome together. An obvious example is when a predictor is involved in interaction terms with other predictors: you might want to know whether any of the direct or interaction terms involving it differs significantly from zero (see the sketch just below). But there are other examples, too.
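A joint Wald test that $x_1$ contributes nothing, either directly or through its interaction, might look like this (a sketch with simulated data; the constraint string uses the fitted design's column names, so check `exog_names` first):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.4 * df["x1"] + 0.6 * df["x2"] + 0.5 * df["x1"] * df["x2"] + rng.normal(size=n)

res = smf.ols("y ~ x1 * x2", data=df).fit()
print(res.model.exog_names)                # ['Intercept', 'x1', 'x2', 'x1:x2']
print(res.wald_test("x1 = 0, x1:x2 = 0"))  # joint test: direct term and interaction together
```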

Consider a multi-level categorical predictor, even with just 3 levels. With standard treatment coding of that predictor, the reported coefficients are for the differences of each of 2 levels from the reference level. The apparent "significance" of one level thus can depend on the choice of the reference level. What you really care about is the association of the entire categorical variable with outcome, including all levels regardless of choice of reference level.
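A sketch of that situation (simulated data with a 3-level predictor): testing both non-reference levels jointly gives a single answer for the whole variable, whichever level happens to be the reference.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
g = rng.choice(["a", "b", "c"], size=300)
y = rng.normal(size=300) + np.where(g == "b", 0.5, 0.0) + np.where(g == "c", 1.0, 0.0)
df = pd.DataFrame({"g": g, "y": y})

res = smf.ols("y ~ C(g)", data=df).fit()
# Treatment coding: coefficients are 'C(g)[T.b]' and 'C(g)[T.c]' vs reference level 'a'
print(res.wald_test("C(g)[T.b] = 0, C(g)[T.c] = 0"))  # joint 2-df test for the whole variable
print(res.wald_test_terms())  # convenience method: one joint Wald test per model term
```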

Or say that you have modeled a continuous predictor as a spline, resulting in multiple coefficients associated with it. Is that predictor associated with outcome when combining all those terms? Do the non-linear coefficients add anything?
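Here is a sketch of both questions for a polynomial-modeled predictor (simulated data; building the restriction matrix from `exog_names` avoids having to quote the generated column names in a constraint string):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.uniform(-2, 2, size=300)})
df["y"] = 1 + df["x"] - 0.7 * df["x"] ** 2 + rng.normal(size=300)

res = smf.ols("y ~ x + I(x**2) + I(x**3)", data=df).fit()
names = res.model.exog_names

def restriction(cols):
    # one row per coefficient being set to zero
    return np.eye(len(names))[[names.index(c) for c in cols]]

print(res.wald_test(restriction([n for n in names if n != "Intercept"])))  # any association at all?
print(res.wald_test(restriction([n for n in names if "I(" in n])))         # do non-linear terms add anything?
```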

Wald tests* provide a simple and general way to test such hypotheses. The usual application is to test whether all of a set of coefficients are 0. The test takes into account not only the variances of the individual coefficient estimates but also the covariances among them, which is important with the correlated predictors typically found in practice.
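To make the role of the covariances explicit: for the hypothesis $H_0\colon R\beta = r$, where the $q$ rows of $R$ pick out (or combine) the coefficients being tested, the Wald statistic is

$$W = (R\hat\beta - r)^\top \left(R \hat V R^\top\right)^{-1} (R\hat\beta - r),$$

where $\hat V$ is the estimated covariance matrix of $\hat\beta$; under $H_0$, $W$ is asymptotically $\chi^2_q$. Setting $r = 0$ with rows of $R$ taken from the identity matrix gives the usual "are all of these coefficients 0" test, and the off-diagonal elements of $\hat V$ are exactly the covariances just mentioned.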

So for considering whether $x_1$ is associated with outcome while considering all of its interaction terms, you do a Wald test on all of those coefficients. For evaluating a multi-level categorical predictor you do a Wald test on the coefficients for all levels of the predictor (necessarily excluding the reference level). For evaluating a spline-modeled continuous predictor, you do a test on all coefficients involving it. For evaluating whether the non-linear spline terms are adding anything, you evaluate all of their coefficients while omitting the linear term.

I don't use statsmodels so I can't speak to whether or under what conditions it performs Wald tests. If it only reports tests on individual coefficients then for ordinary least squares regression it probably is reporting t-tests, and for generalized models you might consider the coefficient tests to be functionally the same as Wald tests.

But such reports of single coefficients don't handle multi-level categorical predictors, polynomial- or spline-modeled continuous predictors, or predictors involved in interactions very well. To determine whether such predictors are "affecting the dependent variable," the Wald test provides a useful tool.


*In ordinary least squares regression, it's possible to use the chi-square statistic from a Wald test together with the error estimate from the regression to do an F-test rather than to depend on the asymptotic normality assumed by the Wald test. For simplicity, I'll include that analysis under "Wald test" here.
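In statsmodels, for example, this choice appears as the `use_f` flag of `wald_test` (a sketch with simulated data; I'm assuming the documented interface):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = df["x1"] + df["x2"] + rng.normal(size=200)

res = smf.ols("y ~ x1 + x2", data=df).fit()
print(res.wald_test("x1 = 0, x2 = 0", use_f=False))  # asymptotic chi-square version
print(res.wald_test("x1 = 0, x2 = 0", use_f=True))   # finite-sample F version
```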
