I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between the number of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # provides vif()

trial <- function(n, k1=2, k2=2) {
  # Balanced design: every (Var1, Var2) combination appears exactly n times
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)             # response unrelated to the design
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)                           # variance inflation factors
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 categories (30 combinations), 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

df <- subset(df, subset=(y < 0))

immediately before the line that fits the model in trial. This removes roughly half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories:

sapply(1:5, function(i) trial(i, 10, 10)) # 10 x 10 categories, 1-5 replicates

produces comparable values.
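For concreteness, here is the modified function in full (the name trial.unbalanced is mine; it is identical to trial above except for the subsetting line):

trial.unbalanced <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  df <- subset(df, subset=(y < 0))   # discard roughly half the data at random
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}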
There is no change in the interpretation of the parameters, because the parameters being estimated are algebraically identical in the original linear regression model with heteroskedasticity and in the transformed model (OLS on the latter gives the WLS estimator).
Let us take this at a leisurely pace.
Linear regression model
The linear regression model (potentially with heteroskedasticity) is the following
$$
\begin{align}
Y_i &= \beta_0 + \beta_1 X_{1i} + \dots + \beta_K X_{Ki} + \varepsilon_i \\
\mathbb{E}(\varepsilon_i \mid X_{1i}, \ldots, X_{Ki}) &= 0
\end{align}
$$
This is equivalent to the model for the conditional mean of $Y_i$,
$$
\mathbb{E}(Y_i \mid X_{1i}, \ldots, X_{Ki}) = \beta_0 + \sum_{k=1}^K \beta_k X_{ki}
$$
Interpretation of parameters
From here, we can get at the standard interpretation of the parameters of the linear regression model as marginal effects, that is
$$
\dfrac{\partial \mathbb{E}(Y_i \mid X_{1i}, \ldots, X_{Ki})}{\partial X_{ki}} = \beta_k
$$
This states that the regression coefficient on a regressor is the effect of a unit change in that regressor on the conditional mean of the outcome variable. For example, if $\beta_1 = 2$, then a one-unit increase in $X_{1i}$, holding the other regressors fixed, raises the conditional mean of $Y_i$ by $2$. Note that this interpretation does not depend on the heteroskedasticity assumption in the model. This is the interpretation that the estimated parameters retain, whether they are estimated by OLS or by WLS.
Transformed linear regression model
Now suppose that we transform the original regression model, under the assumption that the heteroskedasticity of the errors has the following form
$$
\mathbb{E}(\varepsilon_i^2\mid X_{1i}, \ldots, X_{Ki}) = \sigma^2 X_{ki}.
$$
The transformed model is
$$
\frac{Y_i}{\sqrt{X_{ki}}} = \beta_0 \frac{1}{\sqrt{X_{ki}}} + \beta_1\frac{X_{1i}}{\sqrt{X_{ki}}} + \ldots + \beta_k \frac{X_{ki}}{\sqrt{X_{ki}}} + \ldots + \beta_K \frac{X_{Ki}}{\sqrt{X_{ki}}} + \underbrace{\frac{\varepsilon_i}{\sqrt{X_{ki}}}}_{\equiv\, \nu_i}
$$
Aside: A more usual simple model for heteroskedasticity is
$$
\mathbb{E}(\varepsilon_i^2\mid X_{1i}, \ldots, X_{Ki}) = \sigma^2 X_{ki}^2
$$
in order to preserve the positivity of the second moment (in which case one would divide the model through by $X_{ki}$ rather than $\sqrt{X_{ki}}$).
Note that the model is now a classical linear regression model, since
$$
\begin{align}
\mathbb{E}(\nu_i\mid X_{1i}, \ldots, X_{Ki}) &= 0 \\
\mathbb{E}(\nu_i^2\mid X_{1i}, \ldots, X_{Ki}) &= \sigma^2
\end{align}
$$
(The second line follows because $\mathbb{E}(\nu_i^2 \mid X_{1i}, \ldots, X_{Ki}) = \mathbb{E}(\varepsilon_i^2 \mid X_{1i}, \ldots, X_{Ki})/X_{ki} = \sigma^2 X_{ki}/X_{ki} = \sigma^2$.) Therefore, OLS estimates of the parameters from this transformed model (that is, the WLS estimator) are BLUE, which is the whole point of the exercise. Note that a constant should not be included when estimating this model, since $\beta_0$ is now the coefficient on the regressor $1/\sqrt{X_{ki}}$. Also note that I have used the original regressors as conditioning variables, rather than the transformed regressors, since the same functions are measurable with respect to the two conditioning sets.
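As a sanity check, here is a minimal R sketch with simulated data (my own toy example, not part of the original argument) showing that WLS on the original variables and OLS on the transformed variables produce identical estimates. The weights argument of lm multiplies the squared residuals, so the weights here are $1/X_{ki}$:

set.seed(1)
n <- 200
x <- runif(n, 1, 5)                        # regressor, bounded away from zero
y <- 1 + 2*x + rnorm(n, sd = sqrt(x))      # error variance proportional to x

wls <- lm(y ~ x, weights = 1/x)            # WLS on the original model

# OLS on the transformed model: no constant; 1/sqrt(x) plays its role
ols.t <- lm(I(y/sqrt(x)) ~ 0 + I(1/sqrt(x)) + I(sqrt(x)))

coef(wls)    # estimates of beta_0 and beta_1
coef(ols.t)  # identical estimates, up to coefficient names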
Interpretation of parameters
The transformed model is equivalent to
$$
\mathbb{E}\left(\frac{Y_i}{\sqrt{X_{ki}}}\mid X_{1i}, \ldots, X_{Ki}\right) = \beta_0 \frac{1}{\sqrt{X_{ki}}} + \sum_{l=1}^K\beta_l\frac{X_{li}}{\sqrt{X_{ki}}}
$$
This is now the crucial part -- consider the expressions for the marginal effect of the conditional mean of the original outcome, first w.r.t. one of the original regressors, and then w.r.t. the corresponding transformed regressor.
- Multiplying the equation above through by $\sqrt{X_{ki}}$ recovers $\mathbb{E}(Y_i \mid X_{1i}, \ldots, X_{Ki}) = \beta_0 + \sum_{l=1}^K \beta_l X_{li}$, so that
$$
\frac{\partial \mathbb{E}\left(Y_i \mid X_{1i}, \ldots, X_{Ki}\right)}{\partial X_{li}} = \beta_l
$$
The same as before! Here I have used the fact that
$$
\mathbb{E}\left(\frac{Y_i}{\sqrt{X_{ki}}}\mid X_{1i}, \ldots, X_{Ki}\right) = \frac{1}{\sqrt{X_{ki}}}\mathbb{E}(Y_i \mid X_{1i}, \ldots, X_{Ki})
$$
since conditioning variables are treated as constants by the expectation operator.
- On the other hand, if I find the marginal effect with respect to the transformed regressor $X_{li}/\sqrt{X_{ki}}$ (holding $X_{ki}$ fixed), I get
$$
\frac{\partial\,\mathbb{E}\left(Y_i \mid X_{1i}, \ldots, X_{Ki}\right)}{\partial \left(X_{li}/\sqrt{X_{ki}}\right)} = \beta_l \sqrt{X_{ki}},
$$
which is clearly not the same as the parameter being estimated. To elaborate, this is the interpretation you are asking about -- "are the $\beta$s estimated by WLS the effect of a unit change in the rescaled regressors?" The answer, as demonstrated here, is no.
Why this makes sense
Note that you formulate the model the way you do (in terms of the original outcomes and regressors) because you are interested in the parameters of that model (the original $\beta_k$s). Features such as heteroskedasticity reduce the efficiency of the OLS parameter estimates, and you might want to correct for that using WLS or, more generally, (F)GLS. But it would be slightly counterproductive if this changed the interpretation of the model parameters that you are interested in. The key is in the way you say it -- OLS and WLS estimates of the model parameters, implying a single set of population parameters being estimated by both estimators. This can be formalised by saying that the OLS and WLS estimators are consistent for the same population parameters but differ in their asymptotic efficiency.
What most applied economists do
Most applied economists would rather their parameter estimates were close to the truth with high probability as the sample size grows; that is, they want their estimators to be consistent. A crucial aspect of WLS and FGLS is that they require the specification of an auxiliary model for the heteroskedasticity in order to obtain the extra efficiency those estimators afford. However, the price of getting this auxiliary model wrong can be the loss of consistency. Most applied economists therefore prefer to simply use White robust standard errors to correct the estimated standard errors of the OLS estimates, and live with the lower efficiency of their estimators.
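For illustration, here is a minimal sketch of that practice in R, using simulated heteroskedastic data of my own (the sandwich and lmtest packages are one standard way to compute White standard errors):

library(sandwich)   # heteroskedasticity-consistent covariance estimators
library(lmtest)     # coeftest() accepts a user-supplied covariance matrix

set.seed(2)
x <- runif(200, 1, 5)
y <- 1 + 2*x + rnorm(200, sd = x)    # heteroskedastic errors

fit <- lm(y ~ x)                     # plain OLS: consistent, but naive SEs
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))   # White's robust SEs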
Best Answer
It's a great question, because it concerns an important and fundamental issue about multiple regression. The answer is that weighting can change the VIF (a measure of collinearity among the regressors) by arbitrary amounts up or down (with a lower limit of $1$).
To see why that is, consider a model with two regressors $X_1$ and $X_2$. (We do not have to consider the response variable: it plays no role in computing the VIF.) Here is a scatterplot of 50 observations, with the symbol areas proportional to the weights:
If those weights were not applied, clearly the correlation would be strongly positive, because the points line up closely along a positively sloping line. Indeed, their (unweighted) VIF is $13.54$: large enough to drive one to investigate this model carefully for effects of collinearity. (Rules of thumb assert that VIFs above $5$ or $10$ begin to be of concern.)
The weights, though, throw the correlation in altogether a different direction. The points with the heaviest weights tend to be negatively correlated: some towards the upper left, others toward the lower right. These cause the weighted correlation to be nearly zero. Indeed, the VIF for these weighted data is merely $1.44$: low and benign.
The procedure also works in reverse: we could take the weighted configuration as our data for an OLS fit and then apply the reciprocal weights in a WLS. Thus it is just as possible for the weighted fit to increase the VIF as to reduce it.
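To make this computable, here is a minimal R sketch with my own toy data (not the data behind the figure above). With two regressors, each VIF equals $1/(1-r^2)$, where $r$ is the (possibly weighted) correlation between them; cov.wt() computes the weighted correlation:

# Toy data: five points on a positively sloping line, plus two points
# lying against the trend.
x1 <- c(1, 2, 3, 4, 5, 0, 5)
x2 <- c(1, 2, 3, 4, 5, 5, 0)
w  <- c(1, 1, 1, 1, 1, 10, 10)      # heavy weights on the two contrary points

vif2 <- function(r) 1 / (1 - r^2)   # VIF in the two-regressor case

r.unwtd <- cor(x1, x2)
r.wtd   <- cov.wt(cbind(x1, x2), wt = w, cor = TRUE)$cor[1, 2]

vif2(r.unwtd)   # near 1: little collinearity without the weights
vif2(r.wtd)     # substantially larger: here the weights increase the VIF

In this toy example the weights push the VIF up rather than down, the "reverse" direction just described; weights concentrated on nearly orthogonal points would push it down instead.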