Perhaps you can use the rlm()
function from the MASS package. I believe it's intended for datasets affected by outliers, but it's at least worth a try with your dataset.
Here is a paper that describes the details of the function.
Description:
Fit a linear model by robust regression using an M estimator.
Details:
Fitting is done by iterated re-weighted least squares (IWLS).
# Package examples
require(MASS)
summary(rlm(stack.loss ~ ., stackloss))
rlm(stack.loss ~ ., stackloss, psi = psi.hampel, init = "lts")
rlm(stack.loss ~ ., stackloss, psi = psi.bisquare)
# Using the diamonds data set (shipped with ggplot2)
library(ggplot2)
diam_rlm <- rlm(price ~ ., diamonds)
summary(diam_rlm)
# A plot as an example
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  stat_smooth(method = "rlm") +
  xlim(0, 2.9)

For the diamonds dataset, I'm sure you could improve the model with a transformation. However, since I don't know whether that carries over to your data, I left the diamonds data as is.
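If you do want to experiment with a transformation, here is a minimal sketch of one common choice for the diamonds data, a log-log specification for price and carat (just an illustration, not something tuned to your data):

# Hypothetical log-log specification for the diamonds data
library(MASS)
library(ggplot2)
diam_rlm_log <- rlm(log(price) ~ log(carat) + cut + color + clarity, data = diamonds)
summary(diam_rlm_log)

On the log scale the relationship between price and carat is far closer to linear, which is why a log transformation is the usual suggestion for this data set.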
This is essentially a "groupwise heteroskedastic" model, and it doesn't matter that the three $Y_i$'s and the three $X_i$'s may contain different random variables (and not just different realizations of the same random variables). All that matters is that
a) The coefficient vector is assumed identical,
b) The regressor matrices have the same column dimension, and
c) The $\mathbf U_1, \mathbf U_2, \mathbf U_3$ are assumed independent.
Then it is business as usual.
To solve it, we stack our data set per equation, i.e. all observations pertaining to the $Y_1$ equation first, etc., obtaining a $T\times K$ regressor matrix, with $T= T_1+T_2+T_3$. We have the model
$$\mathbf y = \mathbf X\beta + \mathbf u \tag{1}$$
Mind you, we don't really care if each equation has different regressors.
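As a concrete sketch of the stacking step, here is a small simulated example in R (three groups sharing the same coefficient vector but with group-specific error variances; for simplicity all three groups use the same single regressor):

# Simulate three groups with a common beta but group-specific error variances
set.seed(1)
make_group <- function(n, sigma, id) {
  x <- rnorm(n)
  data.frame(y = 1 + 2 * x + rnorm(n, sd = sigma), x = x, group = id)
}
stacked <- rbind(make_group(30, 1, 1), make_group(40, 3, 2), make_group(50, 5, 3))
X <- model.matrix(y ~ x, data = stacked)   # the T x K regressor matrix, T = 30 + 40 + 50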
The model is heteroskedastic, with $T \times T$ conditional covariance matrix
$$E(\mathbf u \mathbf u' \mid \mathbf X)=\mathbf V = \text{diag}(\sigma^2_1,...,\sigma^2_2,...,\sigma^2_3,...) \tag{2}$$ with appropriate sub-dimensions. This matrix is symmetric and positive definite, so, as a general matrix-algebra result, there exists a decomposition
$$\mathbf V^{-1} = \mathbf C'\mathbf C = \text{diag}(1/\sigma^2_1,...,1/\sigma^2_2,...,1/\sigma^2_3,...) \tag{3}$$
which also gives
$$\mathbf V = \mathbf C^{-1} (\mathbf C')^{-1} \tag{4}$$
So in our case
$$\mathbf C = \text{diag}(1/\sigma_1,...,1/\sigma_2,...,1/\sigma_3,...) \tag{5}$$
where $\mathbf C$ is a non-singular $T \times T$ square matrix. Consider the linear transformation of the model by premultiplication by $\mathbf C$:
$$\mathbf {\tilde y} = \mathbf C \mathbf y, \;\; \mathbf {\tilde X}= \mathbf C \mathbf X,\;\; \mathbf {\tilde u}= \mathbf C \mathbf u$$
This transformation makes the error term conditionally homoskedastic, since
$$E[\mathbf {\tilde u}\mathbf {\tilde u}'\mid \mathbf X] = \mathbf C E(\mathbf u\mathbf u'\mid \mathbf X) \mathbf C' = \mathbf C \mathbf V \mathbf C' = \mathbf C \mathbf C^{-1} (\mathbf C')^{-1} \mathbf C' = \mathbf I_T$$
and one can verify that all other properties of the linear regression model are satisfied. So running OLS on the transformed model is optimal (in the usual sense).
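Continuing the simulated example above, the transformation can be sketched as follows. In practice the $\sigma$'s are unknown, so here they are estimated from preliminary OLS residuals (i.e. feasible GLS), which is an extra step beyond the derivation:

# Estimate the group variances from preliminary OLS residuals (feasible GLS)
sig2    <- tapply(resid(lm(y ~ x, data = stacked))^2, stacked$group, mean)
sigma_i <- sqrt(sig2)[stacked$group]        # sigma for each observation
# Premultiplication by C from (5): divide every row by its sigma
y_t <- stacked$y / sigma_i
X_t <- X / sigma_i                          # each row of X divided by the corresponding sigma
beta_tr <- solve(crossprod(X_t), crossprod(X_t, y_t))   # OLS on the transformed model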
We will obtain
$$\hat \beta^{(tr)}_{OLS} = \left(\mathbf {\tilde X}'\mathbf {\tilde X}\right)^{-1}\mathbf {\tilde X}'\mathbf {\tilde y} = \left(\mathbf X'\mathbf C' \mathbf C\mathbf X\right)^{-1}\mathbf X' \mathbf C' \mathbf C\mathbf y = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}\mathbf X'\mathbf V^{-1}\mathbf y $$
But this is equivalent to running GLS on the un-transformed model,
$$\hat \beta_{GLS} = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}\mathbf X'\mathbf V^{-1}\mathbf y $$
and its conditional variance is easy to derive as
$$\text{Var}\left[\hat \beta_{GLS}\mid \mathbf X\right] = \left(\mathbf X'\mathbf V^{-1}\mathbf X\right)^{-1}$$
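On the same simulated data, the GLS formulas can be computed directly, and they reproduce the transformed-model OLS estimate; the last line shows the equivalent weighted least squares call via lm() (again just a sketch with estimated variances):

# GLS on the un-transformed model and its conditional variance
Vinv     <- diag(1 / sig2[stacked$group])                      # V^{-1} as in (3)
beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% stacked$y)
var_gls  <- solve(t(X) %*% Vinv %*% X)                         # (X' V^{-1} X)^{-1}
all.equal(c(beta_tr), c(beta_gls))                             # identical to the transformed-model OLS
coef(lm(y ~ x, data = stacked, weights = 1 / sig2[stacked$group]))  # the same estimate via weights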
Best Answer
Actually, I'd say just the opposite. Multicollinearity is often scoffed at as a concern. The only time it is a real issue is when one variable can be written as an exact linear function of others in the model (a male dummy variable would be exactly equal to a constant/intercept term minus a female dummy variable; hence, you can't have all three in your model). A prime example of the scoffing is Goldberger's comparison of multicollinearity to "micronumerosity," his tongue-in-cheek term for having too few observations.
Perfect multicollinearity means that your model cannot be estimated; imperfect (but high) multicollinearity often leads to large standard errors, but no bias or other real problems; heteroskedasticity means that your standard errors are incorrect and your estimates are inefficient.
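A minimal illustration of the dummy-variable case (simulated data, just to show the mechanics):

# With an intercept, male + female dummies are perfectly collinear: female = 1 - male
set.seed(2)
male   <- rbinom(100, 1, 0.5)
female <- 1 - male
y      <- 1 + 0.5 * male + rnorm(100)
coef(lm(y ~ male + female))   # lm() reports NA for the aliased (redundant) coefficient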
First, I would create a model that yields the parameter estimates as I want to interpret them (level change, percent change, etc.) by using logs as appropriate. Then, I would test for heteroskedasticity. The most accepted option is simply to use robust standard errors, which gives you correct standard errors but leaves the parameter estimates inefficient. Alternatively, you can use weighted least squares to get efficient estimates, but this has become less common unless you know the relationship between the variances of your observations (e.g., they each depend upon the size of the observation, like the population of a country). Indeed, in cross-section econometrics using a data set of any real size, robust standard errors have become required irrespective of the outcome of a Breusch-Pagan test; they are applied almost automatically.
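In R, a sketch of these steps might look like the following, using bptest() from lmtest and vcovHC() from sandwich (the simulated data and the "size" variable are just stand-ins):

# Simulated data where the error variance grows with a "size" variable
library(lmtest)     # bptest(), coeftest()
library(sandwich)   # vcovHC()
set.seed(3)
n     <- 200
dat   <- data.frame(x = rnorm(n), size = runif(n, 1, 10))   # think: population of a country
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = dat$size)
fit   <- lm(y ~ x, data = dat)
bptest(fit, ~ size, data = dat)                   # Breusch-Pagan test against the size variable
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # heteroskedasticity-robust standard errors
coef(lm(y ~ x, data = dat, weights = 1 / size^2)) # WLS when the variance structure is known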
There isn't a good test for endogeneity. Your real problem is that the regressor is correlated with the error; OLS will force the regressor to be uncorrelated with the residual, so you won't find any correlation there. Endogeneity is what makes econometrics hard and is a whole topic unto itself.
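A tiny simulation makes the point about why checking the residuals is uninformative (the regressor below is endogenous by construction, yet its sample correlation with the OLS residuals is numerically zero):

# x is correlated with the structural error e, so OLS is biased
set.seed(4)
e <- rnorm(500)
x <- 0.8 * e + rnorm(500)        # endogeneity: cor(x, e) is far from zero
y <- 1 + 2 * x + e
fit <- lm(y ~ x)
cor(x, e)                        # substantial correlation with the true error
cor(x, resid(fit))               # essentially zero, by construction of OLS
coef(fit)                        # the slope estimate is biased away from 2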