Solved – Spread-Level Plot versus Power Transformation Functions in R

data transformation, heteroscedasticity, r, time series

I'm having trouble interpreting the results from the Spread-Level Plot function in R (car package). The documentation says:

PowerTransformation
spread-stabilizing power transformation, calculated as 1 – slope of the line fit to the plot.

This is not explicit enough for me. Should this transformation be applied to every variable in the regression?

For example, assume I have an lm object given by:

myFit <- lm(y ~ x1 + x2)

Then I use Spread-Level Plot:

slp(myFit)

If the 'suggested power transformation' is 0.5, then does that imply a homoscedastic model could be fit using one of the following?

refitA <- lm(sqrt(y) ~ sqrt(x1) + sqrt(x2))
refitB <- lm(sqrt(y) ~ x1 + x2)
refitC <- lm(sqrt(y) ~ sqrt(x1 + x2))

If I understand correctly, refitA would be the suggested model to approximate homoscedasticity. On the other hand, if I only want to transform the LHS, I would use the powerTransform function (also from the car package); i.e., an "estimated transform parameter" of 0.5 from powerTransform would imply that refitB is homoscedastic.

Is this correct?

Thanks!

Best Answer

The idea is to identify a possible transformation of the response that reduces the heteroskedasticity, assuming the model fits well enough for the spread and level to have been estimated with reasonable accuracy.

Which is to say, try refitB, but beware that if the original model was reasonably linear in untransformed X, the new one generally won't be.
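As a rough illustration of the point above, the sketch below computes by hand what the documentation describes (1 minus the slope of a line fit to the spread-level plot) and then refits with only the response transformed. The data are simulated and the variable names are made up for the example. The line here is an ordinary least-squares fit on the log-log scale, so the number may differ somewhat from what the car function reports.

```r
# Simulated data (an assumption for illustration): counts whose variance
# grows with the mean, so the spread increases with the level.
set.seed(1)
n  <- 2000
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- rpois(n, lambda = 2 + 3 * x1 + 2 * x2)

myFit <- lm(y ~ x1 + x2)

# Spread = |studentized residuals|, level = fitted values.
spread <- abs(rstudent(myFit))
level  <- fitted(myFit)
keep   <- level > 0 & spread > 0   # logs need strictly positive values

# Slope of the line fit to the plot on the log-log scale.
slope <- unname(coef(lm(log(spread[keep]) ~ log(level[keep])))[2])

# Suggested power, following the documentation's "1 - slope" rule.
power <- 1 - slope
print(power)

# Transform the response only (refitB in the question's notation).
refitB <- lm(I(y^power) ~ x1 + x2)
summary(refitB)
```

In practice the suggested power is usually rounded to a convenient value (0.5 for a square root, 0 for a log) before refitting.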

Things to watch out for: possible need to transform X as well, interaction where there wasn't any, or loss of interaction where there was.

If the noise level is high, you may not be able to easily tell linear from nonlinear, though, at least not without a loess fit or a similar smoother to pick the trend out from the noise.
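One way to run the kind of check described above is to smooth the residuals of the transformed model against a predictor. This is only a sketch on simulated data; the variable names and the choice of a default `loess` fit are assumptions for illustration.

```r
# Simulated data (illustrative assumption): y is linear in x on the raw scale.
set.seed(2)
n <- 500
x <- runif(n, 1, 10)
y <- rpois(n, lambda = 5 + 4 * x)

# After a square-root transform of the response, the relationship with x
# is no longer exactly linear.
fit_sqrt <- lm(sqrt(y) ~ x)
res      <- resid(fit_sqrt)

# Smooth the residuals against x; systematic curvature in the smooth
# suggests nonlinearity that the straight-line fit has missed.
sm  <- loess(res ~ x)
ord <- order(x)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(x, res, col = "grey", xlab = "x", ylab = "residual")
  lines(x[ord], fitted(sm)[ord], lwd = 2)
  abline(h = 0, lty = 2)
}
```

If the smooth stays close to the horizontal zero line, the linear specification is probably adequate on the transformed scale.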

Related Solutions

Solved – Heteroskedasticity in linear regression model & data transformation

Actually, I'd say just the opposite. Multicollinearity is often scoffed at as a concern. The only time it is a real issue is when one variable can be written as an exact linear function of others in the model (a male dummy variable is exactly equal to the constant/intercept term minus a female dummy variable; hence, you can't have all three in your model). A prime example is Goldberger's comparison to "micronumerosity."

Perfect multicollinearity means that your model cannot be estimated; imperfect multicollinearity often leads to large standard errors, but no bias or real problems; heteroskedasticity means that your standard errors are incorrect and your estimates are inefficient.
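The dummy-variable case can be seen directly in R. The sketch below uses simulated data with made-up variable names; when a regressor is an exact linear function of the others, `lm` cannot estimate its coefficient and reports it as `NA`.

```r
# Simulated data (illustrative assumption).
set.seed(3)
n      <- 100
male   <- rbinom(n, 1, 0.5)
female <- 1 - male               # exact linear function of intercept and male
y      <- 1 + 0.5 * male + rnorm(n)

# Intercept, male and female cannot all be estimated together.
fit <- lm(y ~ male + female)
print(coef(fit))                 # the coefficient on female is NA
```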

First, I would create a model that yields the parameter estimates as I want to interpret them (level change, percent change, etc.) by using logs as appropriate. Then, I would test for heteroskedasticity. The most accepted option is to simply use robust standard errors, which gives you correct standard errors but leaves the parameter estimates inefficient. Alternatively, you can use weighted least squares to get efficient estimates, but this has become less common unless you know the relationship between the variances of your observations (for example, each variance depends on the size of the observation, like the population of a country). Indeed, in cross-section econometrics with a data set of any real size, robust standard errors have become required irrespective of the outcome of a Breusch-Pagan (BP) test; they are applied almost automatically.
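Both options can be sketched in base R. The example below is an illustration on simulated data, with made-up names such as `pop`. The robust standard errors are computed by hand with the basic White (HC0) sandwich formula; dedicated packages offer small-sample variants that are not shown here. The weighted fit assumes the error variance is known to be proportional to `pop`.

```r
# Simulated data (illustrative assumption): error variance proportional to pop.
set.seed(4)
n   <- 200
pop <- runif(n, 1, 100)
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n, sd = sqrt(pop))

ols <- lm(y ~ x)

# Heteroskedasticity-robust (HC0) covariance:
# (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
X     <- model.matrix(ols)
e     <- resid(ols)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)        # each row of X scaled by its residual
se_robust <- sqrt(diag(bread %*% meat %*% bread))

# Conventional OLS standard errors, for comparison.
se_ols <- sqrt(diag(vcov(ols)))
print(rbind(se_ols, se_robust))

# Weighted least squares with weights inversely proportional to the variance.
wls <- lm(y ~ x, weights = 1 / pop)
print(coef(summary(wls)))
```

The point estimates from `ols` are unchanged by the robust calculation; only the standard errors differ. The weighted fit changes both.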

There isn't a good test for endogeneity. Your real problem is that the regressor is correlated with the error; OLS will force the regressor to be uncorrelated with the residual, so you won't find any correlation there. Endogeneity is what makes econometrics hard and is a whole topic unto itself.
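A small simulation makes the point concrete. In the sketch below (simulated data, made-up names), the regressor is built to be correlated with the true error, yet the OLS residuals are uncorrelated with it by construction, so inspecting residuals reveals nothing.

```r
# Simulated data (illustrative assumption): x is endogenous.
set.seed(5)
n <- 1000
u <- rnorm(n)                    # true error, unobserved in practice
x <- 0.8 * u + rnorm(n)          # regressor correlated with the error
y <- 1 + 2 * x + u

fit <- lm(y ~ x)

print(cor(x, u))                 # clearly nonzero
print(cor(x, resid(fit)))        # zero up to floating-point error
print(coef(fit))                 # slope estimate is pulled away from 2
```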

Solved – Initialize ARIMA simulations with different time-series

You can "fit" the model to different data and then simulate:

m2 <- Arima(z, model=m1)
simulate(m2, future=TRUE, bootstrap=TRUE)

m2 will have the same parameters as m1 (they are not re-estimated), but the residuals, etc., are computed on the new data.

However, I am concerned with your model. Seasonal models are for when the seasonality is fixed and known. With animal population data, you almost certainly have aperiodic population cycling. This is a well-known phenomenon and can easily be handled with non-seasonal ARIMA models. Look at the literature on the Canadian lynx data for discussion.

By all means, use the square root, but then I would use a non-seasonal ARIMA model. Provided the AR order is greater than 1, it is possible to have cycles. See

You can do all this in one step:

m1 <- auto.arima(y, lambda=0.5)

Then proceed with your simulations as above.
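To illustrate the earlier remark that a non-seasonal model with AR order greater than 1 can produce cycles, here is a sketch using only base R (`polyroot` and `arima.sim` from the stats package, not the forecast package). The AR(2) coefficients are hypothetical values chosen so that the roots of the AR polynomial are complex.

```r
phi <- c(1.3, -0.7)              # hypothetical AR(2) coefficients

# Roots of the AR polynomial 1 - phi1*z - phi2*z^2. A complex pair lying
# outside the unit circle gives a stationary series with pseudo-cycles.
roots <- polyroot(c(1, -phi))
print(roots)
print(Mod(roots))

# Simulate from this model; the sample autocorrelations should show a
# damped oscillation rather than a fixed seasonal pattern.
set.seed(6)
z <- arima.sim(model = list(ar = phi), n = 300)
acf_vals <- acf(z, lag.max = 40, plot = FALSE)$acf
print(round(acf_vals[1:15], 2))
```

The cycle length varies from one cycle to the next, which is why a seasonal model with a fixed period is not a good fit for this kind of data.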