Solved – In linear regression, the data is highly skewed and transformation doesn’t work

Tags: heteroscedasticity, model selection, multiple regression, skewness

I have a dataset with 9524 observations and 97 variables.

Most of the variables are numerical, and some are factor variables (yes/no or several levels).

I want to perform multiple linear regression with this dataset.

  1. The first step was to log-transform the data, since the data is highly skewed. (I used log(x+1) because many values are 0.)

These are the histograms of my data after log transformation.
[Image: histograms of the variables after log transformation]
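For reference, a minimal sketch of the log(x+1) transform described above. It is written in Python rather than the R used in this analysis, and the sample values are invented for illustration; they are not from the dataset.

```python
import math

# Hypothetical right-skewed sample containing zeros (values invented for illustration).
values = [0, 0, 1, 3, 10, 250, 4000]

# log(x + 1): zeros map to 0 instead of -infinity, and large values are compressed.
transformed = [math.log1p(v) for v in values]

for raw, t in zip(values, transformed):
    print(f"{raw:>6} -> {t:.3f}")
```

Base R offers the same function as `log1p(x)`. Note that the transform keeps the ordering of the values and only changes their spacing, so a spike of zeros in the raw data remains a spike at 0 afterwards.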

"API05B" will be the dependent variable for the linear regression.

There are more variables, and most of them are heavily skewed (mostly right, some left).

  2. Anyway, I went ahead with the regression to see the results.
    Using 'regsubsets' with forward selection, I tried to select the best predictors (or best model) among those variables.

[Two images: regsubsets output]

I chose the number of predictors at which BIC or Cp (size of bias) showed a significant change, which was 3 predictors.

The following is the histogram of the variables selected by regsubsets.

[Image: histograms of the selected variables]

I decided not to worry about the distribution of the variables, since linear regression makes no assumptions about the distribution of the data.

But I was worried that the skewness would affect heteroskedasticity.

  3. Diagnostics for the linear model: outliers / multicollinearity / heteroskedasticity

I removed outliers and corrected the multicollinearity using VIF.

There was no problem with these, but..

I ran lmtest::bptest and lmtest::coeftest, and the results indicate that heteroskedasticity exists.

[Two images: test output]
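To make clear what a Breusch-Pagan test checks, here is a rough sketch of the studentized (Koenker) form of the statistic for a single predictor, in plain Python. This is not the lmtest implementation. The function names, the simulated data, and the seed are all assumptions made for illustration.

```python
import math
import random

def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """R^2 of the simple regression of y on x."""
    a, b = ols_simple(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan statistic and p-value (one predictor, 1 df)."""
    a, b = ols_simple(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor; statistic is n * R^2.
    stat = max(len(x) * r_squared(x, sq_resid), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

rng = random.Random(0)  # fixed seed, illustrative only
x = [rng.uniform(1, 10) for _ in range(2000)]
y_const = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]          # constant error variance
y_hetero = [1 + 2 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # error sd grows with x

stat_const, p_const = breusch_pagan(x, y_const)
stat_hetero, p_hetero = breusch_pagan(x, y_hetero)
print(f"constant variance: stat={stat_const:.2f}, p={p_const:.3g}")
print(f"growing variance:  stat={stat_hetero:.2f}, p={p_hetero:.3g}")
```

The test only asks whether the squared residuals can be predicted from the regressors. With thousands of observations, even a mild dependence produces a very small p-value, so the size of the effect is worth looking at alongside the test result.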

A Box-Cox transformation didn't work either.

Here's the summary plot of my final model.

enter image description here

I've read several articles about dealing with skewed data or heteroskedasticity. Most of them say that a log transformation or a Box-Cox transformation would help, but neither worked.

Some of them recommend not sticking to linear regression and trying, for example, robust linear regression, zero-inflated models, or two-part models.

Issues I want to solve:

  1. Dealing with skewness of the data or heteroskedasticity

    • Is another transformation needed?
  2. Predictor selection

    • regsubsets or lasso?
    • Transform first, or select predictors first?
  3. Other approaches

    • If none of the above helps, should I use one of the other approaches mentioned above?
  4. Is there anything wrong with my process?

Any tips or suggestions will be appreciated!! Thank you!

Best Answer

There are too many questions here. You are welcome to break them up into separate posts. Many of them are already answered well on this forum.

I will only address your first question here.

"There are more variables, and most of them are heavily skewed (mostly right, some left)."

It seems you may have some misunderstandings about the assumptions of linear regression. Linear regression does not assume that the independent variables (model inputs) are Gaussian distributed; the normality assumption is about the residuals.

Details can be found in these threads:

Why is the normality of residuals "barely important at all" for the purpose of estimating the regression line?

Why linear regression has assumption on residual but generalized linear model has assumptions on response?

The first link I provided also explains that normality of the residuals is not as important as you may think.
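To illustrate the point about predictors versus residuals, here is a small simulation sketch in plain Python. Everything in it (the lognormal predictor, the coefficients 2 and 3, the seed, the helper name `skewness`) is made up for the example and has nothing to do with your data.

```python
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)  # fixed seed, illustrative only
n = 5000
x = [rng.lognormvariate(0, 1) for _ in range(n)]  # heavily right-skewed predictor
y = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]    # linear signal plus Gaussian errors

# Ordinary least squares for one predictor.
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of x:         {skewness(x):.2f}")
print(f"skewness of y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
print(f"estimated slope:       {slope:.3f}")
```

In this setup the predictor and the response are both strongly skewed, while the residuals are roughly symmetric and the slope is recovered well. So a skewed histogram of the variables does not by itself say anything about whether the residual assumptions hold.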

For feature selection, see here
