Solved – Check transformations over variables in model

data transformation, model, multiple regression, r, regression

I've got this model:

model <- lm (time~radius_mean+texture_mean+perimeter_mean+area_mean
             +smoothness_mean+compactness_mean+concavity_mean
             +concave_points_mean+symmetry_mean+fractal_dimension_mean+radius_se
             +texture_se+perimeter_se+area_se+smoothness_se+compactness_se
             +concavity_se+concave_points_se+symmetry_se+fractal_dimension_se
             +radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst
             +smoothness_worst+compactness_worst+concavity_worst+concave_points_worst
            +symmetry_worst+fractal_dimension_worst+tumor_size+lymph_node, model)
summary(model)

And I want to check which transformations I should apply to the variables. I tried a Box-Cox procedure to check whether a transformation of the response variable would be necessary:

library(MASS)  # provides boxcox()
boxcox(model, plotit = TRUE)
boxcox(model, plotit = TRUE, lambda = seq(0.2, 0.7, by = 0.05))

But the graph says no. At this point, how can I check whether a transformation of the independent variables is necessary?


Thank you for your answer. Maybe I explained myself badly. I just want to work out whether any variable needs to be transformed and which alternative models should be applied.

The point is that I am completely lost when it comes to transformations, and I don't know how to check for them with this many variables.


Thank you! I've done what you said. Here is a graph: [image]

I guess I should transform the response. What would be the next step?

Best Answer

I am going to broaden the question by pointing out some things that worry me, some enormously, about your project.

  1. Trivially, smoothness_worst is listed twice as a predictor.

  2. Assuming that to be fixed, you are still throwing 32 predictors into your model! That is a recipe for poor statistical science. The predictors are a ragbag of size and shape measures of various kinds. Some have dimensions, but there is no dimensional thinking evident in your choice of predictors. For example, if some response, time in your case, is linear in area, it is most unlikely to be linear in perimeter too, because area scales as the square of a length while perimeter scales as a length.

  3. Without knowing anything specific about your application, except a hint from some names that it might be in oncology, I am prepared to bet that you really need to thin down your predictors because, if they mean anything, many will be highly correlated with each other. Think in terms of groups of linear size measures, area size measures, shape measures, etc. and look carefully at their correlations.

  4. Most crucially of all, perhaps, I would guess that time is in practice always positive, in which case it's unlikely that linear regression really is the default model of choice. See e.g. this lucid post for an introduction to the argument that generalised linear models with logarithmic link are the starting point in modelling any response of this kind.
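Points 3 and 4 can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the data frame name `dat` is made up, and the Gamma family is one plausible choice for a positive response, not something the question establishes. Only the variable names are borrowed from the question.

```r
# Simulated stand-in data: names mimic the question, values are invented.
set.seed(1)
n <- 200
radius_mean     <- runif(n, 5, 15)
perimeter_mean  <- 2 * pi * radius_mean * exp(rnorm(n, sd = 0.02))
area_mean       <- pi * radius_mean^2 * exp(rnorm(n, sd = 0.05))
smoothness_mean <- runif(n, 0.05, 0.15)

# Positive response whose mean is log-linear in radius (an assumption).
mu   <- exp(4.5 - 0.1 * radius_mean)
time <- rgamma(n, shape = 4, rate = 4 / mu)
dat  <- data.frame(time, radius_mean, perimeter_mean, area_mean,
                   smoothness_mean)

# Point 3: look at correlations within one group of size measures.
size_vars <- c("radius_mean", "perimeter_mean", "area_mean")
size_cor  <- cor(dat[, size_vars])
print(round(size_cor, 3))

# Point 4: a generalised linear model with logarithmic link,
# keeping one representative of the size group rather than all three.
fit_glm <- glm(time ~ radius_mean + smoothness_mean,
               family = Gamma(link = "log"), data = dat)
print(summary(fit_glm))
```

With real data, the same `cor()` call applied group by group (linear sizes, areas, shape measures) shows which predictors are nearly interchangeable, and fitted values from the log-link model are positive by construction.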

Box-Cox, its wonderful name apart, is in my view oversold. Worrying about marginal distributions should take second place in regression to choosing a functional form that makes sense for the science and statistics of your problem. That doesn't rule out transforming some of the predictors, say on dimensional grounds.
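As one illustration of transforming a predictor on dimensional grounds: area has the dimensions of length squared, so its square root is on the same scale as a radius or perimeter. The sketch below uses simulated data, not the questioner's, in which the log of the mean response is linear in a length, and compares area entered raw with area entered as a square root. The names `dat2`, `fit_raw` and `fit_sqrt`, the Gamma family, and all numerical values are assumptions made for the example.

```r
# Simulated data: the log of the mean response is linear in a length
# (radius), while only the area is offered to the model as a predictor.
set.seed(2)
n      <- 500
radius <- runif(n, 5, 15)
area   <- pi * radius^2 * exp(rnorm(n, sd = 0.03))
mu     <- exp(4.5 - 0.2 * radius)
time   <- rgamma(n, shape = 50, rate = 50 / mu)
dat2   <- data.frame(time, area)

# Area as recorded versus area put back on a length scale.
fit_raw  <- glm(time ~ area,       family = Gamma(link = "log"), data = dat2)
fit_sqrt <- glm(time ~ sqrt(area), family = Gamma(link = "log"), data = dat2)

# Compare the two functional forms; lower AIC indicates the better fit.
print(AIC(fit_raw, fit_sqrt))
```

Because the data were generated with a length-scale effect, the square-root version should be preferred here. With real data the comparison is only a check on a choice that should first be argued from the science.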

Linear regression isn't a washing machine that takes in dirty, messy data and removes the dirt and mess. You have to think your way towards a model that does justice to your data and the underlying science too.