Solved – Does one need to transform percentages/proportions for a multiple linear regression

data transformationgeneralized linear modellogitmultiple regressionregression

I am aware that one should transform percentages and proportions when using them in an ANOVA, due to the values being bounded by 0 and 1. I have seen suggestions that the best transformations are logit and arcsine (with benefits/problems with both).

However, I have two linked questions about a multiple linear regression.

1) Does one still need to transform the percentages and proportions when using them as predictor variables in a multiple linear regression? Or can they be left in their raw form?

2) How about when using percentages/proportions as an outcome variable in a multiple linear regression?

Clarification: As discussed in my original question, I am particularly interested in whether the guidance depends on the percentages/proportions being used as an outcome or predictor variable in a linear regression.

Best Answer

It is less of an issue whether a variable is expressed as a percentage then the underlying distribution of that variable and the residuals of linear regression. In fact, it may be argued that most variables measured are in some way bounded (eg max possible temperature) and discrete. In some cases proportional variables lend themselves to linear regression without transformations and in some cases they can be so clustered or skewed that none of the transformations can mitigate that. Arcsine and logit will work for intermediate cases, particularly when there are a lot of values close to 0 and 1.

Related Solutions

Solved – When to transform predictor variables when doing multiple regression

I take your question to be: how do you detect when the conditions that make transformations appropriate exist, rather than what the logical conditions are. It's always nice to bookend data analyses with exploration, especially graphical data exploration. (Various tests can be conducted, but I'll focus on graphical EDA here.)

Kernel density plots are better than histograms for an initial overview of each variable's univariate distribution. With multiple variables, a scatterplot matrix can be handy. Lowess is also always advisable at the start. This will give you a quick and dirty look at whether the relationships are approximately linear. John Fox's car package usefully combines these:

library(car)
scatterplot.matrix(data)

Be sure to have your variables as columns. If you have many variables, the individual plots can be small. Maximize the plot window and the scatterplots should be big enough to pick out the plots you want to examine individually, and then make single plots. E.g.,

windows()
plot(density(X[,3]))
rug(x[,3])
windows()
plot(x[,3], y)
lines(lowess(y~X[,3]))

After fitting a multiple regression model, you should still plot and check your data, just as with simple linear regression. QQ plots for residuals are just as necessary, and you could do a scatterplot matrix of your residuals against your predictors, following a similar procedure as before.

windows()
qq.plot(model$residuals)
windows()
scatterplot.matrix(cbind(model$residuals,X))

If anything looks suspicious, plot it individually and add abline(h=0), as a visual guide. If you have an interaction, you can create an X[,1]*X[,2] variable, and examine the residuals against that. Likewise, you can make a scatterplot of residuals vs. X[,3]^2, etc. Other types of plots than residuals vs. x that you like can be done similarly. Bear in mind that these are all ignoring the other x dimensions that aren't being plotted. If your data are grouped (i.e. from an experiment), you can make partial plots instead of / in addition to marginal plots.

Hope that helps.

Solved – Are ecologists the only ones who didn’t know that the arcsine is asinine

I teach it to public health students for two reasons:

one of my colleagues teach it (in the introduction course) as magic recipe, I show them the Delta method and how it is derived;
I think the Delta method and variance stabilizing transformations are not asinine and can be useful. The confidence interval computed using arcsin transform with correction of continuity is not perfect but behaves reasonnably well, and for small samples it is much much better¹ than the Wald procedure, which is still widely used.

As John for psychology and neuroscience, I think many people in epidemiology don’t even care, they just use linear models in a push-button way.

¹ Pires, Amado, 2008. Interval estimators for a binomial proportion.

Best Answer

Related Solutions

Solved – When to transform predictor variables when doing multiple regression

Solved – Are ecologists the only ones who didn’t know that the arcsine is asinine

Related Question