Solved – Whether to apply the logit transformation to proportional predictor variables in a multiple linear regression? [including proportions of 0.0%]

analysis, data transformation, logit, multiple regression, regression

In a linear regression, I have a number of predictor variables that are expressed as proportions. The outcome variable is continuous. My residuals are not normally distributed, showing a mild to moderate positive skew.

Should I use a logit transformation on the predictor variables to see how this impacts the residuals?

If yes, I am having trouble understanding what to do with cases that have a proportion of 0.0 on some of the predictor variables (e.g., a case with 0% on a variable). In particular, I've been considering the advice of Papke and Wooldridge (1996) and Baum (2008) on how to deal with such cases when the zeros are genuinely observed values rather than structural. They suggest various options, including winsorisation and something called a fractional logit model. However, they discuss these options only for proportional outcome variables, never for proportional predictor variables. Are such approaches appropriate for proportional predictor variables in a linear regression?

Best Answer

The Stata article you link discusses modeling an outcome that is a proportion. More specifically, it discusses binomial regression, which is equivalent to frequency-weighted logistic regression. If you don't know the denominator, nonlinear least squares or a logistic regression with non-integer outcomes and a dispersion parameter (the fractional logit approach) can handle the issue.
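As a rough sketch of that fractional-logit idea in Python with statsmodels (not from the linked article): the made-up data below have `y` as a proportion in [0, 1], zeros allowed, and a single covariate `x`; the non-integer outcome is passed to a binomial GLM, and robust (sandwich) standard errors stand in for the dispersion adjustment.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y is a proportion in [0, 1] (zeros allowed), x is a covariate.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.clip(rng.beta(2, 5, size=200) + 0.05 * x, 0, 1)

X = sm.add_constant(x)

# Fractional-logit-style fit: binomial GLM with a logit link applied to a
# non-integer outcome, with robust (sandwich) standard errors.
frac_logit = sm.GLM(y, X, family=sm.families.Binomial()).fit(cov_type="HC1")
print(frac_logit.summary())
```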

However, you mention using a proportion as a predictor. Generally, one doesn't apply such transformations to a proportion used as a predictor. Setting aside the singularities at 0 and 1, the transformation can create very high-leverage values, which makes the regression results harder to trust.
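To see the leverage issue concretely, here is a small illustrative sketch with made-up proportions: the logit explodes near 0 and 1, so a couple of near-boundary cases become extreme points that can dominate the fit after transformation.

```python
import numpy as np

def logit(p):
    """Log-odds of a proportion; undefined at exactly 0 or 1."""
    return np.log(p / (1 - p))

p = np.array([0.001, 0.05, 0.25, 0.50, 0.75, 0.95, 0.999])
print(np.round(logit(p), 2))
# roughly: [-6.91 -2.94 -1.10  0.00  1.10  2.94  6.91]
```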

If one is interested in inference, the choice of whether or not to transform a variable depends on the question you're trying to answer. Typically, I would prefer to leave the variable untransformed so that the beta coefficients are easier to interpret. For instance, suppose we are comparing voting preference to SES in a neighborhood-based survey administered to random households in each neighborhood. If I am regressing the number of household TVs on voting preference (aggregated to the neighborhood level), I would say, "Comparing neighborhoods that differ in Republican voting preference by 10 percentage points, the average number of TVs differs by X.X (95% CI X.X to X.X)."
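A minimal sketch of that kind of interpretation, assuming invented neighborhood-level data (`rep_share` and `mean_tvs` are hypothetical names) and leaving the predictor on the proportion scale:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical neighborhood-level data.
rng = np.random.default_rng(1)
rep_share = rng.uniform(0, 1, size=50)          # proportion, untransformed
mean_tvs = 1.5 + 0.8 * rep_share + rng.normal(scale=0.3, size=50)

fit = sm.OLS(mean_tvs, sm.add_constant(rep_share)).fit()

# With the predictor left on the proportion scale, a 10-percentage-point
# difference in voting share corresponds to 0.1 * beta TVs on average.
beta = fit.params[1]
lo, hi = fit.conf_int()[1]
print(f"10-pt difference: {0.1 * beta:.2f} TVs (95% CI {0.1 * lo:.2f} to {0.1 * hi:.2f})")
```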

But herein lies a caveat: what percentage are you measuring? If the variable is binary at the individual level but has been aggregated up to a percentage, you may need to apply some kind of weighting to account for differences in denominator sizes. In my SES study, I may only be able to sample 3 households in one neighborhood, giving possible voting-preference values of only 0, 0.33, 0.67, or 1, whereas I may gather 10 households in another neighborhood. The neighborhood with more sampled households should carry more weight (precision or inverse-variance weighting), since its observed proportion is estimated more accurately.
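One rough way to express that weighting is weighted least squares with weights proportional to the number of households sampled per neighborhood; this is only a sketch with invented variable names, not a full treatment of measurement error in a predictor.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_households = rng.integers(3, 11, size=50)                 # denominator per neighborhood
rep_share = rng.binomial(n_households, 0.4) / n_households  # observed proportion
mean_tvs = 1.5 + 0.8 * rep_share + rng.normal(scale=0.3, size=50)

# Weight each neighborhood by its sample size: the variance of an observed
# proportion shrinks roughly like 1/n, so precision is roughly proportional to n.
wls = sm.WLS(mean_tvs, sm.add_constant(rep_share), weights=n_households).fit()
print(wls.params)
```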

Keep this in mind when you look at how the "percentages" were obtained.
