Solved – Percentage as dependent variable in multiple linear regression

Tags: logistic, logit, multiple regression, regression

Although I saw a few similar threads, I don't believe I saw the specific answer to the following question:

For simple linear or multiple linear regression, if your dependent variable is a percentage, are any assumptions violated? I know that Y should be continuous, but does it also technically have to be unbounded? I've never seen this listed as one of the assumptions, though I understand how a bounded dependent variable can cause specific issues.

In my case, I'm doing a multiple regression project for school where the dependent variable is the percentage of obese schoolchildren. Should I use a logit transformation or beta regression because Y is bounded?


In response to a comment: the kernel density plot for Y (pct_obese) is below. There doesn't seem to be bunching at the boundaries; rather, the bulk of the data hovers around 20%:
[Figure: kernel density plot of pct_obese]

Best Answer

You should not use linear regression here, nor should you transform your data with the logit transformation. You have a percentage variable in a sense, but that's just a way to display your data in a simplified manner. In another sense, you have a count of obese children out of a known total of kids. That is, you have binomial data.
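To illustrate why the underlying counts matter, here is a minimal sketch using the standard binomial formula for the standard error of a proportion. The two schools and their sizes are hypothetical and are not taken from the question's data.

```python
import math

def proportion_se(p, n):
    """Standard error of an observed proportion p based on n Bernoulli trials."""
    return math.sqrt(p * (1 - p) / n)

# Two hypothetical schools that both report 20% obese children.
small = proportion_se(0.20, 25)     # 5 obese out of 25 children
large = proportion_se(0.20, 2500)   # 500 obese out of 2500 children

print(f"n = 25:   SE = {small:.3f}")
print(f"n = 2500: SE = {large:.3f}")
```

Both schools show the same percentage, but the smaller school's estimate is ten times noisier. A regression on the bare percentages treats the two observations as equally informative, whereas a binomial model on the counts does not.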

Thus, you should use logistic regression on the actual counts of children. Exactly how to do that depends on how your software implements it; for a discussion of SAS and R, see "Difference in output between SAS's proc genmod and R's glm". People often think of logistic regression as the option to use when the response is 0/1, but it applies to any binomial response, including one made up of more than one Bernoulli trial.
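As a rough sketch of what logistic regression on counts involves, the following fits logit(p) = b0 + b1 * x to grouped binomial data with Newton-Raphson, using only the Python standard library. The data, the single predictor x, and the function name fit_binomial_logit are all made up for illustration; in practice you would probably rely on your software's GLM routine rather than hand-rolled code like this.

```python
import math

def fit_binomial_logit(x, successes, trials, iterations=25):
    """Fit logit(p_i) = b0 + b1 * x_i to grouped binomial counts (Newton-Raphson)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score vector (g0, g1) and the symmetric 2x2
        # information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p          # observed minus expected count
            w = ni * p * (1.0 - p)       # binomial variance of the count
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system H * delta = g explicitly.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical data: one row per school.
x = [0.0, 1.0, 2.0, 3.0, 4.0]        # made-up school-level predictor
obese = [30, 42, 55, 70, 88]         # number of obese children
total = [200, 210, 205, 215, 220]    # number of children measured

b0, b1 = fit_binomial_logit(x, obese, total)
fitted = [n / (1.0 + math.exp(-(b0 + b1 * xi))) for xi, n in zip(x, total)]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Each school contributes in proportion to the number of children measured, which is the information a regression on percentages alone would discard. In R, the equivalent model is usually written as glm(cbind(obese, total - obese) ~ x, family = binomial).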