I am attempting to model a response variable that is theoretically bounded between -225 and +225. The variable is the total score that subjects got when playing a game. Although it is theoretically possible for a subject to score +225, the score depended not only on the subject's own actions but also on the actions of another player, so the maximum anyone scored was 125 (the highest score that two players playing each other can both achieve). This happened with very high frequency. The lowest score was +35.
This boundary of 125 is causing difficulty with a linear regression. The only thing I can think of doing is re-scaling the response to be between 0 and 1 and using a beta regression. If I do this, though, I am not sure I can really justify saying 125 is the top boundary (or 1 after transformation), since it is possible to score +225. Furthermore, if I did this, what would my bottom boundary be: 35?
Thanks,
Jonathan
Best Answer
Although I'm not entirely certain what your problem with linear regression is, I'm right now finishing an article about how to analyze bounded outcomes. Since I'm not familiar with beta regression, perhaps someone else will answer that option.
From your question I understand that you get predictions outside the boundaries. In this case I would go for logistic quantile regression. Quantile regression is a very neat alternative to regular linear regression. You can look at different quantiles and get a much better picture of your data than what's possible with regular linear regression. It also makes no assumptions regarding the distribution [1].
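To make "looking at different quantiles" concrete, here is a minimal Python sketch (standard library only, not the R I used for the plots). It assumes the usual check (pinball) loss that quantile regression minimises and fits only a constant, with no covariates. The scores are made up for illustration, and the function names are my own.

```python
import statistics

def pinball_loss(residual, tau):
    """Check (pinball) loss for a quantile level tau in (0, 1)."""
    return residual * tau if residual >= 0 else residual * (tau - 1)

def total_loss(values, candidate, tau):
    """Sum of pinball losses when every value is predicted by `candidate`."""
    return sum(pinball_loss(v - candidate, tau) for v in values)

def best_constant(values, tau):
    """Constant prediction that minimises the total pinball loss.

    The loss is piecewise linear and convex in the candidate, so a
    minimiser always sits on an observed value; searching those is enough.
    """
    return min(sorted(set(values)), key=lambda c: total_loss(values, c, tau))

# Made-up scores piling up at an upper limit of 125 (illustration only).
scores = [35, 60, 80, 95, 110, 125, 125, 125, 125]

median_fit = best_constant(scores, 0.5)  # tau = 0.5 targets the median
upper_fit = best_constant(scores, 0.9)   # tau = 0.9 targets the 90th percentile

print("tau = 0.5 fit:", median_fit, "sample median:", statistics.median(scores))
print("tau = 0.9 fit:", upper_fit)
```

A full quantile regression replaces the constant with a linear predictor and minimises the same loss, so each choice of `tau` gives its own set of coefficients.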
Transforming a variable can often cause funny effects in linear regression; for instance, you may find significance on the logistic-transformed scale that doesn't translate back to the regular scale. This is not the case with quantiles: the median is always the median regardless of which monotone transformation you apply. This allows you to transform back and forth without distorting anything. Prof. Bottai suggested this approach to bounded outcomes [2]. It's an excellent method if you want to do individual predictions, but it has some issues when you want to look at the betas and interpret them in a non-logistic way. The formula is simple:
$\operatorname{logit}(y) = \log\left(\frac{y + \epsilon}{\max(y) - y + \epsilon}\right)$
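Where $y$ is your score and $\epsilon$ is an arbitrary small positive number. As written, the formula takes 0 as the lower bound; with a different lower bound, subtract it from $y$ in the numerator.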
Here's an example that I did a while ago when I wanted to experiment with it in R:
This gives the following data scatter; as you can see, it is clearly bounded and inconvenient:
This results in the following picture where females are clearly above the upper boundary:
This gives the following plot with similar problems:
The logistic quantile regression, by contrast, gives a very nice bounded prediction:
Here you can see the issue with the betas: once retransformed, they differ between regions (as expected):
References
For the curious, the plots were created using this code: