OLS Regression – Handling Categorical Outcome Variable with 50 Categories

categorical-data · least-squares · logistic · regression

I'd like to do multiple regressions of test scores against a few continuous variables (e.g. age, clinical measures) but the test scores are discrete values (1-10 or 1-50). The function I wanted to use in Python didn't work because the outcome variable was assumed to be categorical, but in JASP I just set the variable type to continuous and it worked. However now I'm concerned that even though it "worked" the results are not reliable. Is it possible to say what problems could come up? E.g. can I trust the results, are they just going to be estimated in an inefficient way or could they be just wrong?
Also what would be a good alternative? I don't want to use dummy coding because it is just too many categories and doesn't seem right.

Best Answer

Don't mistake "discrete" for "categorical". To me, the latter definitely implies a lack of order, while your values certainly have order (a test score of $40$ is higher than a test score of $30$).

Consequently, you might find yourself interested in an ordinal regression model, such as the rms::orm function written by the Frank Harrell who commented on your post. In Python, statsmodels offers a comparable cumulative-link model in statsmodels.miscmodels.ordinal_model.OrderedModel.

If you want to stick with a simple model and use OLS, don't let Python force you to treat the outcome as categorical. Packages such as statsmodels (sm.OLS) and sklearn (LinearRegression) have no issue with this kind of data: both compute the usual OLS estimator, $ \hat\beta = (X^TX)^{-1}X^Ty $. (One caveat: the default regularization in sklearn belongs to LogisticRegression, not LinearRegression; if you did fit a categorical model with it, you would need to disable the penalty, which newer versions of sklearn allow via penalty=None.)
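To make the "perfectly consistent with the usual OLS estimator" claim concrete, here is a minimal sketch on simulated discrete scores: the closed-form normal-equations solution and a least-squares solver (the numerically stabler route library implementations take) give the same coefficients. The data and coefficients are invented for illustration.

```python
# Sketch: OLS on a discrete 1-50 outcome, computed two ways to show
# both agree with beta_hat = (X'X)^{-1} X'y. Simulated data only.
import numpy as np

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 80, n)
clinical = rng.normal(0, 1, n)

# Discrete test score: round a linear signal plus noise, clip to 1..50
score = np.clip(np.rint(10 + 0.3 * age + 4 * clinical + rng.normal(0, 3, n)), 1, 50)

X = np.column_stack([np.ones(n), age, clinical])  # intercept column first

# Normal equations: solve (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ score)

# Equivalent least-squares solve, as used internally by OLS routines
beta_lstsq, *_ = np.linalg.lstsq(X, score, rcond=None)

print(beta_normal)
print(beta_lstsq)  # identical up to floating-point error
```

Nothing in either computation cares that the outcome takes discrete values; discreteness mainly affects the error distribution (and hence inference near the 1 and 50 boundaries), not the point estimates.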