OLS Regression – Handling Categorical Outcome Variable with 50 Categories

categorical-data · least-squares · logistic · regression

I'd like to do multiple regressions of test scores against a few continuous variables (e.g. age, clinical measures) but the test scores are discrete values (1-10 or 1-50). The function I wanted to use in Python didn't work because the outcome variable was assumed to be categorical, but in JASP I just set the variable type to continuous and it worked. However now I'm concerned that even though it "worked" the results are not reliable. Is it possible to say what problems could come up? E.g. can I trust the results, are they just going to be estimated in an inefficient way or could they be just wrong?
Also what would be a good alternative? I don't want to use dummy coding because it is just too many categories and doesn't seem right.

Best Answer

Don't mistake "discrete" for "categorical". To me, the latter definitely implies a lack of order, while your values certainly have order (a test score of $40$ is higher than a test score of $30$).

Consequently, you might find yourself interested in an ordinal regression model, such as the rms::orm function written by the Frank Harrell who commented on your post. In Python, statsmodels offers a comparable cumulative-link model in statsmodels.miscmodels.ordinal_model.OrderedModel.

If you want to stick with a simple model and use OLS, don't let Python force you to treat the outcome as categorical. Packages such as statsmodels (sm.OLS) and sklearn (LinearRegression) have no issue with this kind of data: both compute the usual OLS estimator, $ \hat\beta = (X^TX)^{-1}X^Ty $. (One caveat: the default regularization in sklearn belongs to LogisticRegression, not LinearRegression; if you did fit a categorical model with it, you would need to disable the penalty, which newer versions of sklearn allow via penalty=None.)
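To make the "perfectly consistent with the usual OLS estimator" claim concrete, here is a minimal sketch on simulated discrete scores: the closed-form normal-equations solution and a least-squares solver (the numerically stabler route library implementations take) give the same coefficients. The data and coefficients are invented for illustration.

```python
# Sketch: OLS on a discrete 1-50 outcome, computed two ways to show
# both agree with beta_hat = (X'X)^{-1} X'y. Simulated data only.
import numpy as np

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 80, n)
clinical = rng.normal(0, 1, n)

# Discrete test score: round a linear signal plus noise, clip to 1..50
score = np.clip(np.rint(10 + 0.3 * age + 4 * clinical + rng.normal(0, 3, n)), 1, 50)

X = np.column_stack([np.ones(n), age, clinical])  # intercept column first

# Normal equations: solve (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ score)

# Equivalent least-squares solve, as used internally by OLS routines
beta_lstsq, *_ = np.linalg.lstsq(X, score, rcond=None)

print(beta_normal)
print(beta_lstsq)  # identical up to floating-point error
```

Nothing in either computation cares that the outcome takes discrete values; discreteness mainly affects the error distribution (and hence inference near the 1 and 50 boundaries), not the point estimates.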