Q: " ... how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here??
A: You have no doubt noticed that there is no mention of x2="High" in the output. At the moment, x2="High" is being used as the "base case". That's because you supplied a factor variable with R's default level coding, even though Low/Medium/High would be the more natural ordering to the human mind. Since "H" sorts alphabetically before both "L" and "M", R chose "High" as the base case.
Since x2 was not an ordered factor, treatment contrasts were used, and each of the reported contrasts is relative to x2="High": x2="Low" was estimated at -0.78 relative to x2="High", and the Intercept is the estimated value of Y when x2="High" and x1 = 0. You probably want to re-run your regression after changing the level ordering (but without making the factor ordered):
x2a = factor(x2, levels=c("Low", "Medium", "High"))
Then your 'Medium' and 'High' estimates will be more in line with what you expect.
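As a minimal sketch with made-up data (the original model isn't shown in the question), releveling changes which level is absorbed into the Intercept:

set.seed(1)
x1 <- rnorm(30)
x2 <- factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE))
y  <- 2 + 0.5 * x1 + rnorm(30)
coef(lm(y ~ x1 + x2))    # reference level is "High" (alphabetical order)
x2a <- factor(x2, levels = c("Low", "Medium", "High"))
coef(lm(y ~ x1 + x2a))   # reference level is now "Low"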
Edit: There are alternative coding arrangements (or, more accurately, alternative arrangements of the model matrix). The default choice for contrasts in R is "treatment contrasts", which designate one factor level (or one particular combination of factor levels) as the reference level and report estimated mean differences for the other levels or combinations. You can, however, have the reference be the overall mean by forcing the Intercept to be 0 (not recommended) or by using one of the other contrast choices:
?contrasts
?C # which also means you should _not_ use either "c" or "C" as variable names.
You can choose different contrasts for different factors, although doing so would seem to impose an additional interpretive burden. S-Plus uses Helmert contrasts by default, and SAS uses treatment contrasts but chooses the last factor level rather than the first as the reference level.
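Continuing the made-up example above, a quick sketch of swapping in other contrast schemes for a single factor (contr.sum and contr.helmert are base R):

x2s <- x2a
contrasts(x2s) <- contr.sum(3)       # coefficients become deviations from the grand mean
coef(lm(y ~ x1 + x2s))
contrasts(x2s) <- contr.helmert(3)   # the S-Plus default
coef(lm(y ~ x1 + x2s))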
Question:
Can dummy variables have overlapping categories?
Answer:
No.
Explanation:
Dummy variables arise when you recode a categorical variable with more than two categories into a series of binary variables. Since the categories partition your dataset (i.e., each observation can be assigned to one and only one of the k categories), there is no way for them to "overlap".
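To see the non-overlap concretely, here is a small sketch (with a hypothetical ground-type factor; the question's actual variables aren't shown) of the dummy coding R builds via model.matrix():

ground <- factor(c("grass", "astroturf", "concrete", "grass"))
model.matrix(~ ground)   # one dummy column per non-reference level
# Each row has at most a single 1 among the dummy columns, because the
# categories are mutually exclusive: every observation sits in exactly one.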
Now, with respect to the actual example you provide, there are two issues you should be aware of, since either could otherwise derail your analysis entirely:
- The binary variables you describe are based on somewhat arbitrary distinctions (for instance, would astroturf, essentially a rug laid over concrete, really qualify as "soft" ground?).
- There's a good chance your model (as described in the OP) suffers from multicollinearity (that is, one of your independent variables is highly correlated with a linear combination of the others); see the toy sketch below.
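Here is a toy illustration of the second point (entirely made-up numbers): when one dummy is an exact combination of others, lm() flags the aliased term with an NA coefficient.

grass     <- c(1, 0, 0, 1, 0)
astroturf <- c(0, 1, 0, 0, 1)
soft      <- grass + astroturf           # "soft" exactly overlaps the other two
y <- c(3, 2, 1, 3, 2)
coef(lm(y ~ grass + astroturf + soft))   # the redundant "soft" term comes back NA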
Just something you should keep in mind the next time you run a regression... Anyway, hope this helps.
Best Answer
Don't mistake "discrete" for "categorical". To me, the latter definitely implies a lack of order, while your values certainly have order (a test score of $40$ is higher than a test score of $30$).
Consequently, you might find yourself interested in an ordinal regression model, such as the rms::orm function written by Frank Harrell, who commented on your post. I don't know what the equivalent would be in Python, but it must exist.
If you want to stick with a simple model and use OLS, don't let Python force you to treat the outcome as categorical. Depending on the package you use, such as statsmodels or sklearn, there is no issue with this kind of data. The OLS implementations in those two packages (statsmodels.api.OLS and sklearn.linear_model.LinearRegression) are perfectly consistent with the usual OLS estimator, $ \hat\beta = (X^TX)^{-1}X^Ty $. One caveat: several other sklearn estimators, such as its logistic regression, apply regularization by default, though newer versions of sklearn allow regularization to be turned off.
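For completeness, a minimal sketch of the ordinal route in R (the data here are hypothetical; rms is Frank Harrell's package):

library(rms)
d <- data.frame(score = c(30, 40, 40, 55, 60, 70, 75, 85, 90, 95),
                hours = c(1, 1, 2, 3, 3, 4, 5, 5, 6, 7))
fit <- orm(score ~ hours, data = d)   # proportional-odds fit; no binning of scores needed
fit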