Solved – How to calculate the coefficient of a dummy variable reference category

categorical data, categorical-encoding, multiple regression, regression, regression coefficients

I am currently building a regression model with numerous continuous, categorical (employing dummies) and interaction variables. I understand we must use k-1 dummies with one variable becoming the reference category and then the impact of the other dummies can be reported on relative to this reference category.

I would, however, like to identify the coefficient for the reference category specifically. In this instance the reference dummy will be the UK, with the dependent variable being fund returns. I have 6 country dummies, and at present I can only say (for example) that German funds underperformed the UK by 3%. Could any of you please advise me on how to obtain the coefficient value for a reference dummy variable, so that I may be able to say UK funds performed x% throughout the period?

Best Answer

If the response variable is in the units in which you are interested (annual performance percentage), then what you really want to do is to take advantage of the intercept in your regression model. As one comment notes, the regression coefficient for UK per se is 0, by construction, if that is the reference category.

With the treatment contrasts that you seem to be using to do your analysis (comparing all levels of each categorical predictor against a reference category), that intercept will be the value of the response variable when all predictor categories are at their reference level and all continuous predictors are at 0. In particular, it will represent that situation specifically for UK funds. (The 0 coefficient for UK means you add 0 to the intercept to get the value for UK.)
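As a minimal sketch of this point, the base R code below fits a model on simulated data and checks that the intercept equals the fitted value for a UK fund with the continuous predictor at 0. The data, the variable names (`ret`, `country`, `size`) and all effect sizes are invented for illustration; they are not the asker's data.

```r
# Simulated data: every name and number here is an assumption for illustration.
set.seed(1)
n <- 300
country <- factor(sample(c("UK", "Germany", "France"), n, replace = TRUE),
                  levels = c("UK", "Germany", "France"))  # first level = reference
size <- rnorm(n)                                   # continuous predictor
effect <- c(UK = 0, Germany = -3, France = 1)      # made-up country effects
ret <- unname(5 + effect[as.character(country)] + 0.5 * size + rnorm(n))
funds <- data.frame(ret, country, size)

# Treatment contrasts are R's default for unordered factors.
fit <- lm(ret ~ country + size, data = funds)

# There is no "countryUK" coefficient: UK is absorbed into the intercept.
coef_names <- names(coef(fit))

# The intercept is the fitted return for a UK fund with size = 0.
intercept <- unname(coef(fit)["(Intercept)"])
uk_at_zero <- unname(predict(fit, newdata = data.frame(
  country = factor("UK", levels = levels(funds$country)), size = 0)))
```

Here `intercept` and `uk_at_zero` agree, and `coef_names` contains `countryGermany` and `countryFrance` but nothing for UK.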

You can then use the regression coefficients to add in the contributions from all the other predictors to get the response value for UK under other combinations of predictor values. For error estimates you incorporate information from the covariance matrix of the regression coefficients, using the formula for the variance of a sum of correlated variables.
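The calculation described above can be sketched as follows. For a weight vector $a$ applied to the coefficients, the estimate is $a^\top \hat\beta$ and its variance is $a^\top V a$, where $V$ is the covariance matrix of the coefficients; this is the variance-of-a-sum formula written in matrix form. The simulated data and the choice of `size = 2` are assumptions made only for the example.

```r
# Simulated data: names and numbers are assumptions for illustration.
set.seed(2)
n <- 300
country <- factor(sample(c("UK", "Germany"), n, replace = TRUE),
                  levels = c("UK", "Germany"))     # UK is the reference level
size <- rnorm(n, mean = 2)
ret <- 5 - 3 * (country == "Germany") + 0.5 * size + rnorm(n)
funds <- data.frame(ret, country, size)

fit <- lm(ret ~ country + size, data = funds)

# Weights on (Intercept), countryGermany, size for a UK fund with size = 2.
a <- c(1, 0, 2)
est <- sum(a * coef(fit))                     # a' beta
se  <- sqrt(drop(t(a) %*% vcov(fit) %*% a))   # sqrt(a' V a)

# Cross-check against predict(), which does the same computation internally.
p <- predict(fit,
             newdata = data.frame(
               country = factor("UK", levels = levels(funds$country)),
               size = 2),
             se.fit = TRUE)
```

The hand-computed `est` and `se` match `p$fit` and `p$se.fit`, so in practice `predict(..., se.fit = TRUE)` is usually the more convenient route.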

This assumes, however, that there is no interaction term involving your categorical variable country. If there is, then your interpretation of the 3% coefficient for Germany is incomplete: it represents the difference between Germany and UK only at the reference values of all other categorical variables and at 0 values of all continuous variables. You must also add in the contributions of all the interaction terms to compare Germany and UK in any other scenario.
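A short sketch of the interaction case, again on simulated data with invented names: when `country` interacts with a continuous predictor `size`, the Germany-versus-UK gap at `size = s` is the Germany main effect plus `s` times the interaction coefficient, so the main effect alone describes the gap only at `size = 0`.

```r
# Simulated data: names and numbers are assumptions for illustration.
set.seed(3)
n <- 400
country <- factor(sample(c("UK", "Germany"), n, replace = TRUE),
                  levels = c("UK", "Germany"))
size <- rnorm(n)
g <- as.numeric(country == "Germany")
ret <- 5 - 3 * g + 0.5 * size + 1.5 * g * size + rnorm(n)
funds <- data.frame(ret, country, size)

fit <- lm(ret ~ country * size, data = funds)   # main effects + interaction
b <- coef(fit)

# Germany minus UK at a given value s of the continuous predictor.
gap_at <- function(s) {
  unname(b["countryGermany"] + s * b["countryGermany:size"])
}

# Cross-check at size = 1 using fitted values for the two countries.
nd <- data.frame(
  country = factor(c("Germany", "UK"), levels = levels(funds$country)),
  size = 1)
pr <- predict(fit, newdata = nd)
gap_pred <- unname(pr[1] - pr[2])
```

`gap_at(0)` reproduces the `countryGermany` coefficient, while `gap_at(1)` equals the difference in fitted values at `size = 1`.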

Question:

Can Dummy variables have overlapping categories?

Answer:

No.

Explanation:

Dummy variables arise when you recode a categorical variable with more than two categories into a series of binary variables. Since these categories partition your dataset (i.e. each observation can be assigned to one and only one of these 'k' categories), there is no way that there can be any "overlapping".
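To illustrate the partition, here is a small sketch using base R's `model.matrix`. The factor `ground` and its levels are hypothetical, chosen only to echo the example under discussion.

```r
# Hypothetical categorical variable; levels sort alphabetically: firm, hard, soft.
ground <- factor(c("soft", "firm", "hard", "firm", "soft"))

# One indicator column per level (k columns, no intercept).
X_full <- model.matrix(~ ground - 1)

# Usual regression coding: intercept plus k - 1 dummies ("firm" is the reference).
X_ref <- model.matrix(~ ground)

# Each observation falls in exactly one category, so every row sums to 1.
row_totals <- rowSums(X_full)

# With reference coding, at most one dummy is switched on per row.
dummies_on <- rowSums(X_ref[, -1, drop = FALSE])
```

Because `row_totals` is 1 for every observation, no observation can belong to two categories at once.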

Now, with respect to the actual example you provide, there are two issues you should be aware of since they probably would otherwise screw up your analysis entirely:

The binary variables which you describe are based, more or less, on arbitrary distinctions (for instance, would astroturf--more or less a rug covering concrete--really qualify as "soft" ground?).
There's a good chance your model (as described in the OP) suffers from multicollinearity (that is, two or more of your independent variables, or linear combinations of them, are highly correlated).
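One extreme form of the multicollinearity problem can be sketched in base R: if an intercept is combined with an indicator for every category, the indicator columns sum to the intercept column and the design matrix loses rank. The factor `grp` and response `y` below are made up for the example; this is not a reconstruction of the OP's model.

```r
# Hypothetical three-level factor, four observations per level.
grp <- factor(rep(c("a", "b", "c"), times = 4))

# Intercept column plus ALL k indicator columns (one too many).
X <- cbind(1, model.matrix(~ grp - 1))

n_cols <- ncol(X)      # 4 columns ...
rank_X <- qr(X)$rank   # ... but only 3 are linearly independent

# lm() detects the redundancy and reports NA for the aliased coefficient.
set.seed(4)
y <- rnorm(12)
fit <- lm(y ~ X - 1)
has_aliased <- anyNA(coef(fit))
```

Comparing `qr(X)$rank` with `ncol(X)` is a quick check for exact collinearity; near-collinearity among overlapping binary variables needs other diagnostics.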

Just something you should keep in mind the next time you run a regression... Anyway, hope this helps.

Solved – Interpretation of logistic regression intercept with one dummy coded categorical variable

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

require(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f)   # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation [n.impute recommended to be the percent of observations that have any variable missing].
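For readers without the rms package, the idea of a single $k-1$ d.f. chunk test that does not depend on coding can be sketched in base R with a likelihood-ratio comparison of nested `glm` fits. This is a swapped-in technique, not necessarily the same statistic that `anova(f)` prints, and it uses a linear age term rather than the spline. The simulated data and all names are assumptions for illustration.

```r
# Simulated data: names and effect sizes are assumptions for illustration.
set.seed(5)
n <- 500
race <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
age <- rnorm(n, 50, 10)
shift <- c(A = 0, B = 0.5, C = -0.5, D = 0.2)
eta <- -0.5 + 0.02 * (age - 50) + unname(shift[as.character(race)])
d <- data.frame(y = rbinom(n, 1, plogis(eta)), age, race)

full    <- glm(y ~ age + race, family = binomial, data = d)
reduced <- glm(y ~ age,        family = binomial, data = d)

# One overall test for race: k - 1 = 3 d.f. for k = 4 categories.
chunk   <- anova(reduced, full, test = "Chisq")
df_race <- chunk$Df[2]
chisq1  <- chunk$Deviance[2]

# Change the reference category and refit: the chunk statistic is unchanged.
d2     <- transform(d, race = relevel(race, ref = "C"))
full2  <- glm(y ~ age + race, family = binomial, data = d2)
chisq2 <- deviance(reduced) - deviance(full2)
```

The individual race coefficients differ between `full` and `full2`, but `chisq1` and `chisq2` agree up to numerical tolerance, which is the invariance to coding described above.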
