You want all possible pairwise comparisons of the levels, but there are many more pairs than there are degrees of freedom in the factor. Say the factor has five levels; then you need 4 parameters to code it, but there are $\binom{5}{2}=10$ pairs. So it is impossible to find a coding with one parameter for each comparison.
The solution is to use whatever coding you want, estimate the model, and then compute the 10 pairwise contrasts afterwards from the model output. In R, for instance, this can be done in many ways, either "by hand" or with packages such as contrast or multcomp.
Below is an R example, done "by hand", computing confidence intervals for all pairwise comparisons:
xfac <- factor(rep(1:5, each = 10))
y <- rnorm(50, mean = c(rep(0, 20), rep(1, 30)), sd = 2)
mod <- lm(y ~ 0 + xfac)
# generating a hypothesis contrast matrix with 10 rows,
# one row per contrast:
cmat <- matrix(0, 10, 5)
nam <- character(length = 10)
row <- 0
for (i in 1:4) for (j in (i+1):5) {
    row <- row + 1
    nam[row] <- paste("x[", i, "]-x[", j, "]", sep = "")
    cmat[row, c(i, j)] <- c(1, -1)
}
rownames(cmat) <- nam
# We write a contrast testing function by hand:
my.contrast <- function(mod, cmat) {
    co <- coef(mod)
    CV <- vcov(mod)
    se <- sqrt(diag(cmat %*% CV %*% t(cmat)))
    df <- mod$df.residual
    contr <- cmat %*% co
    ul <- qt(0.975, df = df)
    ci <- cbind(contr - ul*se, contr + ul*se)
    ci
}
And then using it gives the result:
> my.contrast(mod, cmat)
[,1] [,2]
x[1]-x[2] -1.946376 1.7921298
x[1]-x[3] -3.044916 0.6935897
x[1]-x[4] -2.136283 1.6022227
x[1]-x[5] -2.301393 1.4371135
x[2]-x[3] -2.967793 0.7707130
x[2]-x[4] -2.059160 1.6793460
x[2]-x[5] -2.224269 1.5142368
x[3]-x[4] -0.960620 2.7778861
x[3]-x[5] -1.125729 2.6127769
x[4]-x[5] -2.034362 1.7041439
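For comparison, here is a sketch using the multcomp package (assuming it is installed). glht() accepts the same contrast matrix, and univariate_calpha() requests unadjusted intervals matching the ones above; the default instead adjusts for multiplicity:
library(multcomp)
gh <- glht(mod, linfct = cmat)             # same contrasts, same model
confint(gh, calpha = univariate_calpha())  # unadjusted 95% intervals
# the default confint(gh) gives simultaneous (multiplicity-adjusted) intervals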
The easiest way is to break your data down into eight groups. You can do this by writing an occupation vector $x=(v_1,v_2,\cdots,v_7)$ where $v_i=1$ if you're in category $i$ and 0 otherwise. You need to set a reference category corresponding to $v=(0,0,\cdots,0)$, say "lawyer". A simple logistic regression should then do the job: you get a linear predictor $X\beta$, which the logit link maps to a probability. This way you can see which categories have more influence than others.
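A minimal sketch in R with simulated data; the occupation labels and effect sizes are invented for illustration:
set.seed(1)
occ <- factor(sample(c("lawyer", "doctor", "teacher", "nurse",
                       "engineer", "clerk", "farmer", "artist"),
                     400, replace = TRUE))
occ <- relevel(occ, ref = "lawyer")     # "lawyer" becomes the reference category
y <- rbinom(400, 1, plogis(-0.5 + 0.2 * as.integer(occ)))  # toy outcome
fit <- glm(y ~ occ, family = binomial)  # 7 dummy coefficients plus an intercept
summary(fit)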
Best Answer
First off, have a look at this question and answer that is close to what you are asking.
If you assume that Age or MPG are valued differently in Asia and Europe, then simply adding the dummy variable into the model does not solve this. The dummy captures only the level effect, not the slope effect. You can see this because the dummy does not show up in the derivative: in the dummy-only model $Price_i=\alpha+\beta_D D_i+\beta_{Age} Age_i+u_i$, the slope $\frac{\partial Price}{\partial Age}=\beta_{Age}$ is the same in both regions. Without loss of generality, assume that there are only two groups, so $K=2$, and one explanatory variable.
The model is thus $y_i=\alpha+\beta_x X_i+u_i$, and you create a dummy variable $D$ that equals 0 for group 1 and 1 for group 2.
Essentially, you have several choices of models:
When splitting the dataset in two parts, you have one regression per group: $y_i=\alpha_1+\beta_{x,1} X_i+u_i$ for group 1 and $y_i=\alpha_2+\beta_{x,2} X_i+u_i$ for group 2.
When fully interacting your model, it looks like this:
$y_i=\alpha+\beta_D D+\beta_x X_i+\beta_{Dx} D X_i+u_i$ (4)
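In R, the fully interacted model (4) can be fit with the * formula operator. A sketch with simulated data; the variable names Price, Age, and D follow the car example above, and the coefficient values are made up:
set.seed(2)
D <- rbinom(120, 1, 0.5)       # group dummy, e.g. Europe = 1
Age <- runif(120, 0, 15)
Price <- 20 - 1.5*Age - 3*D + 0.8*D*Age + rnorm(120, sd = 2)
full <- lm(Price ~ Age * D)    # expands to Age + D + Age:D, i.e. model (4)
summary(full)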
You have the following: the fully interacted model nests both split regressions, with $\alpha_1=\alpha$ and $\beta_{x,1}=\beta_x$ for group 1, and $\alpha_2=\alpha+\beta_D$ and $\beta_{x,2}=\beta_x+\beta_{Dx}$ for group 2.
Notice that splitting and fully interacting differ in the estimation of the variance-covariance matrix of the estimates: the split regressions allow each group its own residual variance, while the pooled interacted model imposes a common one. When fully interacting, you also come across the problem that the number of regressors increases rapidly. These are issues you need to take into account, too.
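A quick check, reusing the simulated data from the sketch above: the point estimates of the split fits can be recovered exactly from the interacted fit, but the standard errors need not coincide, because each split fit estimates its own residual variance:
fit1 <- lm(Price ~ Age, subset = D == 0)   # group 1 alone
fit2 <- lm(Price ~ Age, subset = D == 1)   # group 2 alone
# point estimates agree with the interacted model (differences ~ 0):
coef(fit2) - c(coef(full)[1] + coef(full)[3], coef(full)[2] + coef(full)[4])
# residual standard deviations generally differ:
c(summary(fit2)$sigma, summary(full)$sigma)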