Solved – What does it mean when two categorical variables are independent

categorical dataterminology

My textbook gives the definition of two independent categorical variables to be:

Two categorical variables are said to be independent if the conditional distributions on one variable are identical for every category of the other variable. Otherwise, the variables have an association, or relationship.

I am trying to grasp the concept of an independent categorical variable with the following example:

Let's suppose that I gather a nationally representative sample of adult Americans, and I ask: "In general, how do you feel about things – would you say you feel optimistic, neutral, or pessimistic most of the time?"

I then compare the responses of men and women on the question.

Now, I have two categorical variables – gender and the response. In order for gender and response to be independent, then would the percent of women who answered "optimistic" have to equal the percent of men who answered "optimistic", the percent of women who answered "neutral" have to equal the percent of men who answered "neutral", and so on?

Any help is greatly appreciated.

Best Answer

You interpret this correct. Yes, in your example for there to be no gender difference (for gender and survey responding to be independent from one another) the % of different responses should be roughly the same for men and women.

Related Solutions

Multiple Regression – Interpreting Betas When There Are Multiple Categorical Variables

You are right about the interpretation of the betas when there is a single categorical variable with $k$ levels. If there were multiple categorical variables (and there were no interaction term), the intercept ($\hat\beta_0$) is the mean of the group that constitutes the reference level for both (all) categorical variables. Using your example scenario, consider the case where there is no interaction, then the betas are:

$\hat\beta_0$: the mean of white males
$\hat\beta_{\rm Female}$: the difference between the mean of females and the mean of males
$\hat\beta_{\rm Black}$: the difference between the mean of blacks and the mean of whites

We can also think of this in terms of how to calculate the various group means:
\begin{align} &\bar x_{\rm White\ Males}& &= \hat\beta_0 \\ &\bar x_{\rm White\ Females}& &= \hat\beta_0 + \hat\beta_{\rm Female} \\ &\bar x_{\rm Black\ Males}& &= \hat\beta_0 + \hat\beta_{\rm Black} \\ &\bar x_{\rm Black\ Females}& &= \hat\beta_0 + \hat\beta_{\rm Female} + \hat\beta_{\rm Black} \end{align}

If you had an interaction term, it would be added at the end of the equation for black females. (The interpretation of such an interaction term is quite convoluted, but I walk through it here: Interpretation of interaction term.)

Update: To clarify my points, let's consider a canned example, coded in R.

d = data.frame(Sex  =factor(rep(c("Male","Female"),times=2), levels=c("Male","Female")),
               Race =factor(rep(c("White","Black"),each=2),  levels=c("White","Black")),
               y    =c(1, 3, 5, 7))
d
#      Sex  Race y
# 1   Male White 1
# 2 Female White 3
# 3   Male Black 5
# 4 Female Black 7

enter image description here

The means of y for these categorical variables are:

aggregate(y~Sex,  d, mean)
#      Sex y
# 1   Male 3
# 2 Female 5
## i.e., the difference is 2
aggregate(y~Race, d, mean)
#    Race y
# 1 White 2
# 2 Black 6
## i.e., the difference is 4

We can compare the differences between these means to the coefficients from a fitted model:

summary(lm(y~Sex+Race, d))
# ...
# Coefficients:
#             Estimate Std. Error  t value Pr(>|t|)    
# (Intercept)        1   3.85e-16 2.60e+15  2.4e-16 ***
# SexFemale          2   4.44e-16 4.50e+15  < 2e-16 ***
# RaceBlack          4   4.44e-16 9.01e+15  < 2e-16 ***
# ...
# Warning message:
#   In summary.lm(lm(y ~ Sex + Race, d)) :
#   essentially perfect fit: summary may be unreliable

The thing to recognize about this situation is that, without an interaction term, we are assuming parallel lines. Thus, the Estimate for the (Intercept) is the mean of white males. The Estimate for SexFemale is the difference between the mean of females and the mean of males. The Estimate for RaceBlack is the difference between the mean of blacks and the mean of whites. Again, because a model without an interaction term assumes that the effects are strictly additive (the lines are strictly parallel), the mean of black females is then the mean of white males plus the difference between the mean of females and the mean of males plus the difference between the mean of blacks and the mean of whites.

Solved – How to set up ANCOVA with two categorical variables

ANCOVA is just a limited regression model. So, you can simply run a regression model, which allows for continuous and categorical variables. In R, your analysis might look like this:

fit <- lm(Mass ~ day + location + day:location + sex, data = goose.data)
summary(fit)

Best Answer

Related Solutions

Multiple Regression – Interpreting Betas When There Are Multiple Categorical Variables

Solved – How to set up ANCOVA with two categorical variables

Related Question