Some immediate responses:

1) Your lecturer means that the data show autocorrelation. This leads to inefficient estimates of regression coefficients in simple linear regression. Depending on whether it was covered in your course, that's a mistake.

2) Maybe I do not understand the problem fully, but IMAO the chi-squared test of independence is used correctly here, except for two other issues:

3) Your chi-square test has an immense power, because of the sample size. It's hard not be significant even if effects were very small. Furthermore, it appears you have a census of the population. In this situation statistical inference is unnecessary, because you obseve all population units. But that's not what the lecturer remarks.

4) You seem to aggregate the data across time points. You should actually test once per time point, since otherwise you aggregate effects over time (you count units multiple times). But that's also not what the lecturer remarks.

The lecturer actually remarks that you want to test the null of homogeneity, where you tests the null of independence. So what does he mean by homogeneity?

I suppose he refers to the test of **marginal homogeneity** in paired test data. This test is used to assess whether there was a change across time (repeated measures). This is however not what you want to assess in the first place. My guess is that he did not understand you want to test whether gender and employment at time point *x* are related. Maybe he also tried to suggest that what you *should* test is change across time (or no change, in which case the multiple repeated contingency would be called homogenous indeed).

## Context of my answer

I self-studied this question yesterday (the part concerning the possibility to use mixed models here). I shamelessly dump my fresh new understanding on this approach for 2x2 tables and wait for more advanced peers to correct my imprecisions or misunderstandings. My answer will be then lengthy and overly didactic (at least trying to be didactic) in order to help but also expose my own flaws. First of all, I must say that I shared your confusion that you stated here.

I've read about multi-level models, which sound like they are intended the handle this situation when the underlying variables are continuous (e.g., real numbers) and when a linear model is appropriate

I studied all the examples from this paper random-effects modelling of categorical response data. The title itself contradicts this thought. For our problem with 2x2 tables with repeated measurement, the example in section 3.6 is germane to our discussion. This is for reference only as my goal is to explain it. I may edit out this section in the future if this context is not necessary anymore.

## The model

**General Idea**

The first thing to understand is that the random effect is modelled not in a very different way as in regression over continuous variable. Indeed a regression over a categorical variable is nothing else than a linear regression over the logit (or another link function like probit) of the probability associated with the different levels of this categorical variable. If $\pi_i$ is the probability to answer yes at the question $i$, then $logit(\pi_{i})= FixedEffects_i + RandomEffect_i$. **This model is linear** and random effects can be expressed in a classical numerical way like for example $$RandomEffect_i\sim N(0,\sigma)$$ In this problem, the random effect represents the subject-related variation for the same answer.

**Our case**

For our problem, we want to model
$\pi_{ijv}$ the probability of the subject to answer "yes" for the variable v at interview time j. The logit of this variable is modeled as a combination of fixed effects and subject-related random effects.

$$logit(\pi_{ijv})=\beta_{jv}+u_{iv}$$

## About the fixed effects

The fixed effects are then related to the probability to answer "yes" at time j at the question v. According to your scientific goal you can test with a likelihood ratio to test if the equality of certain fixed effects must be rejected. For example, the model where $\beta_{1v}=\beta_{2v}=\beta_{3v}...$ means that there is no change tendency in the answer from time 1 to time 2. If you assume that this global tendency does not exist, which seems to be the case for your study, you can drop the $i$ straightaway in your model $\beta_{jv}$ becomes $\beta_{v}$. By analogy, you can test by a likelihood ratio if the equality $\beta_{1}=\beta_{2}$ must be rejected.

## About random effects

I know it's possible to model random effects by something else than normal errors but I prefer to answer on the basis of normal random effects for the sake of simplicity.
The random effects can be modelled in different ways. With the notations $u_{ij}$ I assumed that a random effect is drawn from its distribution each time a subject answer a question.This is the most specific degree of variation possible. If I used $u_{i}$ instead, it would have mean that a random effect is drawn for each subject $i$ and is the same for each question $v$ he has to answer (some subjects would then have a tendency to answer yes more often). You have to make a choice. If I understood well, you can also have **both random effects** $u_{i}\sim N(0,\sigma_1)$ which is subject-drawn and $u_{ij}\sim N(0,\sigma_2)$ which is subject+answer-drawn. I think that your choice depends of the details of your case. But If I understood well, the risk of overfitting by adding random effects is not big, so when one have a doubt, we can include many levels.

## A proposition

I realize how weird my answer is, this is just an embarrassing rambling certainly more helpful to me than to others. Maybe I ll edit out 90% of it.
I am not more confident, but more disposed to get to the point.
I would suggest to compare the model with nested random effects ($u_{i}+u_{iv}$) versus the model with only the combinated random effect ($u_{iv}$). The idea is that the $u_i$ term is the sole responsible for the dependency between answers. Rejecting independence is rejecting the presence of $u_{i}$. Using glmer to test this would give something like :

```
model1<-glmer(yes ~ Question + (1 | Subject/Question ), data = df, family = binomial)
model2<-glmer(yes ~ Question + (1 | Subject:Question ), data = df, family = binomial)
anova(model1,model2)
```

Question is a dummy variable indicating if the question 1 or 2 is asked.
If I understood well, `(1 | Subject/Question )`

is related to the nested structure $u_{i}+u_{iv}$ and `(1 |Subject:Question)`

is just the combination $u_{iv}$. `anova`

computes a likelihood ratio test between the two models.

## Best Answer

I have a suspicion that 'exact' tests and sampling weights are essentially incompatible concepts. I checked in Stata, which has good facilities for sample surveys and reasonable ones for exact tests, and its 8 possible test statistics for a crosstab with sample weights don't include any 'exact' tests such as Fisher's.

The relevant Stata manual entry (for

svy: tabulate twoway) advises using its default test in all cases. This default method is based on the usual Pearson's chi-squared statistic. To quote:"To account for the survey design, the statistic is turned into an F statistic with noninteger degrees of freedom by using a second-order Rao and Scott (1981, 1984) correction".

Refs: