It would be a violation of independence to "group the data by conditions and not care that multiple data points come from one subject", so that is a no go. One approach is "to take the mean of all measurements for each condition from each subject and then compare the means". You could do it that way without violating independence, but you lose some information in the aggregation to subject-level means.
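For concreteness, here is a minimal sketch of that aggregation approach in R, assuming conditions are between subjects and a data frame df with columns Subject, Condition, and a response y (all names illustrative):

    # collapse to one mean per subject per condition
    means <- aggregate(y ~ Subject + Condition, data = df, FUN = mean)
    # compare conditions on the subject-level means
    t.test(y ~ Condition, data = means)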
On the face of it, this sounds like a mixed design, with conditions between subjects and multiple time periods measured within subjects. However, that raises the question: why did you collect data at multiple time points? Is the effect of time, or the progression of a variable over time, expected to differ between conditions? If the answer to either question is yes, then given the structure of the data, I would expect that what you are interested in is a mixed ANOVA. The mixed ANOVA will partition the subject variance out of the SSTotal "behind your back", as it were. But whether that partitioning helps your between-subjects test of conditions depends on several other factors.
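If you prefer R to SPSS, a minimal sketch of that mixed ANOVA, assuming long-format data with factors subject, condition (between) and time (within) and a response y (names illustrative):

    long_df$subject <- factor(long_df$subject)  # subject must be a factor for Error()
    long_df$time    <- factor(long_df$time)
    # between-subjects condition, within-subjects time; subject variance is partitioned out
    summary(aov(y ~ condition * time + Error(subject/time), data = long_df))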
Anyway, in SPSS/PASW 18: Analyze -> General Linear Model -> Repeated Measures. You'll have one row for each subject, one column for each time point, and one column for their condition identifier. The condition identifier goes into the "between" section, and the repeated measures are taken care of when you define the repeated-measures factor.
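If you ever want to take that same wide layout (one row per subject, columns t1, t2, t3 plus condition) into R instead, a hedged sketch with base R's reshape (column names illustrative):

    # one row per subject per time point; the condition column is carried along automatically
    long_df <- reshape(wide_df, varying = c("t1", "t2", "t3"), v.names = "y",
                       timevar = "time", idvar = "subject", direction = "long")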
Context of my answer
I self-studied this question yesterday (the part concerning the possibility of using mixed models here). I shamelessly dump my fresh understanding of this approach for 2x2 tables and wait for more advanced peers to correct my imprecisions or misunderstandings. My answer will therefore be lengthy and overly didactic (or at least trying to be didactic) in order to help, but also to expose my own flaws. First of all, I must say that I shared the confusion you stated here:
"I've read about multi-level models, which sound like they are intended to handle this situation when the underlying variables are continuous (e.g., real numbers) and when a linear model is appropriate."
I studied all the examples from the paper "Random-effects modelling of categorical response data", whose title itself contradicts this thought. For our problem of 2x2 tables with repeated measurement, the example in section 3.6 is the germane one. This is for reference only, as my goal is to explain it; I may edit out this section in the future if the context is no longer necessary.
The model
General Idea
The first thing to understand is that the random effect is not modelled very differently from the way it is in a regression on a continuous variable. Indeed, a regression on a categorical variable is nothing other than a linear regression on the logit (or another link function, like the probit) of the probability associated with the different levels of this categorical variable. If $\pi_i$ is the probability of answering yes to question $i$, then $logit(\pi_{i})= FixedEffects_i + RandomEffect_i$. This model is linear, and random effects can be expressed in the classical numerical way, for example $$RandomEffect_i\sim N(0,\sigma)$$ In this problem, the random effect represents the subject-related variation for the same answer.
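To make the link concrete, a tiny R sketch of this relationship (all numbers illustrative; plogis is base R's inverse logit):

    u   <- rnorm(100, mean = 0, sd = 1)  # RandomEffect_i ~ N(0, sigma), with sigma = 1 here
    eta <- 0.5 + u                       # fixed effect + random effect, on the linear scale
    p   <- plogis(eta)                   # inverse logit maps back to probabilities
    yes <- rbinom(100, 1, p)             # simulated binary "yes" answers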
Our case
For our problem, we want to model
$\pi_{ijv}$, the probability that subject $i$ answers "yes" to question $v$ at interview time $j$. The logit of this probability is modelled as a combination of fixed effects and subject-related random effects:
$$logit(\pi_{ijv})=\beta_{jv}+u_{iv}$$
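A minimal simulation of exactly this structure, to pin down the indices (two time points, two questions; all parameter values illustrative):

    set.seed(1)
    n_subj <- 30
    beta <- matrix(c(0.2, -0.4, 0.5, 0.1), nrow = 2)        # beta_jv: row = time j, column = question v
    u    <- matrix(rnorm(n_subj * 2, 0, 1), nrow = n_subj)  # u_iv ~ N(0, 1): row = subject i, column = v
    d <- expand.grid(i = 1:n_subj, j = 1:2, v = 1:2)        # one row per answer
    d$yes <- rbinom(nrow(d), 1, plogis(beta[cbind(d$j, d$v)] + u[cbind(d$i, d$v)]))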
About the fixed effects
The fixed effects are then related to the probability of answering "yes" at time $j$ to question $v$. Depending on your scientific goal, you can use a likelihood ratio test to decide whether the equality of certain fixed effects must be rejected. For example, the model where $\beta_{1v}=\beta_{2v}=\beta_{3v}=...$ means that there is no tendency for the answer to change from one interview time to the next. If you assume that this global tendency does not exist, which seems to be the case for your study, you can drop the $j$ straight away: $\beta_{jv}$ becomes $\beta_{v}$. By analogy, you can test with a likelihood ratio whether the equality $\beta_{1}=\beta_{2}$ must be rejected.
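As a hedged sketch of that last test with lme4's glmer (assuming a data frame df with a binary response yes, a Question factor, and a Subject identifier), fit the model with and without the question effect and compare:

    library(lme4)
    # full model: a separate beta for each question
    full <- glmer(yes ~ Question + (1 | Subject), data = df, family = binomial)
    # restricted model: beta_1 = beta_2 (a single intercept)
    restricted <- glmer(yes ~ 1 + (1 | Subject), data = df, family = binomial)
    anova(restricted, full)  # likelihood ratio test of the equality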
About random effects
I know it's possible to model random effects with something other than normal errors, but I prefer to answer on the basis of normal random effects for the sake of simplicity.
The random effects can be modelled in different ways. With the notation $u_{iv}$, I assumed that a random effect is drawn from its distribution each time a subject answers a question. This is the most specific degree of variation possible. If I had used $u_{i}$ instead, it would have meant that one random effect is drawn for each subject $i$ and is the same for each question $v$ he has to answer (some subjects would then have a tendency to answer yes more often). You have to make a choice. If I understood well, you can also have both random effects: $u_{i}\sim N(0,\sigma_1)$, which is subject-drawn, and $u_{iv}\sim N(0,\sigma_2)$, which is subject-and-answer-drawn. I think your choice depends on the details of your case. But if I understood well, the risk of overfitting by adding random effects is not big, so when in doubt, one can include several levels.
A proposition
I realize how weird my answer is; it's an embarrassing rambling, certainly more helpful to me than to others. Maybe I'll edit out 90% of it. I am not more confident, but I am more disposed to get to the point.
I would suggest comparing the model with nested random effects ($u_{i}+u_{iv}$) against the model with only the combined random effect ($u_{iv}$). The idea is that the $u_i$ term is solely responsible for the dependence between answers; rejecting independence is rejecting the presence of $u_{i}$. Using glmer to test this would give something like:
    library(lme4)
    # nested random effects u_i + u_iv: (1 | Subject/Question) expands to both terms
    model1 <- glmer(yes ~ Question + (1 | Subject/Question), data = df, family = binomial)
    # combined random effect u_iv only
    model2 <- glmer(yes ~ Question + (1 | Subject:Question), data = df, family = binomial)
    anova(model1, model2)  # likelihood ratio test between the two models
Question is a dummy variable indicating whether question 1 or question 2 is asked.
If I understood well, (1 | Subject/Question )
is related to the nested structure $u_{i}+u_{iv}$ and (1 |Subject:Question)
is just the combination $u_{iv}$. anova
computes a likelihood ratio test between the two models.
Best Answer
Yes, more data is usually better as a general rule, and given that you currently have just one answer per condition, the move to two data points per condition for each participant is a good idea. It's not just one more data point; it doubles the amount of data on which to model a response per person.
You could take the average of each participant's answers; a strict answer to whether that is acceptable depends on the variability between the answers. But why bother? Just add another factor, 'question number', with two levels, '1' and '2', for each participant, as in the sketch below. If there is no difference between the answer orders, the whole model will come out the same as taking the mean. On the other hand, if there is a systematic difference between the answers (for some reason), you can find out about that as well, essentially for free.
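A hedged sketch of that model, keeping the binary response and lme4 approach used above (Condition and QuestionNumber are illustrative column names):

    library(lme4)
    # QuestionNumber as a two-level factor; the interaction picks up any order effect
    m <- glmer(yes ~ Condition * QuestionNumber + (1 | Subject),
               data = df, family = binomial)
    summary(m)  # if the QuestionNumber terms are negligible, this behaves like averaging the two answers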
'Need' is a tricky word here, but yes, this could potentially decrease the number of participants you need. Asking more questions increases your N, and having them be within-subject comparisons is even better (usually). So by asking more questions per person you should be reducing your variance, and thus increasing the likelihood of finding a significant result.