1. You could introduce a third variable (the actual image). More generally, you can put all of your independent variables (actual image, age group, sex) into a model for the predicted image (e.g. a logistic regression model) and test various hypotheses.
2. If you reject the null, it is reasonable to look at the source of the difference. For example, if you conclude that men and women differ, you can simply point to the sign of a coefficient and say 'it was significant because the women did better'.
However, beware Simpson's paradox!
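To see why Simpson's paradox is worth worrying about here, consider a minimal sketch with invented counts (not from any real study): women outperform men within every stratum, yet men appear better in the pooled data.

```python
# Hypothetical (correct, total) counts for men and women in two strata.
# All numbers are invented purely to illustrate Simpson's paradox.
strata = [
    {"men": (80, 100), "women": (9, 10)},    # stratum 1
    {"men": (2, 10),   "women": (30, 100)},  # stratum 2
]

def rate(correct, total):
    return correct / total

# Within every stratum, women have the higher success rate.
within = [(rate(*s["men"]), rate(*s["women"])) for s in strata]

# Pooled over strata, the comparison reverses.
men_pooled = rate(sum(s["men"][0] for s in strata),
                  sum(s["men"][1] for s in strata))      # 82/110
women_pooled = rate(sum(s["women"][0] for s in strata),
                    sum(s["women"][1] for s in strata))  # 39/110
```

The reversal happens because the strata differ both in size and in baseline difficulty, which is exactly why an aggregate test can mislead when a lurking variable is present.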
You simply have to ask yourself, "How do I write the null hypothesis?" Consider a $2 \times k$ contingency table of frequencies of some behavior (yes/no) among $k$ groups. Treating the first group as the referent, you have $k-1$ odds ratios ($\theta_i,\ i = 1, 2, \ldots, k-1$) that describe the association between the behavior and group membership.
Under independence, as under homogeneity, you assume that all odds ratios are 1. That is, the probability of responding "yes" is the same regardless of group assignment. If that assumption fails, at least one group is different.
$\mathcal{H}_0(\mbox{homogeneity}): \sum_{i=1}^{k-1} |\log \theta_i| = 0$

$\mathcal{H}_0(\mbox{independence}): \sum_{i=1}^{k-1} |\log \theta_i| = 0$
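To make the $\theta_i$ concrete, here is a sketch with invented "yes"/"no" counts for $k = 3$ groups; the odds ratios are taken against the first group, and either null above says that every $\log \theta_i$ equals 0 (equivalently, every $\theta_i = 1$).

```python
import math

# Invented (yes, no) counts for k = 3 groups; group "A" is the referent.
counts = {"A": (30, 70), "B": (45, 55), "C": (60, 40)}

def odds(yes, no):
    return yes / no

ref = odds(*counts["A"])

# The k - 1 odds ratios theta_i against the referent group.
theta = {g: odds(*counts[g]) / ref for g in ("B", "C")}
log_theta = {g: math.log(t) for g, t in theta.items()}

# Under either null hypothesis, every log_theta value would be 0.
```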
Either null can be tested with the Pearson chi-square test using observed/expected frequencies, which is the score test for the logistic regression model adjusting for $k-1$ indicator variables for group membership. So structurally we may say that these tests are the same.
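The observed/expected computation can be sketched in a few lines of pure Python. The counts below are invented; the table is $2 \times 3$, so $df = (2-1)(3-1) = 2$, and for $df = 2$ the chi-square survival function has the exact closed form $e^{-x/2}$, so no stats library is needed.

```python
import math

# Invented 2 x k table, k = 3 groups: rows are "yes"/"no" counts.
observed = [[30, 45, 60],   # yes
            [70, 55, 40]]   # no

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

# Expected frequency under the null: row total * column total / n.
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)  # = 2 here
# For df = 2 the chi-square survival function is exactly exp(-x/2).
p_value = math.exp(-chi2 / 2)
```

With these counts the statistic is about 18.18 on 2 degrees of freedom, so the null of equal "yes" probabilities across groups would be rejected at any conventional level.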
However, differences arise when we consider the nature of the grouping factor. In this sense, the contextual application of the test, or rather its name, is important. A group may be directly causal of an outcome, like the presence or absence of a gene or the allele patterns of a trait, in which case, when we reject the null, we conclude that the outcome depends on the grouping factor in question.
On the other hand, when we test homogeneity, we absolve ourselves of any causal assumptions. Thus, when the "group" is a sophisticated construct like race (which causes and is caused by genetic, behavioral, and socioeconomic determinants), we can draw conclusions like "racial-ethnic minorities experience housing disparities, as evidenced by heterogeneity in the neighborhood deprivation index". If someone countered such an argument by saying, "well, that's because minorities achieve lower education, earn lower income, and gain less employment", you could reply, "I didn't claim that their race caused these things, simply that if you look at one's race, you can make predictions about their living conditions."
In that way, tests of dependence are a special case of tests of homogeneity in which the possible effect of lurking factors is of interest and should be handled in a stratified analysis. Multivariate adjustment in the analogous logistic regression model achieves this, and we may still say we are conducting a test of dependence, but not necessarily of homogeneity.
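One standard way to carry out such a stratified analysis is the Mantel-Haenszel common odds ratio. The estimator below is the standard formula; the per-stratum $2 \times 2$ counts are invented for illustration.

```python
# Invented 2 x 2 tables [[a, b], [c, d]] per stratum:
# rows = exposed / unexposed, columns = outcome yes / no.
strata = [
    [[10, 20], [5, 40]],
    [[30, 30], [15, 45]],
]

# Mantel-Haenszel common odds ratio:
# sum over strata of a*d/n, divided by sum over strata of b*c/n.
num = den = 0.0
for (a, b), (c, d) in strata:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
or_mh = num / den

# Stratum-specific odds ratios, for comparison with the pooled estimate.
or_each = [(a * d) / (b * c) for (a, b), (c, d) in strata]
```

Here the stratum-specific odds ratios are 4 and 3, and the Mantel-Haenszel estimate lands between them, as a sensible weighted summary of a common association should.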
Best Answer
If you have individual-level data, you can present this as a $2\times 2$ contingency table, and then use a chi-squared test.
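A pure-Python sketch of that $2 \times 2$ chi-squared test, with invented counts; for $df = 1$ the chi-square survival function is $\operatorname{erfc}(\sqrt{x/2})$, so the whole thing fits in the standard library.

```python
import math

# Invented 2 x 2 table: rows = men / women, cols = correct / incorrect.
observed = [[40, 60],
            [55, 45]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

# Expected counts under independence: row total * column total / n.
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]

# Pearson chi-square statistic over the four cells.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# df = 1 for a 2 x 2 table; survival function via the complementary
# error function, since a chi-square(1) variable is a squared normal.
p_value = math.erfc(math.sqrt(chi2 / 2))
```

With these counts the statistic is about 4.51 and the p-value about 0.03, so at the 5% level the two proportions would be judged different.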
These are treated in many posts on this site; for some examples:
In a $2 \times 2$ contingency table, does the dependent variable go on the rows or columns?
What does it mean when odds ratio and risk ratio aren't approximately equal? $2 \times 2$ contingency table
$\chi^2$ test for a $2 \times 2$ contingency table using probabilities instead of count data