Solved – Goodness of fit (chi-squared) for non-exclusive categories

Tags: chi-squared-test, contingency tables, count-data, nonparametric, statistical significance

I'm analyzing some data that is sorted into categories, and I would like to compare my results to a hypothetical distribution over those categories. I feel like the chi-squared test for goodness of fit is what I am looking for; however, my categories are not mutually exclusive.

To offer an example:

Let's say I want to see if a certain population has a pronounced preference for a certain type of music. So I give my participants a list of 50 songs and ask them to pick their 15 favourite songs.

Each song fits within (say 5) given categories e.g. "Thunderstruck" by AC/DC counts as Rock, "Ride of the Valkyries" by Wagner is Classical, etc.

At the end we have a frequency table showing the total number of rock songs chosen, the total number of classical songs, etc.

Category     Frequency
Rock         340
Classical    121
Country      206
Jazz         64
Folk         226

I would like to compare these results to the frequencies I would expect if the choices were essentially random, to see if there is a distinct preference in the population.

Of course, the problem is that people can pick rock songs and classical songs and so the music genre categories are not mutually exclusive. I believe this means I cannot use the Chi-squared test, so I am unsure what approach to take.

So far I can only think of 2 options:

  1. Test each category separately using a one-proportion z-test.
  2. Use some sort of bootstrap method, comparing the observed frequencies to a simulated distribution built from a large number of randomly chosen songs.

Can anyone suggest an alternative?

Thanks

  • Update 26/4/2018

Based on the comments I should mention that, in this example, there is an uneven number of songs in each category, with more rock songs than other types.

Another way of stating my research question would be: if the count of rock songs chosen is the highest, is there a genuine preference for rock music in the population, or were more rock songs chosen simply because there were more rock songs to choose from?
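For reference, the counts expected under purely random choice can be written down directly. This is only a sketch, assuming N participants each pick 15 of the 50 songs uniformly at random without replacement, and that genre g contains m_g songs (N and m_g are symbols introduced here for illustration, not values given above). Writing X_g for the total number of picks that land in genre g:

```latex
\mathbb{E}[X_g] = N \cdot 15 \cdot \frac{m_g}{50},
\qquad
\operatorname{Var}(X_g) = N \cdot 15 \cdot \frac{m_g}{50}
  \left(1 - \frac{m_g}{50}\right) \frac{50 - 15}{50 - 1}
```

Each participant's count for a genre is hypergeometric under this assumption, and the totals add over independent participants. The factor 35/49 makes the variance smaller than a binomial model would give, so under random choice a z-test that treats all picks as independent draws would tend to be conservative.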

  • Update 30/4/2018

I've been working on this, and my current approach is to treat each musical genre separately and handle them using simulations.

Essentially I have simulated 500 participants randomly choosing 15 songs from the list. I then count how many songs fall into each category. I repeat this process for 5000-10,000 iterations to build a sampling distribution for the frequency of each category.

If my observed count for a given genre falls towards the edges of the sampling distribution, say above the 95th percentile, I will take it as indicating a significant preference for that genre.
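The simulation described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the number of songs per genre (`SONGS_PER_GENRE`) is a made-up split of the 50 songs, and the function names `simulate_null` and `upper_tail_p` are hypothetical. It relies on the fact that, when a participant picks 15 distinct songs at random, their per-genre counts follow a multivariate hypergeometric distribution.

```python
import numpy as np

# ASSUMPTION: hypothetical number of songs per genre on the 50-song list
# (Rock, Classical, Country, Jazz, Folk); the real split is not given above.
SONGS_PER_GENRE = [16, 8, 10, 6, 10]


def simulate_null(songs_per_genre, n_participants, n_picks, n_iter, rng):
    """Simulate per-genre totals when every participant picks n_picks
    distinct songs uniformly at random. Returns an (n_iter, n_genres) array."""
    totals = np.empty((n_iter, len(songs_per_genre)), dtype=np.int64)
    for i in range(n_iter):
        # One row per participant: how many of their picks fall in each genre.
        per_person = rng.multivariate_hypergeometric(
            songs_per_genre, n_picks, size=n_participants
        )
        totals[i] = per_person.sum(axis=0)
    return totals


def upper_tail_p(simulated, observed):
    """One-sided Monte Carlo p-value: share of simulated totals that are
    at least as large as the observed total (with the usual +1 correction)."""
    simulated = np.asarray(simulated)
    return float((1 + np.sum(simulated >= observed)) / (1 + simulated.size))


rng = np.random.default_rng(2018)
null_totals = simulate_null(
    SONGS_PER_GENRE, n_participants=500, n_picks=15, n_iter=5000, rng=rng
)

# 95th percentile of the simulated totals for each genre.
cutoffs = np.percentile(null_totals, 95, axis=0)
for genre, cutoff in zip(["Rock", "Classical", "Country", "Jazz", "Folk"], cutoffs):
    print(f"{genre}: 95th percentile under random choice = {cutoff:.0f}")
```

An observed total for, say, rock would then be passed as `upper_tail_p(null_totals[:, 0], observed_rock_total)`. Note that checking five genres, each against its own 95th percentile, raises the chance of at least one false positive, so some multiple-comparison adjustment may be worth considering.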

Could anyone offer some feedback as to whether this approach makes sense?

I was also hoping for a sanity check regarding the next question for this data, which involves comparing musical tastes in different populations.

Let's say I record the gender of each participant and I want to check whether men and women have different musical preferences. I believe that I can use a permutation test, i.e. randomly shuffling the gender labels and recounting the proportions for men and women to get a sampling distribution. Does that make sense?

Best Answer

You say in a comment that

The songs themselves can not be both jazz and folk, but each participant can choose both a jazz song and a folk song from the list. Therefore, in the crosstab, each participant can contribute to more than one cell.

The real problem here is what you state in the second sentence above: each participant can contribute to more than one cell. The "problem" mentioned in the title, non-exclusive categories, is a non-problem here: your categories are exclusive. So please update/correct your title and question!

If your interest is in comparing musical preferences between men and women, you could present your data as a two-way contingency table, and the usual chi-squared statistic would give a useful description of the (lack of) homogeneity. The question is whether it has the usual chi-squared distribution. That could be investigated using simulation, or you could try a permutation test, permuting the male/female labels (so that the fifteen counts pertaining to the same person are permuted together).
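A permutation test along these lines might look like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the data layout (one row of per-genre counts per participant plus a boolean `is_male` label), the function names, and the demo data are all hypothetical. The key point is that whole participants are relabelled, so each person's fifteen picks stay together.

```python
import numpy as np


def chi2_stat(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def gender_by_genre(counts, is_male):
    """2 x n_genres table: genre totals for women (row 0) and men (row 1)."""
    return np.vstack([counts[~is_male].sum(axis=0), counts[is_male].sum(axis=0)])


def permutation_test(counts, is_male, n_perm, rng):
    """Shuffle gender labels across participants and compare the observed
    chi-squared statistic with its permutation distribution."""
    observed = chi2_stat(gender_by_genre(counts, is_male))
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(is_male)  # relabels whole participants
        if chi2_stat(gender_by_genre(counts, shuffled)) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


# Hypothetical demo data: 200 participants choosing at random, so no real
# gender difference is built in. The genre sizes are an assumption.
rng = np.random.default_rng(0)
counts = rng.multivariate_hypergeometric([16, 8, 10, 6, 10], 15, size=200)
is_male = np.arange(200) < 100
stat, p_value = permutation_test(counts, is_male, n_perm=2000, rng=rng)
print(f"chi-squared statistic = {stat:.2f}, permutation p-value = {p_value:.3f}")
```

The statistic is used only as a measure of discrepancy; its reference distribution comes from the shuffles, not from the chi-squared table, which sidesteps the dependence among picks made by the same person.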

Another approach is multinomial logistic regression. It would be very interesting if you could answer your own question now, comparing different approaches!
