Solved – How to handle regression with unequal sample size and variance

group-differenceshypothesis testingregressionsample-size

Let's say that I have 10,000 people do a study, and in this study they have to spot a bird in a scene as quickly as possible. There are various possible birds that may be presented to each person, and each person is able to complete the task in one of two ways (will refer to them as method A and method B). Each person may repeat the study as much as they want (so one person could play this game 20 times and use method A 5 times and method B 25 times).

I am investigating how method A results in faster bird detection than method B, and this faster detection interacts with the rarity of bird species (so that given a rare bird in a scene, method A will be tremendously more efficient, and given a common bird, method A will be marginally more efficient in response time for detecting the bird).

The problem is, although my dataset is large, most people choose to adopt method B instead of method A. Of the 10,000 people, 9,000 never even use method A. And of the 1,000 remaining, most do not use method A enough times for each of these 1,000 people to have data for varying bird rarities. Data for method A is also much more variable in response times than variable B because of the lack of data and the method involved.

How do I go about comparing these groups despite unequal sample sizes and variability, and then how do I investigate a possible interaction whereby the faster response times for method A is more pronounced the rarer the bird? Can I simply use a linear regression or is this a violation? (e.g., bird rarity on x-axis, averages of method B minus method A on y-axis)?

Here is an example of some mock data. The birds are classified into 5 different rarities. For instance, person 1 repeated the study 24 times (only 5 times using method A), and responded to a very common bird using method B four times (taking 6 seconds, 9 seconds, 5 seconds, and 11 seconds to detect the bird). Person 1 also never encountered an 'uncommon' bird using method A.

enter image description here

Best Answer

I think you want to investigate mixed effects models. Concentrating for a moment on the 1000 who did use method A at least once you have a 2 by 5 set of fixed effects (method and rarity) and a random effect for study participant. You need to do it that way so you can take account of the likely correlation between observations within person. You can then add an interaction term too. The unequal numbers should not be too much of a problem although obviously you do not get such precise estimates for your effects as you would have if they had distributed their efforts more evenly. The fact that some participants did not have any observation in one or more of the ten cells does not exclude them from the analysis.

Whether you should include the 9000 who only used method B is not obvious. Clearly they are uninformative for the effects involving method A but they do give some information about the relative effect of rarity.

You will want to run some diagnostics after the modelling phase as I suspect that you outcome variable has mean related to variance so you may end up wanting to model log time not time itself.

Related Question