Solved – Standard error for aggregated proportions

aggregation, binomial distribution, confidence interval, standard error

In order to obtain confidence intervals for proportions, I'm trying to calculate the standard error, but I'm having difficulty working out what N should be in a case like mine.

My data is such that each observation is the number of occurrences of a particular outcome in a number of trials. Here is a vastly simplified representation of my data:

Participant  Item  Condition   X   N
          1     1          A  10  50
          1     2          A  15  50
          1     1          B   5  50
          1     2          B  20  50
          2     1          A  15  50
          2     2          A  30  50
          2     1          B   5  50
          2     2          B  25  50

Here, N is the number of trials, X is the number of occurrences of a specific outcome, Condition is a fixed-effect variable, and Participant and Item are two random-effect variables.
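For concreteness, here is the toy data set as a small pandas data frame (a sketch; the column names simply follow the table above):

```python
import pandas as pd

# Toy data from the table above: X occurrences out of N trials
# per participant x item x condition cell.
df = pd.DataFrame({
    "participant": [1, 1, 1, 1, 2, 2, 2, 2],
    "item":        [1, 2, 1, 2, 1, 2, 1, 2],
    "condition":   ["A", "A", "B", "B", "A", "A", "B", "B"],
    "X":           [10, 15, 5, 20, 15, 30, 5, 25],
    "N":           [50] * 8,
})
df["prop"] = df["X"] / df["N"]  # observed proportion per row
```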

To get the mean proportion for each condition, the common approach in my field would be to take the mean across participants of each participant's mean proportion, e.g. the mean of participant 1's mean for condition A and participant 2's mean for condition A.
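A sketch of that two-step averaging (first within participant, then across participants), assuming the `df` defined above:

```python
# Step 1: each participant's mean proportion per condition
# (averaging over items).
participant_means = (
    df.groupby(["participant", "condition"])["prop"]
      .mean()
      .reset_index()
)

# Step 2: condition mean = mean of the participant means.
condition_means = participant_means.groupby("condition")["prop"].mean()
print(condition_means)
# A    0.350   (mean of 0.25 and 0.45)
# B    0.275   (mean of 0.25 and 0.30)
```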

What I'm not sure of is what N should be when I calculate the standard error of that proportion using $\sqrt{\frac{p(1-p)}{N}}$. This isn't discussed much in the field, but what little consensus I can find suggests that N should be the number of participants (in the above case, 2), since that was the level to which the data were aggregated in order to calculate the mean. When I do this for any real data, the confidence intervals are so large that the experienced researchers I've spoken to doubt they're accurate (the real data typically come from around 25 participants).

An alternative suggestion has been to use the total number of observations being aggregated over (in the above case, 4). This seems more plausible, and makes sense when each observation is binomial. However, I'm still a little worried about whether the number of trials that make up each observation ought to be factored in somehow. If each trial is treated as a single observation, you end up with a very large N (200 in this example, 30k+ in the real data), which leads to implausibly small confidence intervals.
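To make the disagreement concrete, here is a sketch computing $\sqrt{\frac{p(1-p)}{N}}$ for condition A under each of the three candidate choices of N (2 participants, 4 aggregated observations, 200 individual trials), using the toy numbers from above:

```python
import numpy as np

p = 0.35  # condition A mean from the toy data above

for label, n in [("participants", 2),
                 ("aggregated observations", 4),
                 ("individual trials", 200)]:
    se = np.sqrt(p * (1 - p) / n)
    print(f"N = {n:3d} ({label}): SE = {se:.3f}, "
          f"95% CI half-width = {1.96 * se:.3f}")
```

With N = 2 the half-width is about 0.66 (wider than the proportion scale is useful for), while with N = 200 it shrinks to about 0.07, which is the implausibly narrow end of the spectrum.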

Best Answer

This is one of the most typical and central questions in statistics. For descriptive tables and plots (involving Standard Errors/Deviations, Confidence Intervals, etc.), the data ought to be aggregated to the level from which you want to generalize. That level is, in this case and very often, the participant. As hinted by the previous contributors, trials do not normally serve for statistical generalization (they are good for experimental validity). This realization can come as a bummer if you have first seen the effect sizes in the unaggregated data; that mirage is caused by an inflated N.

Most descriptive and test statistics depend on the sample size, which sits in the denominator. The larger the sample (N), the smaller the Standard Error, Confidence Interval, and so on; that is, the smaller the apparent variation or noise. A large N is great when it indeed corresponds to the level from which we want to generalize statistically (in many fields, participants). However, when the N in the analyses is partly made up of repeated observations that have not been duly averaged, the famous assumption of independence of observations is violated, and any effects found may be bogus. Thus, Vasishth and Nicenboim (2016) warn:

'if we were to do a t-test on the unaggregated data, we would violate the independence assumption and the result of the t-test would be invalid' (p. 3).
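As an illustration of that warning (a sketch on the toy data, reusing the `participant_means` frame from earlier): a paired t-test comparing the conditions should be run on one value per participant per condition, not on the raw rows:

```python
from scipy import stats

# Correct unit of analysis: one mean proportion per participant
# per condition, so observations are independent across participants.
a = participant_means.query("condition == 'A'")["prop"].to_numpy()
b = participant_means.query("condition == 'B'")["prop"].to_numpy()
t, pval = stats.ttest_rel(a, b)  # paired by participant

# Running the same test on the unaggregated rows (or on raw trials)
# would treat repeated measurements from the same participant as
# independent, inflating N and invalidating the result.
```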

Incidentally, it could be worse yet. McCarthy, Whittaker, Boyle, and Eyal (2017) note:

'It has also been proposed that researchers aggregate the responses of participants within the same group and use the groups/clusters as the unit of analysis (Stevens, 2007). However, because this would result in losing sample size at the participant level, this approach is not optimal given the already small numbers of groups typically studied in group work research' (p. 10).

Of course, in most fields, that extreme aggregation would leave us with only a couple of observations, so it's out of the question.

Participant-level aggregation is unnecessary only if all repeated measurements, such as trials, are reliably factored into the analysis, as is often done in Linear Mixed Effects models.
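As a final sketch of that last option: a mixed model fitted to the row-level data keeps the trial information in via the per-row proportions, while a random intercept absorbs the dependence within participants. Note the assumptions here: statsmodels' mixedlm is a linear approximation (a binomial mixed model, e.g. glmer in R's lme4 with a cbind(X, N - X) response, would respect the binomial structure directly), and only the participant grouping is modelled, so the item random effect is omitted.

```python
import statsmodels.formula.api as smf

# Linear mixed model on row-level proportions with a random
# intercept per participant. Meaningful on real data; the
# eight-row toy set is far too small for estimation.
model = smf.mixedlm("prop ~ condition", df, groups=df["participant"])
result = model.fit()
print(result.summary())
```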