Solved – Formula for recommended sample size for multivariate testing

multivariate-analysis, sample-size

Context: To recommend a minimum sample size when performing multivariate testing of a web page. The sample size would vary based on the number of factors being tested (e.g. a heading and an image) and the number of variations of each factor (e.g. two different headings and three different images). The goal could be to see which combination caused the most purchases of a product.

Based on a recommended minimum sample size of 100 for a single factor with perhaps two variations, I'm trying to work out a formula that recommends a sample size with multiple factors and variations.

The formula I first came up with is shown below, where $n_{fi}$ is the number of variations for factor $i$.

$$ \text{sample size} = 100 \cdot (n_{f1}-1)^{n_{f1}-1} \cdot (n_{f2}-1)^{n_{f2}-1} \cdot \ldots $$
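
In code, this works out as follows (a rough sketch; the function name `proposed_sample_size` is just for illustration):

```python
def proposed_sample_size(levels, base=100):
    """The proposed formula: base * product of (n - 1)**(n - 1) over all factors,
    where each entry of `levels` is the number of variations of one factor."""
    size = base
    for n in levels:
        size *= (n - 1) ** (n - 1)
    return size

# Two headings and three images:
print(proposed_sample_size([2, 3]))  # 100 * 1 * 4 = 400
```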

Does this seem reasonable, or is there a simpler formula that would work as well? The intended audience is online business owners who aren't necessarily strong at maths.

Thanks for reading!

Best Answer

Commonly, the different values that a factor can attain in an experiment are called "levels". So let's say there are $k$ factors, and factor $j$ has $n_j$ levels.

There are $n_1 \cdot n_2 \cdot \ldots \cdot n_k$ possible factor combinations, i.e. possible versions of the web page that could be viewed. To answer the question of whether any one of these versions is better than any other, each has to be viewed a certain number of times, say $N$ for simplicity (you assumed $N = 100$; a pairwise comparison of two versions then uses a sample of $2N$ views). So the total sample size required (the total number of pairs of eyes you'll need across all versions) is $$ N \cdot n_1 \cdot n_2 \cdot \ldots \cdot n_k, $$ which can become pretty large, although it grows much more slowly than your formula as the number of levels per factor increases.
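
As a quick sketch of that calculation (in Python; the function name `total_sample_size` is just for illustration):

```python
from math import prod

def total_sample_size(levels, n_per_version=100):
    """Total visitors needed if every factor combination (page version)
    is shown to n_per_version visitors."""
    return n_per_version * prod(levels)

# Two headings and three images: 2 * 3 = 6 page versions
print(total_sample_size([2, 3]))        # 600
# Four factors with 2, 3, 2 and 4 levels: 48 page versions
print(total_sample_size([2, 3, 2, 4]))  # 4800
```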

The size of $N$ in turn depends on how well separated the purchase probabilities are that you want to distinguish. If all purchase probabilities are close to each other, then $N$ has to be quite large to pick out the larger probability reliably, even in a simple pairwise comparison. For example: if you use $N = 100$, one particular page design has purchase probability $p = .5$, and you test at the 5% significance level, then you'll have a better than even chance of correctly identifying another design as better only if that design has purchase probability of at least $p = .62$ or so. If that other design has $p = .55$, you won't be able to tell with $N = 100$ ... even though it means 10% more revenue. So, perhaps counterintuitively, the smaller the differences in purchase probabilities, the larger the sample size you are forced to work with.
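
If you want to reproduce numbers like these, here is a rough power calculation for a single pairwise comparison, assuming a one-sided two-sample z-test with the usual normal approximation (this is a sketch, not a library routine):

```python
from scipy.stats import norm

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test for proportions,
    with n visitors per version (normal approximation)."""
    p_bar = (p1 + p2) / 2                                 # pooled proportion under H0
    se0 = (2 * p_bar * (1 - p_bar) / n) ** 0.5            # standard error under the null
    se1 = (p1 * (1 - p1) / n + p2 * (1 - p2) / n) ** 0.5  # standard error under the alternative
    z_crit = norm.ppf(1 - alpha)                          # one-sided critical value
    return norm.cdf((abs(p2 - p1) - z_crit * se0) / se1)

# With N = 100 per version and a baseline purchase probability of 0.5:
print(round(power_two_proportions(0.50, 0.62, 100), 2))  # ~0.53 -- barely better than even
print(round(power_two_proportions(0.50, 0.55, 100), 2))  # ~0.17 -- hopeless at this N
```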

In practice, one would not use all possible level combinations for all factors, because experience shows that interactions between many factors (high-order interactions) rarely matter. For example, if you have four factors (say number of headings, number of images, number of columns, background color), then it is likely that once two have been set (say number of headings and number of images), the other two factors don't matter much any more. This can be exploited to reduce the total number of level combinations. Google "fractional factorial design".
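
As a minimal illustration of the idea (not a full design tool), here is how a classic half fraction cuts a four-factor, two-level experiment from 16 page versions down to 8 by setting the fourth factor equal to the product of the other three (defining relation $D = ABC$):

```python
from itertools import product

# Full factorial for four two-level factors (coded -1/+1): 2**4 = 16 versions.
full = list(product([-1, +1], repeat=4))

# Half fraction 2**(4-1): keep only the runs where the fourth factor
# equals the product of the first three (D = ABC).
half = [(a, b, c, a * b * c) for a, b, c in product([-1, +1], repeat=3)]

print(len(full), len(half))  # 16 8
```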