Solved – Statistical test for comparing two frequency distributions expressed as arrays (buckets) of values

Tags: chi-squared-test, distributions, r, statistical-significance

I am looking for an appropriate statistical test that will compare two frequency distributions, where the data is in the form of two arrays (or buckets) of values.

For example, suppose I have two distributions, where A, B, and C are observed outcomes from a software logging system (such as whether customers clicked on button A, B, or C).

HISTORICAL: 
A        B        C
122319   295701   101195

ONE MONTH:
A        B        C
1734     3925     1823

My goal is to create an automated A/B testing system. For example, we've collected this data for the last 6 months (in the HISTORICAL data set). After we roll out a new algorithm, we can collect new results (in the ONE MONTH data set). If the two distributions are "significantly" different, we'd then know to take some action.

My specific questions:

  1. What's the proper statistical test for this problem, and how could I know when these distributions differ significantly? An answer using R or Python would be appreciated.

  2. What's the minimum number of samples I'd need for both HISTORICAL and ONE MONTH for the test to be valid?

I've read several other questions related to the chi-squared and Kolmogorov-Smirnov tests but don't know where to begin.

Thank you for any help.

Best Answer

Run a chi-squared goodness-of-fit test to determine whether an observed frequency distribution (observed) differs from a desired, perhaps theoretical, distribution (expected).

Note carefully the definition of the statistic $X^2$ (the eponymous chi-squared):

$$X^2 = \sum_i \frac{(\mathrm{observed}_i - \mathrm{expected}_i)^2}{\mathrm{expected}_i}$$

Both series must sum to the same total, so one of them needs to be scaled to the other; here, expected is scaled to match the total of observed. (Regarding your second question, a common rule of thumb is that the chi-squared approximation is valid when every expected cell count is at least about 5.)
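To make the scaling concrete, here is the arithmetic for the data above. The HISTORICAL counts total 519215 and the ONE MONTH counts total 7482, so the scaled expected counts are roughly 1762.6, 4261.1, and 1458.2. The statistic then works out to

$$X^2 \approx \frac{(1734 - 1762.6)^2}{1762.6} + \frac{(3925 - 4261.1)^2}{4261.1} + \frac{(1823 - 1458.2)^2}{1458.2} \approx 0.5 + 26.5 + 91.2 \approx 118.2$$

which, compared against a chi-squared distribution with $k - 1 = 2$ degrees of freedom, gives a p-value far below 0.05, so the two distributions are judged different.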

Below is some Python code that encapsulates this test. The final decision is made by comparing the test's resulting p-value against a significance threshold (here 0.05).

#!/usr/bin/env python3
import numpy as np
import scipy.stats as stats

def ComputeChiSquareGOF(expected, observed):
    """
    Runs a chi-square goodness-of-fit test and returns the p-value.
    Inputs:
    - expected: numpy array of expected counts.
    - observed: numpy array of observed counts.
    Returns: p-value
    """
    # Scale the expected counts so they sum to the same total as the
    # observed counts; chisquare requires both arrays to have equal sums.
    expected_scaled = expected / expected.sum() * observed.sum()
    result = stats.chisquare(f_obs=observed, f_exp=expected_scaled)
    return result.pvalue

def MakeDecision(p_value):
    """
    Makes a goodness-of-fit decision on an input p-value.
    Input: p_value: the p-value from a goodness-of-fit test.
    Returns: "different" if the p-value is below 0.05, "same" otherwise
    """
    return "different" if p_value < 0.05 else "same"

if __name__ == "__main__":
    expected = np.array([122319, 295701, 101195])
    observed1 = np.array([1734, 3925, 1823])
    observed2 = np.array([122, 295, 101])

    p_value = ComputeChiSquareGOF(expected, observed1)
    print("Comparing distributions %s vs %s = %s" %
          (expected, observed1, MakeDecision(p_value)))

    p_value = ComputeChiSquareGOF(expected, observed2)
    print("Comparing distributions %s vs %s = %s" %
          (expected, observed2, MakeDecision(p_value)))

The output from running this test is:

Comparing distributions [122319 295701 101195] vs [1734 3925 1823] = different
Comparing distributions [122319 295701 101195] vs [122 295 101] = same
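
Since both HISTORICAL and ONE MONTH are observed samples (neither is a known theoretical distribution), a minimal alternative sketch is to stack them as the rows of a contingency table and run a chi-squared test of homogeneity with scipy.stats.chi2_contingency, which computes the expected counts internally from the table's margins; the variable names below are illustrative:

import numpy as np
import scipy.stats as stats

historical = np.array([122319, 295701, 101195])
one_month = np.array([1734, 3925, 1823])

# Each sample is one row of a 2x3 contingency table.
table = np.vstack([historical, one_month])

# chi2_contingency returns the statistic, the p-value, the degrees
# of freedom, and the table of expected counts.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print("p-value = %g -> %s" % (p_value, "different" if p_value < 0.05 else "same"))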