Solved – Looking for an unbiased version of the empirical cumulative distribution function that I can interpolate

cumulative distribution functionempirical-cumulative-distr-fninterpolationunbiased-estimator

Most definitions of the ECDF define it as (#elements <= threshold) / #elements. Matlab and R both implement their ecdf() functions using this formula.

In my testing, however, I find that there is a small bias to this estimate when creating samples using the standard uniform distribution, the standard normal distribution, or the beta distribution with common alpha and beta values.

To be clear, I am not talking about the bias of the ECDF as the number of samples goes to infinity. The ECDF is unbiased in this regard. I am talking about the bias that occurs when applying the ECDF to N number of samples, where N remains fixed, and a new set of N samples is generated over and over again. In this regard, the ECDF has a bias that is inversely proportional to N.

In our application, we are generating a new set of data each time the algorithm changes or the algorithm parameters change. Thus the underlying distribution, which is unknown, is different for each test. I therefore want the best estimation for the N samples of each test, but cannot combine samples from different test runs.

I am using interpolation to estimate the CDF for thresholds that are between sample values, and am assuming that the underlying distribution is continuous, but unknown. I am interpolating because our thresholds are continuous and we'd like the estimated CDF to be continuous, not quantized.

Wikipedia provides an alternative definition for the ECDF which does not have this bias and also estimates with less error: (#elements <= threshold) / (#elements + 1)

Should there be two definitions of the ECDF, one for estimating from discrete distributions and a different one for continuous ones?

Are there any proofs about the bias of an ECDF when estimating a CDF?

Here is a simple example that makes the problem obvious. Put four apples in a row and draw a line between apples 2 and 3. What percent of the apples are on each side of the line? The EDCF says the line through apple 2 is 50%, the line through 3 is 75%, and the interpolated value is 62.5%, so the estimate is that 62.5% of the apples are to the left of the line that is midway between apples 2 and 3.

Here is my Matlab code that shows the bias:

% Method 1 - ECDF
% Method 2 - Alternative ECDF
cdf = estimateCdf(-2:2, 0.0);
assert(cdf(1) == 0.6);
assert(cdf(2) == 0.5);

numTests = 100 * 1000;
numSamples = 10;
err=zeros(numTests, 2);

for i=1:numTests
    threshold = randn(1, 1) * 2;
    groundTruth = 1 - (1 - erf(threshold)) / 2;
    scores = sort(randn(1, numSamples) * sqrt(0.5)); %% ERF assumes a variance of 0.5
    err(i, :) = estimateCdf(scores, threshold) - groundTruth;
end

fprintf('\nTests = %d\n', numTests);
for method = 1:2
    fprintf('Method %d\n', method);
    fprintf('  RMS Error:  %1.6f\n', rms(err(:,method)));
    fprintf('  Bias:       %1.6f\n', mean(err(:,method)));
end

function cdf = estimateCdf(scores, threshold) 
    idxList = find(scores < threshold);
    cdf = zeros(1,2);

    % If there are no scores below threshold then CDF is zero
    if isempty(idxList)
        for method = 1:2
            cdf(method) = 0;
        end

    % If all scores are below threshold then CDF is 1
    elseif idxList(end) == numel(scores)
        for method = 1:2
            cdf(method) = 1;
        end

    % Interpolate scores between 2 nearest points to the threshold
    else
        x0 = idxList(end);
        x = [x0, x0 + 1];
        xIdx = interp1(scores(x) - threshold, x, 0, 'linear');

        % Interpolate the scores to a CDF
        cdf(1) = interp1(x, x / length(scores), xIdx, 'linear');
        cdf(2) = interp1(x, x / (length(scores) + 1), xIdx, 'linear');
    end
end

Best Answer

The empirical distribution function is an unbiased estimator of the true underlying distribution. For simplicity, let's assume the underlying distribution is absolutely continuous and admits a density $f$. If your sample consists of the iid points $\{X_n\}_{n=1}^N$, we can take the expectation of the ECDF:

$$ \begin{aligned} E(\hat{F}(x)) = E\big(\frac{1}{N}\sum_{n=1}^N I(X_n\leq x)\big) = \frac{1}{N}\sum_{n=1}^N E(I(X_n\leq x)) &= \frac{1}{N}\sum_{n=1}^N\int_{-\infty}^\infty I(y\leq x)f(y) dy\\ &=\frac{1}{N}\sum_{n=1}^N\int_{-\infty}^x f(y) dy\\ &= \frac{1}{N}\sum_{n=1}^N F(x)= F(x), \end{aligned} $$

where I am using $I$ as the indicator function. You could certainly interpolate this function, though I think upon doing any kind of interpolation you will introduce a bias to your estimate, e.g.

$$ \begin{aligned} E(\hat{F}(x)) &= E\bigg(\frac{1}{N}\sum_{n=1}^N I(X_n\leq x) + I(X_{(n)}\leq x \leq X_{(n+1)})\frac{x - X_{(n)}}{X_{(n+1)} - X_{(n)}}\bigg)\\ &= F(x)+\frac{1}{N}\sum_{n=1}^NE\bigg(I(X_{(n)}\leq x \leq X_{(n+1)})\frac{x - X_{(n)}}{X_{(n+1)} - X_{(n)}}\bigg), \end{aligned} $$ where $X_{(n)}$ denotes the $n$th ordered (ranked) statistic. Above is one way of writing out a simple linear interpolation between the points. You might be able to show this latter bias vanishes, but it will likely depend on the distribution and may not always be the case.