Solved – Online weighted covariance

covarianceonline-algorithmspython

I'm trying to calculate a covariance matrix using weighted data in a single pass, and I'm not sure that I'm doing it correctly. I found a wikipedia article which gave the following python* code:

def online_covariance(data1, data2):
    mean1 = mean2 = 0.
    M12 = 0.
    n = len(data1)
    for i in range(n):
        delta1 = (data1[i] - mean1) / (i + 1)
        mean1 += delta1
        delta2 = (data2[i] - mean2) / (i + 1)
        mean2 += delta2
        M12 += i * delta1 * delta2 - M12 / (i + 1)
    return n / (n - 1.) * M12

I think I should be able to adapt this to include weights by sticking in weights in a few places, something like this:

def online_covariance(data1, data2, weights):
    mean1 = mean2 = 0.
    M12 = 0.
    n = len(data1)
    for i in range(n):
        wt = weights[i]
        delta1 = (data1[i] - mean1) * wt / (i + wt)
        mean1 += delta1
        delta2 = (data2[i] - mean2) * wt / (i + wt)
        mean2 += delta2
        M12 += i / wt * delta1 * delta2 - M12 * wt / (i + wt)
    return n / (n - 1.) * M12

I'm aware that there may be some problems with the n / (n - 1.) factor in the last line, but I'm not overly concerned with that since I have large statistics and weights are all small compared to the sum weight.

I've run some tests and the output looks reasonable, but my "derivation" here is pretty crude, is it valid? Is there some better reference on this than the wikipedia article I found?

*(note that I don't care about it being python, and in reality I'm calculating the full matrix, not just one element)

Best Answer

I dug into the derivation for the incremental, weighted variance algorithm. I came up with this, which gives the correct results. It turned out to be just a really simple modification of the variance algorithm. I've updated the wikipedia article with the code sample:

def online_weighted_covariance(data1, data2, data3):
    meanx = meany = 0
    wsum = wsum2 = 0
    C = 0
    for x, y, w in zip(data1, data2, data3):
        wsum += w
        wsum2 += w*w
        dx = x - meanx
        meanx += (w / wsum) * dx
        meany += (w / wsum) * (y - meany)
        C += w * dx * (y - meany)

    population_covar = C / wsum
    # Bessel's correction for sample variance
    # Frequency weights
    sample_frequency_covar = C / (wsum - 1)
    # Reliability weights
    sample_reliability_covar = C / (wsum - wsum2 / wsum)

Note we use dx when incrementing C; you can't do (x - meanx), since you need to use the old meanx.