I'm trying to calculate a covariance matrix using weighted data in a single pass, and I'm not sure that I'm doing it correctly. I found a wikipedia article which gave the following python* code:
def online_covariance(data1, data2):
mean1 = mean2 = 0.
M12 = 0.
n = len(data1)
for i in range(n):
delta1 = (data1[i] - mean1) / (i + 1)
mean1 += delta1
delta2 = (data2[i] - mean2) / (i + 1)
mean2 += delta2
M12 += i * delta1 * delta2 - M12 / (i + 1)
return n / (n - 1.) * M12
I think I should be able to adapt this to include weights by sticking in weights in a few places, something like this:
def online_covariance(data1, data2, weights):
mean1 = mean2 = 0.
M12 = 0.
n = len(data1)
for i in range(n):
wt = weights[i]
delta1 = (data1[i] - mean1) * wt / (i + wt)
mean1 += delta1
delta2 = (data2[i] - mean2) * wt / (i + wt)
mean2 += delta2
M12 += i / wt * delta1 * delta2 - M12 * wt / (i + wt)
return n / (n - 1.) * M12
I'm aware that there may be some problems with the n / (n - 1.)
factor in the last line, but I'm not overly concerned with that since I have large statistics and weights are all small compared to the sum weight.
I've run some tests and the output looks reasonable, but my "derivation" here is pretty crude, is it valid? Is there some better reference on this than the wikipedia article I found?
*(note that I don't care about it being python, and in reality I'm calculating the full matrix, not just one element)
Best Answer
I dug into the derivation for the incremental, weighted variance algorithm. I came up with this, which gives the correct results. It turned out to be just a really simple modification of the variance algorithm. I've updated the wikipedia article with the code sample:
Note we use
dx
when incrementingC
; you can't do(x - meanx)
, since you need to use the oldmeanx
.