[Math] How to estimate sample mean and variance from derived data

st.statistics

Hi,

(I hope this is not too basic)

I basically have a set of data points all drawn from the same underlying distribution (which I would like to estimate), but the only values available to me are the mean and the count N of each partition of the data. I can estimate the mean using a weighted average and, I thought, the sample variance using a weighted variance (e.g. https://stat.ethz.ch/pipermail/r-help/2008-July/168762.html).

However, I can't make much sense of the results, and I suspect I'm misunderstanding how the weighted sample variance should work.

Edit: From a set of (unobserved) data points all drawn from some distribution with mean $\mu$ and standard deviation $\sigma$, I can observe only the average and the cardinality of each partition of the data. That is, each observed value $X_i$ is the average of a partition of size $N_i$, and thus has $\mu_i=\mu$ and $\sigma_i=\sigma/\sqrt{N_i}$.
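For concreteness, here is a minimal NumPy sketch of the setup as I understand it (the distribution, partition sizes, and seed are hypothetical, purely for illustration): the raw points are never kept, only each partition's average and size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth (unknown in practice).
mu, sigma = 10.0, 2.0

# Partition sizes N_i; only these and the partition averages are observed.
N = np.array([5, 20, 50, 8, 100])

# Each observed X_i is the average of N_i unobserved draws, so it has
# mean mu and standard deviation sigma / sqrt(N_i).
X = np.array([rng.normal(mu, sigma, n).mean() for n in N])

print(list(zip(N, np.round(X, 3))))
```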

Question: how can I estimate $\mu$ and $\sigma$ from the observed $X_i$s? Note that I can't observe the variance (or any other parameters) for the partitions.

The goal is to identify partitions that do not fit the expected distribution, i.e. identify partitions where $X_i$ is too many $\sigma_i$s from $\mu$.

Best Answer


If you don't know the variances $Var X_1, \dots, Var X_m$, you have to make parametric assumptions about the distribution of the data. For example, under the assumption of a Poisson distribution, the maximum likelihood estimator of the variance would simply be the aggregated mean $EX$. Under the assumption of a normal distribution, the intuitive answer

$$VarX = (1/m)\sum_i N_i (EX_i-\mu)^2$$ is the maximum likelihood estimator.

Assuming your distribution is normal $(\mu, \sigma^2)$, the maximum likelihood estimate of $\mu$ is the aggregated mean $(1/N)\sum_i N_i EX_i$, where $N=\sum_i N_i$ is the total count. The profile likelihood as a function of $\sigma^2$ is then

$$L(\sigma^2) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2/N_i}} \exp\left(\frac{-(EX_i - \mu)^2}{2\sigma^2/N_i}\right).$$

The maximum likelihood estimate of $\sigma^2$ is the maximizing argument of the above. For computational convenience we take the logarithm of the likelihood as it is easier to maximize.

$$\log L(\sigma^2) = \sum_i -\frac{1}{2}\log(2\pi/N_i) - \frac{1}{2}\log(\sigma^2) - \frac{(EX_i - \mu)^2}{2\sigma^2/N_i}$$

Setting the derivative to zero,

$$0=\frac{\partial \log L(\sigma^2)}{\partial \sigma^2} = \sum_i -\frac{1}{2\sigma^2} + \frac{N_i(EX_i-\mu)^2}{2\sigma^4}$$

and multiplying by $\sigma^4$,

$$ 0 = \sum_i -\frac{1}{2}\sigma^2 + \frac{N_i(EX_i-\mu)^2}{2}$$

which yields the maximum likelihood estimator

$$\sigma^2 = (1/m)\sum_i N_i (EX_i-\mu)^2.$$
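A minimal NumPy sketch of these two closed-form estimates, using hypothetical observed partition averages `X` (the $EX_i$ above) and sizes `N`:

```python
import numpy as np

# Hypothetical observed data: partition averages X_i and sizes N_i.
X = np.array([10.3, 9.8, 10.1, 11.2, 9.95])
N = np.array([5, 20, 50, 8, 100])
m = len(X)

# Aggregated mean: mu_hat = (1/N_total) * sum_i N_i * X_i
mu_hat = np.sum(N * X) / np.sum(N)

# Maximum likelihood estimate of the variance under normality:
# sigma2_hat = (1/m) * sum_i N_i * (X_i - mu_hat)^2
sigma2_hat = np.sum(N * (X - mu_hat) ** 2) / m

print(mu_hat, sigma2_hat)
```

Note that the variance estimate divides by the number of partitions $m$, not by the total count $\sum_i N_i$, because each term $N_i(EX_i-\mu)^2$ already has expectation $\sigma^2$.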

For a general parametric density $f(\theta)$ you need the densities of the convolutions $f * \cdots * f(\theta)$ (i.e. the densities of the partition sums), which makes the problem much more difficult outside of a few special parametric families. However, if all of your $N_i$ are large, you can use the normal distribution as an approximation by the Central Limit Theorem.
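To come back to the original goal of flagging partitions, one possible sketch (assuming the normal approximation and the two estimates above; the 3-standard-error cutoff is an arbitrary choice) scores each partition average by how many of its own standard errors $\hat\sigma/\sqrt{N_i}$ it sits from $\hat\mu$:

```python
import numpy as np

# Hypothetical observed data: partition averages X_i and sizes N_i.
X = np.array([10.3, 9.8, 10.1, 13.2, 9.95])
N = np.array([5, 20, 50, 8, 100])
m = len(X)

# Aggregated mean and MLE of sigma^2 as derived above.
mu_hat = np.sum(N * X) / np.sum(N)
sigma_hat = np.sqrt(np.sum(N * (X - mu_hat) ** 2) / m)

# z_i: how many of its own standard errors (sigma_hat / sqrt(N_i))
# each partition average sits from mu_hat.
z = (X - mu_hat) / (sigma_hat / np.sqrt(N))

# Flag partitions more than 3 standard errors away (arbitrary cutoff).
print(np.round(z, 2))
print("flagged partitions:", np.where(np.abs(z) > 3)[0])
```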
