[Math] Ratio of Averages vs Average of Ratios


I'm conducting a water savings analysis and I have a couple of questions. If you answer, please provide links to references so I can read up. Thank you! 🙂

Let's say that 10 people removed grass from their landscapes, and their change in water usage is on average 100 billing units. These 10 people removed an average of 1000 sqft of grass. What we want is to convert the billing units into gallons and estimate the average gallons saved per sqft.

The Ratio of Averages approach to get the average gallons saved per sqft removed would be to take $100/1000 \times$ (conversion factor).
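(For example, if one billing unit is one CCF, a hundred cubic feet or roughly 748 gallons, this works out to $100 \times 748 / 1000 \approx 74.8$ gallons saved per sqft; your utility's conversion factor may differ.)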

However, these 10 people removed anywhere from 10 to 9000 sqft of turf, and the median is 500 sqft. Slicing the change in water usage by sqft category, we see that the savings per sqft decreases as the amount of sqft removed increases.

Because of these outliers and other data conditions, it would seem that computing each of the 10 people's actual savings per square foot and then averaging that variable (the Average of Ratios) would give a more representative result.

Which method is correct? Is one method acceptable and the other not?

I was also told that by dividing each participant's change in water use by their square feet removed, you create a dimensionless number. How is that so? If person number 5 wanted to know their savings per square foot, I would take their difference in usage and divide it by their sqft. How does that become dimensionless?

Please provide links to sources so I can read up on them.

Best Answer

It depends on what you want to do. Below I formalize the problem and show some related estimation theory results. You have $n$ samples of data as ordered pairs $$(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)$$ where $$(X_i,Y_i) = (\mbox{area of sample $i$ (square foot)},\mbox{savings of sample $i$ (gallons)})$$ The $X_i$ value may be correlated with the $Y_i$ value. Assume $X_i>0$ for all $i \in \{1, 2, 3, ...\}$. You are considering these metrics:

  • Ratio of averages: $$ R_n = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n X_i} = \frac{\frac{1}{n}\sum_{i=1}^n Y_i}{\frac{1}{n}\sum_{i=1}^n X_i} $$ If $\{(X_i,Y_i)\}_{i=1}^{\infty}$ are i.i.d. vectors (with possibly correlated $X_i, Y_i$ values) then $R_n\rightarrow \frac{E[Y_1]}{E[X_1]}$ with prob 1.

  • Average of ratios: $$ M_n = \frac{1}{n}\sum_{i=1}^n \frac{Y_i}{X_i} $$ If $\{(X_i,Y_i)\}_{i=1}^{\infty}$ are i.i.d. vectors then $M_n\rightarrow E[\frac{Y_1}{X_1}]$ with prob 1.

The $R_n$ value represents an aggregate ratio: (total gallons saved over all samples)/(total area of all samples). The $M_n$ value is an average of the individual ratios.
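To make the difference concrete, here is a minimal Python sketch of both metrics; the arrays `x` and `y` are made-up illustrative numbers, not data from the question:

```python
import numpy as np

# Hypothetical example data for n = 10 participants:
# x = area removed (sqft), y = savings (gallons, already converted).
x = np.array([10, 120, 300, 450, 500, 600, 800, 1500, 4000, 9000], dtype=float)
y = np.array([5, 40, 90, 130, 140, 160, 200, 330, 700, 1200], dtype=float)

# Ratio of averages: R_n = (sum of Y_i) / (sum of X_i)
r_n = y.sum() / x.sum()

# Average of ratios: M_n = mean of the individual ratios Y_i / X_i
m_n = np.mean(y / x)

print(f"Ratio of averages R_n = {r_n:.4f} gallons/sqft")
print(f"Average of ratios  M_n = {m_n:.4f} gallons/sqft")
```

Note that $M_n$ weights every participant equally, while $R_n$ weights participants by their area, so with skewed data like that described in the question the two can differ substantially.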


Suppose a new (independent) person has an area $X$ and wants to predict his/her savings $Y$ using your data via some prediction $\hat{Y}$ that is a function of $X$ (where the function is formed from your data). Assume $(X,Y)$ has the same distribution as your data vectors $(X_i,Y_i)$ but is independent of them. Also assume the data sample vectors are i.i.d. Let's use simple linear predictors of the form $\hat{Y} = aX + b$.

  1. Predictor $\hat{Y} = R_n X$. To understand performance, suppose $n$ is large so $R_n \approx \frac{E[Y]}{E[X]}$. Then $$\hat{Y} \approx \frac{E[Y]}{E[X]}X \implies E[\hat{Y}] \approx \frac{E[Y]}{E[X]}E[X] = E[Y]$$ and so this predictor is "approximately unbiased," meaning that $E[\hat{Y}]\approx E[Y]$. For the mean-square error we have \begin{align} &(\hat{Y}-Y)^2 \approx \left(\frac{E[Y]}{E[X]}X - Y\right)^2 \\ &\implies E[(\hat{Y}-Y)^2] \approx \frac{E[Y]^2E[X^2]}{E[X]^2} + E[Y^2] - \frac{2E[XY]E[Y]}{E[X]} \end{align} and you can approximate the statistics $E[Y^2], E[X^2], E[XY]$ via \begin{align} E[X^2] &\approx \frac{1}{n}\sum_{i=1}^n X_i^2\\ E[Y^2] &\approx \frac{1}{n}\sum_{i=1}^n Y_i^2\\ E[XY] &\approx \frac{1}{n}\sum_{i=1}^n X_iY_i \end{align}

  2. Predictor $\hat{Y} = M_n X$. Suppose $n$ is large so $M_n \approx E[\frac{Y}{X}]$. Then $$ \hat{Y} \approx E[\frac{Y}{X}]X \implies E[\hat{Y}] \approx E[\frac{Y}{X}]E[X]$$ and so this estimator is generally biased, since $E[\frac{Y}{X}]E[X] \neq E[Y]$ unless $\frac{Y}{X}$ and $X$ are uncorrelated. You can calculate the mean square error in the same way and then compare.

  3. Best linear estimator. Consider a predictor of the form $\hat{Y} = aX$. Using standard estimation theory, we want to optimize the coefficient $a$ to get the smallest mean-square error $E[(\hat{Y}-Y)^2]$. We have $$ E[(\hat{Y}-Y)^2] = E[(aX-Y)^2] = a^2E[X^2] -2aE[XY] + E[Y^2]$$ Taking $\partial/\partial a = 0$ gives $$ 2aE[X^2] - 2E[XY] =0 \implies a^* = \frac{E[XY]}{E[X^2]}$$ so the best linear predictor is: $$ \hat{Y} = \left(\frac{E[XY]}{E[X^2]}\right)X$$ and you can approximate the statistics $E[XY]$ and $E[X^2]$ from the data samples $(X_i,Y_i)$ as above. This introduces a third metric $Z_n$ that can be compared to $R_n$ and $M_n$: $$ \boxed{Z_n = \frac{\frac{1}{n}\sum_{i=1}^n X_iY_i}{\frac{1}{n}\sum_{i=1}^nX_i^2} \approx \frac{E[XY]}{E[X^2]}}$$

  4. Best affine estimator. Consider predictors of the form $\hat{Y} = aX + b$. Optimizing the coefficients $a$ and $b$ gives: \begin{align} E[(aX+b-Y)^2] = a^2E[X^2] + b^2 + E[Y^2] + 2abE[X] - 2aE[XY] - 2bE[Y] \end{align} Taking $\partial/\partial a = \partial/\partial b = 0$ gives \begin{align} 2aE[X^2] + 2bE[X] -2E[XY]&=0 \\ 2b + 2aE[X] - 2E[Y] &= 0 \end{align} from which I get (assuming $Var(X)>0$ so $E[X^2]>E[X]^2$): $$ \boxed{a^* = \frac{E[XY]-E[X]E[Y]}{E[X^2]-E[X]^2} = \frac{Cov(X,Y)}{Var(X)}}$$ $$ \boxed{b^* = E[Y] - a^*E[X]}$$ So this estimator $\hat{Y} = a^*X + b^*$ is unbiased and has the smallest mean square error over all predictors of the form $\hat{Y} = aX+b$. The result $Cov(X,Y)/Var(X)$ is quite famous (it is the slope of the ordinary least squares regression line). You can compute approximate values $\tilde{a}^*$, $\tilde{b}^*$ by approximating the statistics $E[X], E[Y], E[X^2], E[XY]$ from the data samples $(X_i,Y_i)$ as described above; a small simulation comparing all four predictors is sketched below.
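Here is a minimal Python sketch comparing the four predictors on simulated data; the data-generating model (lognormal areas, savings growing sublinearly in area plus noise) is an arbitrary assumption for illustration, not a claim about real turf-removal data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating model (an assumption for illustration):
# areas X are lognormal, savings Y grow sublinearly in X plus noise,
# mimicking the "savings per sqft decreases with size" effect.
n = 10_000
x = rng.lognormal(mean=6.0, sigma=1.0, size=n)    # sqft removed
y = 2.0 * x**0.8 + rng.normal(0.0, 10.0, size=n)  # gallons saved

# Coefficients of the four predictors, estimated from the sample.
r_n = y.sum() / x.sum()                        # 1. ratio of averages
m_n = np.mean(y / x)                           # 2. average of ratios
z_n = np.mean(x * y) / np.mean(x**2)           # 3. best linear through origin
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # 4. best affine slope Cov/Var
b = y.mean() - a * x.mean()                    #    and intercept

# In-sample mean-square error of each predictor.
for name, y_hat in [("R_n * X", r_n * x),
                    ("M_n * X", m_n * x),
                    ("Z_n * X", z_n * x),
                    ("a*X + b", a * x + b)]:
    print(f"{name}: MSE = {np.mean((y_hat - y) ** 2):,.1f}")
```

By construction the affine fit has the smallest in-sample MSE, and $Z_n X$ is the best of the through-origin predictors. Under these particular settings $M_n X$ comes out worst, because the equal weighting pulls it toward the high per-sqft ratios of the small lots; different model settings can change the ranking of $R_n X$ and $M_n X$.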