Solved – Difference between averaging data then fitting and fitting the data then averaging

Tags: error, fitting, mean

What is the difference, if any, between fitting a line to multiple separate "experiments" and then averaging the fits, versus averaging the data from the separate experiments and then fitting the averaged data? Let me elaborate:

I perform computer simulations which generate a curve, shown below. We extract a quantity, let's call it "A", by fitting the linear region of the plot (long times). The value is simply the slope of the linear region. There is of course an error associated with this linear regression.

We typically run 100 or so of these simulations with different initial conditions to calculate an average value of "A". I have been told that it is better to average the raw data (of the plot below) into groups of say 10, then fit for "A" and average those 10 "A"'s together.

I have no intuition for whether there's any merit to that or if it is any better than fitting 100 individual "A" values and averaging those.

[Figure: data]

Best Answer

Imagine we're in a panel data context where there's variation across time $t$ and across firms $i$. Think of each time period $t$ as a separate experiment. I understand your question as whether it's equivalent to estimate an effect using:

  • Cross-sectional variation in time series averages.
  • Time series averages of cross-sectional variation.

The answer in general is no.

The setup:

In my formulation, we can think of each time period $t$ as a separate experiment.

Let's say you have a balanced panel of length $T$ over $n$ firms. If we break the data apart by time period into pairs $(X_t, \mathbf{y}_t)$, we can write the overall data as the stack of the $T$ periods:

$$ Y = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_T \end{bmatrix} \quad \quad X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_T \end{bmatrix} $$

Average of fits:

\begin{align*} \frac{1}{T} \sum_t \mathbf{b}_t &= \frac{1}{T} \sum_t \left(X_t'X_t \right)^{-1} X_t' \mathbf{y}_t \\ &= \frac{1}{T} \sum_t S^{-1}_t \left( \frac{1}{n} \sum_i \mathbf{x}_{t,i} y_{t,i}\right) \quad \text{where } S_t = \frac{1}{n} \sum_i \mathbf{x}_{t,i} \mathbf{x}_{t,i}' \end{align*}

Fit of averages:

The average of fits isn't in general equal to the estimate based upon cross-sectional variation of time series averages (i.e. the between estimator), which is:

$$ \left( \frac{1}{n} \sum_i \bar{\mathbf{x}}_i \bar{\mathbf{x}}_i' \right)^{-1} \frac{1}{n} \sum_i \bar{\mathbf{x}}_i \bar{y}_i $$

Where $\bar{\mathbf{x}}_i = \frac{1}{T} \sum_t \mathbf{x}_{t, i}$ etc...
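To make the distinction concrete, here is a minimal numerical sketch comparing the two estimators on a simulated balanced panel. The panel sizes, the coefficient vector `beta`, and the Gaussian data are all made up for illustration; nothing here comes from the original question's simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, k = 20, 50, 2                      # periods, firms, regressors (illustrative sizes)
beta = np.array([1.0, -0.5])             # hypothetical true coefficients

X = rng.normal(size=(T, n, k))           # X[t, i] plays the role of x_{t,i}
y = X @ beta + rng.normal(size=(T, n))   # y[t, i] plays the role of y_{t,i}

# Average of fits: run OLS separately in each period t, then average the b_t.
b_t = np.stack([np.linalg.lstsq(X[t], y[t], rcond=None)[0] for t in range(T)])
avg_of_fits = b_t.mean(axis=0)

# Fit of averages (between estimator): average over t for each firm, then one OLS.
X_bar = X.mean(axis=0)                   # shape (n, k)
y_bar = y.mean(axis=0)                   # shape (n,)
between = np.linalg.lstsq(X_bar, y_bar, rcond=None)[0]

print("average of fits:", avg_of_fits)
print("fit of averages:", between)
```

Both estimates land near `beta` in this setup, but they are not the same numbers, because the regressors differ from period to period.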

Pooled OLS estimate:

Something perhaps useful to think about is the pooled OLS estimate. What is it?

\begin{align*} \hat{\mathbf{b}} &= \left(X'X\right)^{-1}X'Y \\ &= \left( \frac{1}{nT} \sum_t X_t'X_t \right)^{-1} \left( \frac{1}{nT} \sum_t X_t' \mathbf{y}_t \right) \end{align*}

Then use $\mathbf{b}_t = \left(X_t'X_t \right)^{-1}X_t' \mathbf{y}_t$, so that $X_t' \mathbf{y}_t = X_t'X_t \mathbf{b}_t$:

\begin{align*} \hat{\mathbf{b}} &= \left( \frac{1}{nT} \sum_t X_t'X_t \right)^{-1} \left( \frac{1}{nT} \sum_t X_t'X_t \mathbf{b}_t \right) \end{align*}

Let $S = \frac{1}{nT} X'X = \frac{1}{T} \sum_t S_t$ and $S_t = \frac{1}{n} X_t'X_t $ be our estimates of $\operatorname{E}[\mathbf{x}\mathbf{x}']$ over the full sample and in period $t$ respectively. Then we have:

\begin{align*} \hat{\mathbf{b}} &= \frac{1}{T} \sum_t \left( S^{-1} S_t \right) \mathbf{b}_t \end{align*}
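The matrix-weighted-average form of the pooled estimate can be checked numerically. The sketch below uses simulated data with a different regressor scale in each period (an assumption made purely so that $S_t$ varies across $t$); the sizes and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, k = 8, 40, 3                                   # illustrative sizes
# Give each period its own regressor scale so that S_t differs across t.
scales = rng.uniform(0.5, 2.0, size=(T, 1, 1))
X = scales * rng.normal(size=(T, n, k))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=(T, n))

# Pooled OLS on the stacked data (periods stacked on top of each other).
b_pooled = np.linalg.lstsq(X.reshape(T * n, k), y.reshape(T * n), rcond=None)[0]

# Period-by-period pieces.
S_t = np.einsum("tij,til->tjl", X, X) / n            # S_t = X_t'X_t / n, shape (T, k, k)
S = S_t.mean(axis=0)                                 # S = X'X / (nT)
b_t = np.stack([np.linalg.solve(S_t[t], X[t].T @ y[t] / n) for t in range(T)])

# Matrix-weighted average: (1/T) * sum_t S^{-1} S_t b_t.
b_weighted = np.mean([np.linalg.solve(S, S_t[t] @ b_t[t]) for t in range(T)], axis=0)
b_simple = b_t.mean(axis=0)                          # unweighted average of the fits

print("pooled OLS:      ", b_pooled)
print("weighted average:", b_weighted)
print("simple average:  ", b_simple)
```

The pooled and weighted results agree to floating-point precision, while the simple average differs because the weights $S^{-1} S_t$ are not identity matrices here.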

This is sort of like an average of the different time specific estimates $\mathbf{b}_t$, but it's a bit different. In some loose sense, you're giving more weight to periods with higher variance of the right hand side variables.

Special case: right hand side variables are time invariant and firm specific

If the right hand side variables for each firm $i$ are constant across time (i.e. $X_{t_1} = X_{t_2}$ for any $t_1$ and $t_2$) then $S = S_t$ for all $t$ and we would have:

$$\hat{\mathbf{b}} = \frac{1}{T} \sum_t \mathbf{b}_t$$
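This special case looks like the situation in the question if every simulation run is fitted over the same time grid, since the design matrix (a column of ones and the time values) is then identical across runs. The sketch below assumes exactly that, using made-up noisy straight lines in place of the real simulation output; `true_slope` and the fitting window are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_runs, n_points = 100, 50
t = np.linspace(10.0, 20.0, n_points)           # shared fitting window (assumed)
true_slope = 0.3                                # hypothetical value of "A"
# Each row is one simulated run: a straight line plus independent noise.
curves = true_slope * t + 1.0 + rng.normal(scale=0.5, size=(n_runs, n_points))

def slope(y):
    """Least-squares slope of y against the shared time grid t."""
    return np.polyfit(t, y, 1)[0]

# Fit every run, then average the 100 slopes.
fit_then_average = np.mean([slope(y) for y in curves])
# Average all 100 runs pointwise, then fit once.
average_then_fit = slope(curves.mean(axis=0))
# Average in groups of 10 runs, fit each group mean, then average the 10 slopes.
group_means = curves.reshape(10, 10, n_points).mean(axis=1)
grouped = np.mean([slope(y) for y in group_means])

print(fit_then_average, average_then_fit, grouped)
```

Under these assumptions (unweighted least squares, identical time grid, equal group sizes) the three point estimates coincide up to rounding, because OLS is linear in the response. The sketch says nothing about how the error estimate on "A" should be computed, and the equality would not hold if the fitting window differed between runs.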

Fun comment:

This is the case Fama and MacBeth were in when they applied this technique of averaging cross-sectional estimates to obtain consistent standard errors when estimating how expected returns vary with firms' covariance with the market (or other factor loadings).

The Fama-MacBeth procedure is an intuitive way to get consistent standard errors in the panel context when error terms are cross-sectionally correlated but independent across time. A more modern technique that yields similar results is clustering on time.
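A minimal sketch of the two-step idea follows: run one cross-sectional regression per period, then take the time-series mean of the estimates and use their time-series spread for the standard error. The data-generating process (a time-invariant characteristic `x` plus a period-wide shock `common`) is invented for illustration and is not taken from Fama and MacBeth's data.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 60, 30                               # illustrative numbers of periods and firms
x = rng.normal(size=n)                      # time-invariant firm characteristic (hypothetical)
X = np.column_stack([np.ones(n), x])        # same design matrix in every period
common = rng.normal(size=(T, 1))            # shock shared by all firms within a period
y = 0.5 * x + common + rng.normal(size=(T, n))

# Step 1: one cross-sectional OLS per period.
b_t = np.stack([np.linalg.lstsq(X, y[t], rcond=None)[0] for t in range(T)])

# Step 2: the point estimate is the time-series mean of the b_t; the standard
# error uses their time-series spread, which assumes independence across periods.
b_fm = b_t.mean(axis=0)
se_fm = b_t.std(axis=0, ddof=1) / np.sqrt(T)

print("estimate (intercept, slope):", b_fm)
print("standard errors:            ", se_fm)
```

Because the design matrix is the same in every period, `b_fm` here equals the pooled OLS estimate; the procedure's contribution is the standard error, which does not rely on errors being uncorrelated across firms within a period.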
