Solved – Difference between averaging data then fitting and fitting the data then averaging

error, fitting, mean

What is the difference, if any, between fitting a line to multiple separate "experiments" and then averaging the fits, versus averaging the data from the separate experiments and then fitting the averaged data? Let me elaborate:

I perform computer simulations which generate a curve, shown below. We extract a quantity, let's call it "A", by fitting the linear region of the plot (long times). The value is simply the slope of the linear region. There is of course an error associated with this linear regression.

We typically run 100 or so of these simulations with different initial conditions to calculate an average value of "A". I have been told that it is better to average the raw data (of the plot below) into groups of, say, 10, then fit each group average for "A" and average those 10 "A"'s together.

I have no intuition for whether there's any merit to that or if it is any better than fitting 100 individual "A" values and averaging those.

Best Answer

Imagine we're in a panel data context where there's variation across time $t$ and across firms $i$. Think of each time period $t$ as a separate experiment. I understand your question as whether it's equivalent to estimate an effect using:

Cross-sectional variation in time series averages.
Time series averages of cross-sectional variation.

The answer in general is no.

The setup:

In my formulation, we can think of each time period $t$ as a separate experiment.

Let's say you have a balanced panel of length $T$ over $n$ firms. If we split the data by time period into pairs $(X_t, \mathbf{y}_t)$, we can write the overall data as:

$$ Y = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_T \end{bmatrix} \quad \quad X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_T \end{bmatrix} $$

Average of fits:

\begin{align*} \frac{1}{T} \sum_t \mathbf{b}_t &= \frac{1}{T} \sum_t \left(X_t'X_t \right)^{-1} X_t' \mathbf{y}_t \\ &= \frac{1}{T} \sum_t S^{-1}_t \left( \frac{1}{n} \sum_i \mathbf{x}_{t,i} y_{t,i}\right) \quad \text{where } S_t = \frac{1}{n} \sum_i \mathbf{x}_{t,i} \mathbf{x}_{t,i}' \end{align*}

Fit of averages:

The average of the per-period fits isn't in general equal to the estimate based upon cross-sectional variation of time series averages (i.e. the between estimator):

$$ \left( \frac{1}{n} \sum_i \bar{\mathbf{x}}_i \bar{\mathbf{x}}_i' \right)^{-1} \frac{1}{n} \sum_i \bar{\mathbf{x}}_i \bar{y}_i $$

where $\bar{\mathbf{x}}_i = \frac{1}{T} \sum_t \mathbf{x}_{t, i}$, and $\bar{y}_i$ is defined analogously.
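To see the two estimators disagree on actual numbers, here is a minimal sketch. It assumes a single regressor with no intercept (so every matrix above collapses to a scalar) and uses made-up data; the helper names `average_of_fits` and `fit_of_averages` are hypothetical, chosen only for this illustration.

```python
import random

def average_of_fits(x, y):
    """Mean over periods t of the per-period OLS slope b_t = sum_i(x*y) / sum_i(x^2)."""
    slopes = []
    for x_t, y_t in zip(x, y):
        sxy = sum(a * b for a, b in zip(x_t, y_t))
        sxx = sum(a * a for a in x_t)
        slopes.append(sxy / sxx)
    return sum(slopes) / len(slopes)

def fit_of_averages(x, y):
    """OLS slope (no intercept) through the firm-level time averages (between estimator)."""
    T, n = len(x), len(x[0])
    x_bar = [sum(x[t][i] for t in range(T)) / T for i in range(n)]
    y_bar = [sum(y[t][i] for t in range(T)) / T for i in range(n)]
    sxy = sum(a * b for a, b in zip(x_bar, y_bar))
    sxx = sum(a * a for a in x_bar)
    return sxy / sxx

# Made-up balanced panel: T periods, n firms, regressor varies over both t and i.
rng = random.Random(0)
T, n, true_slope = 5, 50, 2.0
x = [[rng.gauss(1.0, 1.0) for _ in range(n)] for _ in range(T)]
y = [[true_slope * x[t][i] + rng.gauss(0.0, 1.0) for i in range(n)] for t in range(T)]

b_avg_fits = average_of_fits(x, y)
b_fit_avgs = fit_of_averages(x, y)
print(b_avg_fits, b_fit_avgs)  # both near the true slope, but not equal to each other
```

Both numbers estimate the same slope here, but they are different functions of the data, so they do not coincide when the regressor varies across periods.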

Pooled OLS estimate:

Something perhaps useful to think about is the pooled OLS estimate. What is it?

\begin{align*} \hat{\mathbf{b}} &= \left(X'X\right)^{-1}X'Y \\ &= \left( \frac{1}{nT} \sum_t X_t'X_t \right)^{-1} \left( \frac{1}{nT} \sum_t X_t' \mathbf{y}_t \right) \end{align*}

Since $\mathbf{b}_t = \left(X_t'X_t \right)^{-1}X_t' \mathbf{y}_t$, we have $X_t'\mathbf{y}_t = X_t'X_t \mathbf{b}_t$, and so:

\begin{align*} \hat{\mathbf{b}} &= \left( \frac{1}{nT} \sum_t X_t'X_t \right)^{-1} \left( \frac{1}{nT} \sum_t X_t'X_t \mathbf{b}_t \right) \end{align*}

Let $S = \frac{1}{nT} X'X = \frac{1}{nT} \sum_t X_t'X_t$ and $S_t = \frac{1}{n} X_t'X_t $ be our estimates of $\operatorname{E}[\mathbf{x}\mathbf{x}']$ over the full sample and in period $t$ respectively. Then we have:

\begin{align*} \hat{\mathbf{b}} &= \frac{1}{T} \sum_t \left( S^{-1} S_t \right) \mathbf{b}_t \end{align*}
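The weighted-average identity can be checked numerically. The following is a sketch under the same simplifying assumption of a single regressor with no intercept, so that $S$, $S_t$ and $\mathbf{b}_t$ are all scalars; the data are made up, and the regressor spread is deliberately different in each period so that the weights $S^{-1}S_t$ are not all equal to one.

```python
import random

rng = random.Random(1)
T, n = 4, 30

# Made-up panel; the regressor's standard deviation grows with t so S_t varies.
x = [[rng.gauss(0.0, 1.0 + t) for _ in range(n)] for t in range(T)]
y = [[1.5 * x[t][i] + rng.gauss(0.0, 1.0) for i in range(n)] for t in range(T)]

# Per-period second moments S_t = (1/n) X_t'X_t and per-period OLS slopes b_t.
S_t = [sum(a * a for a in x[t]) / n for t in range(T)]
b_t = [sum(a * b for a, b in zip(x[t], y[t])) / (n * S_t[t]) for t in range(T)]

# Full-sample second moment S = (1/(nT)) sum_t X_t'X_t, i.e. the mean of the S_t.
S = sum(S_t) / T

# Pooled OLS computed directly from the stacked data.
pooled = (sum(a * b for t in range(T) for a, b in zip(x[t], y[t]))
          / sum(a * a for t in range(T) for a in x[t]))

# Pooled OLS rewritten as a weighted average of the b_t, and the plain average.
weighted = sum((S_t[t] / S) * b_t[t] for t in range(T)) / T
simple = sum(b_t) / T

print(pooled, weighted, simple)  # pooled == weighted (up to rounding); simple differs
```

The weights $S_t / S$ average to one across periods, which is why the pooled estimate behaves like an average of the $\mathbf{b}_t$ that leans toward periods with more regressor variation.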

This is sort of like an average of the different time specific estimates $\mathbf{b}_t$, but it's a bit different. In some loose sense, you're giving more weight to periods with higher variance of the right hand side variables.

Special case: right hand side variables are time invariant and firm specific

If the right hand side variables for each firm $i$ are constant across time (i.e. $X_{t_1} = X_{t_2}$ for any $t_1$ and $t_2$) then $S = S_t$ for all $t$ and we would have:

$$\hat{\mathbf{b}} = \frac{1}{T} \sum_t \mathbf{b}_t$$
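The simulation setup in the question looks like this special case, if one assumes that every run is sampled on the same time grid and fitted over the same window (that is my assumption, not something stated in the question). Under that assumption the OLS slope is a linear function of the $y$ values, so the order of averaging and fitting should not change the point estimate. A minimal sketch with made-up data; `ols_slope` is a hypothetical helper:

```python
import random

def ols_slope(t, y):
    """Slope b of the least-squares line y = a + b*t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    return sxy / sxx

def mean_curve(curves):
    """Pointwise average of several curves sampled on the same grid."""
    return [sum(c[k] for c in curves) / len(curves) for k in range(len(curves[0]))]

# 100 made-up "simulations" on a shared time grid, true slope 0.7 plus noise.
rng = random.Random(2)
times = [0.1 * k for k in range(50)]
runs = [[0.7 * tk + rng.gauss(0.0, 0.5) for tk in times] for _ in range(100)]

# (a) Fit every run, then average the 100 slopes.
slopes = [ols_slope(times, run) for run in runs]
mean_of_slopes = sum(slopes) / len(slopes)

# (b) Average all runs into one curve, then fit once.
slope_of_mean = ols_slope(times, mean_curve(runs))

# (c) Average in groups of 10, fit each group curve, average the 10 slopes.
group_slopes = [ols_slope(times, mean_curve(runs[g:g + 10])) for g in range(0, 100, 10)]
mean_of_group_slopes = sum(group_slopes) / len(group_slopes)

print(mean_of_slopes, slope_of_mean, mean_of_group_slopes)  # agree up to rounding
```

In this setting the three routes give the same slope, so any practical difference between them would have to come from how the uncertainty on "A" is estimated, not from the estimate itself.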

Fun comment:

This is the case Fama and MacBeth were in when they applied this technique of averaging cross-sectional estimates to obtain consistent standard errors when estimating how expected returns vary with firms' covariance with the market (or other factor loadings).

The Fama-MacBeth procedure is an intuitive way to get consistent standard errors in the panel context when error terms are cross-sectionally correlated but independent across time. A more modern technique that yields similar results is clustering on time.
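As a rough sketch of the procedure: run one cross-sectional regression per period, report the mean of the per-period estimates, and take the standard error from how much those estimates vary across periods. The example assumes a single time-invariant regressor with no intercept and made-up data with a common shock in each period (to mimic cross-sectional correlation); the function name `fama_macbeth` is hypothetical.

```python
import math
import random

def fama_macbeth(x, y):
    """Return (mean of per-period OLS slopes, standard error of that mean)."""
    slopes = []
    for x_t, y_t in zip(x, y):
        slopes.append(sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t))
    T = len(slopes)
    mean = sum(slopes) / T
    # Sample variance of the b_t across periods, then standard error of their mean.
    var = sum((b - mean) ** 2 for b in slopes) / (T - 1)
    return mean, math.sqrt(var / T)

# Made-up data: firm-specific loadings that do not change over time.
rng = random.Random(3)
T, n, true_slope = 60, 40, 1.0
loadings = [rng.uniform(0.5, 1.5) for _ in range(n)]
x = [loadings for _ in range(T)]

y = []
for t in range(T):
    shock = rng.gauss(0.0, 0.5)  # common to all firms in period t
    y.append([true_slope * b + shock + rng.gauss(0.0, 0.2) for b in loadings])

estimate, std_err = fama_macbeth(x, y)
print(estimate, std_err)
```

Because the standard error is built from the spread of the $\mathbf{b}_t$ across periods, it does not rely on the errors being uncorrelated across firms within a period, only on independence across periods.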

Related Solutions

Solved – Difference between simultaneous fitting and separate fitting

What you probably want to do is first run a PCA and extract only the most significant principal components, and thereafter do a fit to only those principal components.

Generally speaking, your problem falls under the domain of dimensionality reduction, see this wiki: http://en.wikipedia.org/wiki/Dimension_reduction

PCA is probably the simplest method to do so, although it isn't "true" dimensionality reduction, since strictly speaking you still need all the original data to generate your principal components. Still, it should reduce the complexity of your problem somewhat.

Incidentally, one large pitfall of doing separate fits is that the variables you are fitting over may be collinear, so that the combined fit barely adds any value over a fit to one variable or the other.

Solved – When fitting a curve, how to calculate the 95% confidence interval for the fitted parameters

The problem with linearizing and then using linear regression is that the assumption of a Gaussian distribution of residuals is not likely to be true for the transformed data.

It is usually better to use nonlinear regression. Most nonlinear regression programs report the standard error and confidence interval of the best-fit parameters. If yours doesn't, these equations may help.

Each standard error is computed using this equation:

SE(Pi) = sqrt[ (SS/DF) * Cov(i,i) ]

Pi : i-th adjustable (non-constant) parameter
SS : sum of squared residuals
DF : degrees of freedom (the number of data points minus number of parameters fit by regression)
Cov(i,i) : i-th diagonal element of covariance matrix
sqrt() : square root

And here is the equation to compute the confidence interval for each parameter from the best-fit value, its standard error, and the number of degrees of freedom.

From [BestFit(Pi) - t(95%,DF)*SE(Pi)]  TO  [BestFit(Pi) + t(95%,DF)*SE(Pi)]

BestFit(Pi) is the best fit value for the i-th parameter
t is the value from the t distribution for 95% confidence for the specified number of DF.
DF is degrees of freedom.

Example with Excel for 95% confidence (so alpha = 0.05) and 23 degrees of freedom: =TINV(0.05,23)
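The two equations translate directly into code. This is a minimal sketch using only the standard library: the input numbers are hypothetical, the function names are mine, and the critical value 2.0687 is the two-tailed 95% t value for 23 degrees of freedom as I know it (what the Excel formula above should return, to four decimals), hard-coded here because Python's standard library has no t distribution.

```python
import math

def standard_error(ss, df, cov_ii):
    """SE(Pi) = sqrt[(SS/DF) * Cov(i,i)]."""
    return math.sqrt((ss / df) * cov_ii)

def confidence_interval(best_fit, se, t_crit):
    """Interval from BestFit - t*SE to BestFit + t*SE."""
    half_width = t_crit * se
    return best_fit - half_width, best_fit + half_width

# Hypothetical fit results, for illustration only.
ss = 46.0        # sum of squared residuals
df = 23          # data points minus fitted parameters
cov_ii = 0.02    # i-th diagonal element of the covariance matrix
best_fit = 3.0   # best-fit value of the i-th parameter

t_crit = 2.0687  # assumed two-tailed 95% critical value for DF = 23

se = standard_error(ss, df, cov_ii)
low, high = confidence_interval(best_fit, se, t_crit)
print(se, low, high)
```

If your fitting program already reports a scaled covariance matrix (one that has been multiplied by SS/DF), the standard error is just the square root of its diagonal element, so check which convention it uses before applying the first formula.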