Solved – Bootstrapping Generalized Least Squares

Tags: bootstrap, generalized-least-squares, research-design

Scenario:

Consider the use of bootstrapping to estimate the distribution of model parameters fitted by linear or nonlinear generalized least squares. In particular, assume there is a covariance matrix $C$ of the errors which need not be diagonal, so the errors may be neither independent nor identically distributed.

The goal is to produce an accurate estimate of the distribution of the estimated model parameters, in order to support inference or to make predictions that account for uncertainty. That estimate would be either in the "lower fidelity" form of an estimated covariance matrix of the parameter estimates, or a set of generated points to feed an empirical likelihood calculation that does not assume Normality. For simplicity, I am assuming that the covariance matrix $C$ is known exactly, not imperfectly estimated.
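For reference, in the linear case with design matrix $X$ and known error covariance $C$, the full-sample GLS estimator and its usual Normal-theory covariance are
$$\hat{\beta} = (X^\top C^{-1} X)^{-1} X^\top C^{-1} y, \qquad \operatorname{Cov}(\hat{\beta}) = (X^\top C^{-1} X)^{-1},$$
and the bootstrap below is intended to estimate the distribution of $\hat{\beta}$ without relying on that Normal-theory form.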

Proposed Bootstrap Method:

Each of the $n$ data points in the full sample is sampled with replacement, with equal probability, to form a bootstrap sample. For each bootstrap sample, the rows and columns of $C$ corresponding to data points not selected for that sample are deleted. For each pair of data points $(i,j)$ that each appear at least once in the sample, the corresponding entry of the bootstrap covariance matrix is the entry of $C$ divided by the square root of the product of the number of times data points $i$ and $j$ appear in the bootstrap sample, i.e., $C_{ij}/\sqrt{k_i k_j}$, where $k_i$ is the count for point $i$. The parameter estimate for that bootstrap sample is then the solution of the generalized least squares problem using this adjusted covariance matrix.
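As one concrete reading of the proposal, here is a minimal sketch of a single bootstrap replicate in the linear case, assuming a design matrix `X` ($n \times p$), response `y`, and known covariance `C`; the function names and structure are illustrative, not part of the question:

```python
import numpy as np

def gls_estimate(X, y, C):
    """GLS solution: beta = (X' C^{-1} X)^{-1} X' C^{-1} y."""
    Ci = np.linalg.inv(C)
    return np.linalg.solve(X.T @ Ci @ X, X.T @ Ci @ y)

def one_bootstrap_replicate(X, y, C, rng):
    n = X.shape[0]
    # Draw n data point indices with replacement, equal probability.
    draws = rng.integers(0, n, size=n)
    counts = np.bincount(draws, minlength=n)
    keep = np.flatnonzero(counts)            # points appearing at least once
    k = counts[keep].astype(float)
    # Entry (i, j) of the bootstrap covariance is C_ij / sqrt(k_i * k_j);
    # rows/columns of unselected points are dropped.
    C_b = C[np.ix_(keep, keep)] / np.sqrt(np.outer(k, k))
    # GLS on the retained (unique) points with the adjusted covariance.
    return gls_estimate(X[keep], y[keep], C_b)
```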

The parameter estimates from all the bootstrap samples are fed into an unweighted (i.e., equally weighted) sample covariance estimate (or some other covariance estimator). For the empirical likelihood route, the parameter estimate from each bootstrap sample is treated as an i.i.d. (equally weighted) data point in the empirical likelihood calculation.
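Continuing the sketch above, the replicates would then be pooled with equal weight (`X`, `y`, `C` are the placeholders from before):

```python
rng = np.random.default_rng(0)
B = 2000
betas = np.array([one_bootstrap_replicate(X, y, C, rng) for _ in range(B)])
# Unweighted sample covariance of the bootstrap parameter estimates.
cov_beta = np.cov(betas, rowvar=False)
# `betas` (one row per replicate) can also be passed, equally weighted,
# to an empirical likelihood routine.
```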

Discussion of Proposed Bootstrap Method:

If $C$ were a scalar multiple of the identity matrix, i.e., the errors were i.i.d., I believe the method described is well founded. But they are not i.i.d. I believe it is appropriate not to weight the selection probabilities used to generate the bootstrap samples, nor to weight the parameter estimates across bootstrap samples, because the covariance matrix used in the least squares calculation on each bootstrap sample already weights the data per the inverse of the covariance matrix. A data point with half the variance of another is in effect equivalent to appearing twice as often in the bootstrap sample, if equal weighting (covariance) were used instead in the least squares calculation. So I think any weighting is already accounted for in the covariance matrix, and any (unequal) weighting in the selection probabilities, in the parameter covariance calculation, or in the empirical likelihood would be "double counting".

But is there perhaps an "injustice" caused by the non-zero covariance across data points, so that, because of this covariance, data points are in effect partially sampled into bootstrap samples, even though nominally a point is either in a bootstrap sample or not? It seems reasonable to assume that in the real world any of the data points, whatever its variance, and whether or not it is correlated with other data points, would be equally likely to "occur" (appear in the full, not bootstrap, sample).

Generalization to Vector Data Points (Multivariate Generalized Least Squares):

Everything is as above, but this is now multivariate least squares, i.e., the LHS data point values are $m \times 1$ vectors. The covariance matrix $C$ is now $mn \times mn$, and might be completely general, other than being symmetric positive semidefinite. Vector data points are randomly selected with replacement for the bootstrap sample on an equally likely, per-vector-data-point basis (a vector data point is either in or not in the bootstrap sample, not partially in). The covariance matrix for a bootstrap sample is adjusted as before, except that it is now done on a block basis, with one $m \times m$ block per pair of vector data points: all entries of block $(i,j)$ receive the same adjustment factor, namely division by the square root of the product of the number of times vector data points $i$ and $j$ appear in the bootstrap sample. Everything else is as described before, with the same rationale.
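A sketch of the block version of the covariance adjustment, assuming the $mn \times mn$ matrix is ordered so that rows and columns $mi, \ldots, mi+m-1$ belong to vector data point $i$ (the helper name is mine):

```python
import numpy as np

def block_bootstrap_covariance(C, counts, m):
    """Drop blocks for unselected vector data points and scale each retained
    m-by-m block (i, j) of C by 1 / sqrt(k_i * k_j)."""
    keep = np.flatnonzero(counts)
    k = counts[keep].astype(float)
    # Per-point factors 1/sqrt(k_i), expanded to one factor per row of each block.
    scale = np.repeat(1.0 / np.sqrt(k), m)
    rows = np.concatenate([np.arange(i * m, (i + 1) * m) for i in keep])
    return C[np.ix_(rows, rows)] * np.outer(scale, scale), keep
```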

Generalization to Other than Least Squares:

Instead of generalized least squares, some other loss function, and possibly regularization term(s), might be used, but with the covariance matrix still used to "weight" as in generalized least squares (as shown in the answer to "Is there a 'generalized least norm' equivalent to generalized least squares?"). That is, the loss is a function of the $mn$ raw residuals after adjustment: the adjusted residuals are the product of the inverse of the Cholesky factor of the covariance matrix and the raw residuals, so that the covariance of the adjusted residuals is the identity matrix.
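In code, that adjustment amounts to whitening by the Cholesky factor: if $C = LL^\top$, then $\tilde{r} = L^{-1} r$ has identity covariance. A minimal sketch (the function name is mine):

```python
import numpy as np

def whiten_residuals(residuals, C):
    """Return L^{-1} r where C = L L^T (L lower triangular), so the
    adjusted residuals have identity covariance."""
    L = np.linalg.cholesky(C)
    return np.linalg.solve(L, residuals)
```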

Alternative Method Based on Randomly Sampling Adjusted Residuals:

Adjust the residuals, as discussed in the preceding paragraph, so that the adjusted residuals are i.i.d. Randomly sample the adjusted residuals with replacement, with equal probability, to form a bootstrap sample, and perform ordinary least squares on that bootstrap sample. I think this might pervert the vector nature of the data points: the adjusted residuals do not even correspond to particular vector data points, so they cannot be selected for the bootstrap sample on a per-vector basis. Overall, despite the apparent attractiveness of drawing bootstrap samples from an i.i.d. sample, I'm not "buying" this method, at least in the vector data point case, which is really the only case I care about. In that case, in the real world, a vector data point is either in or not in the data set; there are no data points with partial vectors.
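For completeness, the alternative just described would look roughly like the following in the linear case; note that it resamples scalar adjusted residuals, which is exactly the loss of vector structure objected to above (a sketch under my own assumptions, not an endorsement):

```python
import numpy as np

def residual_bootstrap_replicate(X, y, C, beta_hat, rng):
    # Whiten the model: OLS on (Xw, yw) is equivalent to GLS on (X, y).
    L = np.linalg.cholesky(C)
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    r = yw - Xw @ beta_hat                               # adjusted residuals, ~ i.i.d.
    r_star = rng.choice(r, size=r.size, replace=True)    # resample with equal probability
    y_star = Xw @ beta_hat + r_star                      # rebuild a bootstrap response
    return np.linalg.lstsq(Xw, y_star, rcond=None)[0]    # ordinary least squares refit
```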

Thoughts?

Best Answer

As long as your data set is not infinitely large, a bootstrapping approach would likely entail overlaying alternative realizations of biased parameter values and standard errors on top of one another. Shrinkage approaches that attempt to nudge the off-diagonals in the upper (lower) triangle toward the diagonal may help, but the fix might best hinge on corrections to the eigendensity through either the Marchenko-Pastur law or the Tracy-Widom law (see my random matrix theory lecture). The issues will be central to the eigendensity of the Hessian, which is inverted for updating coefficients as well as for obtaining standard errors after convergence. I believe the issues inherent in what you are proposing make it potentially inadmissible, as they would scale with sample size.
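As an illustration of the first suggestion only (not code from this answer), a simple linear shrinkage pulls the off-diagonal entries of a covariance matrix toward the diagonal target; the amount `lam` would have to be chosen by some criterion:

```python
import numpy as np

def shrink_covariance(C, lam):
    """Linear shrinkage toward the diagonal: lam = 0 leaves C unchanged,
    lam = 1 keeps only the diagonal."""
    target = np.diag(np.diag(C))
    return (1.0 - lam) * C + lam * target
```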