If you compare the standard errors of the OLS coefficients with the White correction versus the ML estimates with the variance estimated by the sandwich estimator, which standard errors do you expect to be bigger? And is it correct that both are robust to heteroskedasticity?
Solved – Robust OLS versus ML with sandwich estimator
least squares, maximum likelihood, robust, sandwich, standard error
Related Solutions
One way is to run a linear regression and then apply the robust variance estimator on top of it, to guard against misleading standard errors.
An important point here is that having pockets of correlated data does not bias your estimates in a linear model - it results in inflated standard errors. In a non-linear model (e.g. logistic regression), you can get biased estimates, since the population average effect is, in general, different from the individual-specific effect, which is not the case with a linear model. More information on this distinction is in this answer.
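As a quick illustration of that first point, here is a minimal simulation sketch in Python (the data-generating setup is hypothetical): the slope of a linear model stays unbiased under within-cluster correlation, while the naive i.i.d. standard error understates its true sampling variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, per_cluster, reps = 30, 10, 2000
m = n_clusters * per_cluster
slopes, naive_ses = [], []
for _ in range(reps):
    g = np.repeat(np.arange(n_clusters), per_cluster)
    x = rng.normal(size=n_clusters)[g]            # cluster-level regressor
    u = rng.normal(size=n_clusters)[g] + rng.normal(size=m)  # clustered errors
    y = 1.0 + 2.0 * x + u                         # true slope = 2
    X = np.column_stack([np.ones(m), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (m - 2)                  # usual i.i.d. error variance
    naive_ses.append(np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1]))
    slopes.append(b[1])

print(np.mean(slopes))     # ~2: within-cluster correlation does not bias beta
print(np.std(slopes))      # true sampling SD of the slope...
print(np.mean(naive_ses))  # ...which the naive i.i.d. standard error understates
```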
Can we take the clustering effect into account with the sandwich estimator?
From the title, I assume you're talking about using Huber-White sandwich standard errors for your confidence intervals and $p$-values. These do impose a diagonal covariance matrix, but they are robust to the diagonal entries possibly being different - for that reason they were originally used when there is possible heteroskedasticity in your errors, meaning that the error variance is non-constant. However, a slight modification of the Huber-White standard errors, in which the "meat" of the sandwich is replaced with an empirical estimate of the covariance matrix within a cluster (still called Huber-White standard errors), provides inference that is robust to non-independence within a cluster (but not between clusters!). This modification is described pretty clearly in a 2006 paper in The American Statistician called On The So-Called “Huber Sandwich Estimator” and “Robust Standard Errors” by David Freedman.
This procedure is robust to non-independence within a cluster in the sense that it will still give you asymptotically valid inference (i.e. the confidence levels and $p$-values will be right) even if there is correlation within a cluster. I suspect this is what your code labeled 'Empirical Estimator' is doing.
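For concreteness, here is a minimal statsmodels sketch on simulated (hypothetical) data, contrasting the plain White/HC0 sandwich with its cluster-robust modification:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_clusters, per_cluster = 50, 10
cluster = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=cluster.size)
u = rng.normal(size=n_clusters)[cluster] + rng.normal(size=cluster.size)
y = 1.0 + 2.0 * x + u                    # errors correlated within clusters

res = sm.OLS(y, sm.add_constant(x)).fit()
white = res.get_robustcov_results(cov_type="HC0")      # diagonal meat
clustered = res.get_robustcov_results(cov_type="cluster",
                                      groups=cluster)  # within-cluster meat
print(white.bse)      # robust to heteroskedasticity only
print(clustered.bse)  # robust to within-cluster correlation as well
```

With a cluster-level error component like this, the cluster-robust standard errors will typically come out noticeably larger than the HC0 ones.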
I've fitted two separate GEE models, one with an exchangeable working covariance matrix and the other with the robust variance estimator (also known as the Huber-White, sandwich, or empirical variance estimator). The point is that under both models I get the same estimated variance for each covariate, but my exchangeable GEE leads to much larger beta estimates that are also statistically significant, whereas the corresponding betas are not significant in the GEE with the robust variance estimator. I wonder why this happens?
In general, the GEE model solves the equation
$$ \sum_{i=1}^{n} \left( \frac{ \partial \mu_i }{ \partial \beta } \right)' V(\alpha)^{-1} (y_i - \mu_i) = 0$$
as a function of the regression coefficients, $\beta$, where $\mu_i = x_i \beta$ is the expected value of the cluster-$i$ response, $y_i$, given the predictors $x_i$ under the specified model, and $V(\alpha)$ is the "working" covariance matrix of the elements of cluster $i$. (Note that $\mu_i=x_i \beta$ because we're dealing with a linear model, but GEEs can more generally use a link function so that $\mu_i=g(x_i \beta)$.)
A key point here is that when you change the working covariance, you change the estimating equation, and therefore the $\beta$ that solves it will be different. For example, if $V$ were $\sigma^2$ down the diagonal and $0$ off the diagonal, with $\mu_i = x_i \beta$ as here, then the GEE estimator is the least squares estimator, which will not, in general, solve the equation in the exchangeable case. So it is no surprise that you're getting different parameter estimates; it may be a coincidence that you're getting the same standard errors.
In your situation, I'd suggest reporting the results that used the exchangeable covariance matrix. While GEE-based inference is consistent even when you've misspecified the correlation structure, it is known that GEE estimators are more efficient when you use a more appropriate covariance structure, and if you have evidence that there are large intra-class correlations within a school, then the exchangeable correlation will probably provide a much closer approximation to the true association structure.
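To see this in practice, here is a minimal sketch using statsmodels' GEE on simulated clustered data (the names and setup are hypothetical): the same mean model is fit under an independence and an exchangeable working correlation, with robust (sandwich) standard errors reported in both cases.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_groups, per_group = 40, 8
g = np.repeat(np.arange(n_groups), per_group)
x = rng.normal(size=g.size)
y = 0.5 + 1.5 * x + rng.normal(size=n_groups)[g] + rng.normal(size=g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

for name, cs in [("independence", sm.cov_struct.Independence()),
                 ("exchangeable", sm.cov_struct.Exchangeable())]:
    fit = sm.GEE.from_formula("y ~ x", groups="g", data=df, cov_struct=cs).fit()
    # Point estimates solve different estimating equations; the reported
    # standard errors are robust (sandwich) in both fits.
    print(name, fit.params.values.round(3), fit.bse.values.round(3))
```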
Best Answer
In Stata's User Guide it is stated that Stata uses the "White" formula for the heteroskedasticity-robust variance-covariance matrix of the estimator. Then, at least in the context of the Normal Linear Regression Model $$y_i = \mathbf x_i'\beta +u_i$$ we should obtain the exact same results using either OLS or ML.
With observations and errors i.i.d. (hence homoskedastic also), and all the other nice assumptions, the (full) asymptotic variance-covariance matrix of the ML estimator is
$$\operatorname {Avar}\left[\sqrt n(\hat \beta_{ML}-\beta)\right] = \Big[E(H_i)\Big]^{-1}E(s_is_i')\Big[E(H_i)\Big]^{-1}$$
Where $H_i$ is the Hessian matrix (2nd derivative of log-likelihood) for the $i$-th observation,
$$H_i = -\frac1{\sigma^2}\mathbf x_i\mathbf x_i'$$
and $s_i$ is the score vector (1st derivative of the log-likelihood) for the $i$-th observation,
$$s_i = \frac1{\sigma^2}u_i\mathbf x_i$$ Under homoskedasticity, the information matrix equality holds, $E(s_is_i')=-E(H_i)$ and so the expression simplifies to
$$\operatorname {Avar}\left[\sqrt n(\hat \beta_{ML}-\beta)\right] = -\Big[E(H_i)\Big]^{-1} = \Big[E(s_is_i')\Big]^{-1}$$
two equalities that provide the two alternative estimator-variance formulas we use in ML estimation (using sample means instead of expected values). Note that in finite samples these two estimates do not give identical results, but for large samples they should be "close".
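As a sketch of this point, the following numpy snippet (with simulated, hypothetical data) computes both estimates of $\operatorname{Avar}\left[\sqrt n(\hat \beta_{ML}-\beta)\right]$ for the normal linear model, one from the Hessian and one from the outer product of the scores:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # homoskedastic errors

b = np.linalg.solve(X.T @ X, X.T @ y)   # ML = OLS for the coefficients
u = y - X @ b
sig2 = u @ u / n                        # ML estimate of sigma^2

neg_H = (X.T @ X / n) / sig2            # -(1/n) sum H_i
opg = (X.T * u**2) @ X / n / sig2**2    # (1/n) sum s_i s_i'

print(np.diag(np.linalg.inv(neg_H)))    # Hessian-based estimate
print(np.diag(np.linalg.inv(opg)))      # score-based (OPG) estimate: close,
                                        # but not identical, in finite samples
```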
But let's say we suspect the existence of heteroskedasticity and we want to account for it, not by attempting to specify a functional form for it (as is the older approach) but by calculating only a "heteroskedasticity-robust" variance-covariance matrix for our tests and inference, obtaining the coefficient estimates themselves in the usual way.
In such a case, and in order to bypass the problem of estimating $n$ different variances, we use the following result: at least for some forms of misspecification (heteroskedasticity included), specifying the wrong log-likelihood (in our case, ignoring the heteroskedasticity) still gives us a consistent ML estimator, the "Quasi-MLE". Its asymptotic variance-covariance matrix is consistently estimated by the full version of the theoretical asymptotic variance of the misspecified model, i.e. we no longer invoke the "information matrix equality" that previously simplified matters. In other words,
$$\operatorname {\hat Avar}_{Robust}\left[\sqrt n(\hat \beta_{ML}-\beta)\right] \\= \Big[\frac 1n\sum_{i=1}^n\hat H_i\Big]^{-1}\left(\frac 1n\sum_{i=1}^n(\hat s_i \hat s_i')\right)\Big[\frac 1n\sum_{i=1}^n\hat H_i\Big]^{-1}$$
$$=n\cdot \Big[\frac 1{\hat \sigma^2}\sum_{i=1}^n\mathbf x_i\mathbf x_i'\Big]^{-1}\left(\frac 1{(\hat \sigma^2)^2}\sum_{i=1}^n\hat u_i^2 \mathbf x_i\mathbf x_i'\right)\Big[\frac 1{\hat \sigma^2}\sum_{i=1}^n\mathbf x_i\mathbf x_i'\Big]^{-1}$$
$$=n\cdot \left(\mathbf X'\mathbf X\right)^{-1}\left(\sum_{i=1}^n\hat u_i^2 \mathbf x_i\mathbf x_i'\right)\left(\mathbf X'\mathbf X\right)^{-1}$$
Since in the Normal linear regression model the ML estimator coincides with the OLS estimator for the coefficients, the residual series will be identical, so the above expression is numerically equal to the heteroskedasticity-robust variance-covariance matrix of the (centered and scaled) OLS estimator.
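A quick numerical check of this equivalence, as a sketch on simulated (hypothetical) data - the $\hat\sigma^2$ terms cancel between the "bread" and the "meat", leaving exactly White's matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u_true = rng.normal(size=n) * (1 + np.abs(X[:, 1]))   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u_true

b = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ b
sig2 = u @ u / n

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X.T * u**2) @ X                         # sum u_i^2 x_i x_i'
V_white = XtX_inv @ meat @ XtX_inv              # White's HC0 for beta_hat

neg_H = (X.T @ X / n) / sig2                    # -(1/n) sum H_i
opg = meat / n / sig2**2                        # (1/n) sum s_i s_i'
V_qmle = np.linalg.inv(neg_H) @ opg @ np.linalg.inv(neg_H) / n
print(np.allclose(V_white, V_qmle))             # True: the sigma^2 terms cancel
```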
But the OP says in a comment that he obtains different results for the two cases with his software. This may be due to some "finite-sample" bias correction that creeps in in the one case but not in the other; such corrections are not part of the original "White" expression, but were proposed later by White himself and also by Davidson and MacKinnon, based on results from Monte Carlo simulations. Alternatively, it may be that the code implementing the above equation for the ML estimation does not let the variances cancel out: if it estimates $\hat \sigma^4$ directly, with a formula that does not equal the square of the estimated variance, the two occurrences will not exactly cancel (note that in the OLS framework, the variances are absent from the beginning).
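To illustrate the first possibility, here is a small statsmodels sketch (hypothetical simulated data) comparing the original White formula (HC0) with the common degrees-of-freedom correction HC1, which rescales HC0 by $n/(n-k)$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 30                                   # small sample: correction is visible
X = sm.add_constant(rng.normal(size=n))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))

res = sm.OLS(y, X).fit()
se_hc0 = res.get_robustcov_results("HC0").bse   # original White formula
se_hc1 = res.get_robustcov_results("HC1").bse   # rescaled by n/(n - k)
print(se_hc0)
print(se_hc1)
print(se_hc1 / se_hc0)   # constant factor sqrt(n/(n-k)) for every coefficient
```

In small samples this correction alone is enough to make two otherwise identical "White" implementations report different standard errors.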