From what you've told me, I would almost certainly predict that if you (1) ran a standard two-stage least squares estimator and (2) tried to compute robust standard errors, then some of the standard errors for your coefficients would either be reported as zero or correctly reported as blank/uncomputable.
My guess at what is happening is that some of your coefficients are exactly identified, hence a heteroskedasticity-robust covariance matrix cannot be computed for those coefficients. Go to invert that to compute your optimal weighting matrix and Matlab will tell you that the matrix is singular or near singular.
Some background
From what you're telling me, your Z matrix is full column rank. What you're also telling me is that Z .* U is less than full column rank (where U = u * ones(1, L) and u is your residuals). That is:
$$ \underbrace{Z = \left[\begin{array}{ccc} z_{1,1} & z_{1,2} & z_{1,3} \\ z_{2,1} & z_{2,2} & z_{2,3} \\ \ldots & \ldots & \ldots \end{array} \right]}_{\text{full column rank}} \quad \quad \underbrace{A = \left[\begin{array}{ccc} z_{1,1}u_1 & z_{1,2}u_1 & z_{1,3}u_1 \\ z_{2,1}u_2 & z_{2,2}u_2 & z_{2,3}u_2 \\ \ldots & \ldots & \ldots \end{array} \right]}_{\text{rank deficient?}}$$
The squared singular values of the matrix $A$ on the right are the singular values of $Z'\Omega Z = A'A$ (where $\Omega = \operatorname{diag}(u_1^2, \ldots, u_N^2)$); that is, $Z'\Omega Z$ has zero (or near-zero) singular values if and only if $A$ does. (Basically: effective rank = number of singular values above some threshold.)
Something I would do is compute singular value decompositions of both $Z$ and $A$ (or of Z'*Z and A'*A, which amounts to the same thing for these purposes). Singular values near zero indicate a matrix that, from a numerical linear algebra standpoint, is nearly rank deficient. Does $Z$ already have problems without Matlab spitting out an error? Or do the singular values only go to zero (or near zero) once you multiply by the residuals $u_i$?
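A minimal sketch of that check, assuming Z is your N-by-L instrument matrix and u is the N-by-1 residual vector from your first step (placeholder names matching the notation above):
A = Z .* (u * ones(1, size(Z, 2)));   % each row of Z scaled by its residual
sv_Z = svd(Z);                        % singular values of Z
sv_A = svd(A);                        % singular values of A
% Ratios of smallest to largest singular value near machine precision (eps)
% signal effective rank deficiency.
disp([sv_Z / sv_Z(1), sv_A / sv_A(1)])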
An educated guess at what might be happening...
Imagine you have 3 observations:
$$ Z = \left[ \begin{array}{cc} 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{array} \right]$$
Z is full column rank because of observation 1. But if the residual $u_1$ for observation 1 is zero...
$$ A = \left[\begin{array}{cc} z_{11}u_1 & z_{12}u_1 \\ z_{21} u_2 & z_{22} u_2 \\ z_{31}u_3 &z_{32}u_3 \end{array}\right] = \left[\begin{array}{cc} 0 & 0 \\ u_2 & u_2 \\ u_3 & u_3 \end{array}\right]$$
Then $A$ is column rank 1 and $Z'\Omega Z = A'A$ will be rank deficient.
Intuitively, what would be happening is that if some of your initial b estimates are exactly identified, the estimated residual is zero and you can't form an estimate of the variance or covariance! There are 0 leftover degrees of freedom. Mechanically, what gets pumped out is an estimate of zero; then you go to invert that, and Matlab blows up by telling you the matrix is singular or close to singular.
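A tiny, standalone numerical illustration of the three-observation example above (the nonzero residual values are made up):
Z = [1 0; 1 1; 1 1];
u = [0; 0.5; -0.3];          % residual for observation 1 is exactly zero
A = Z .* (u * ones(1, 2));
rank(Z)                      % 2: full column rank
rank(A)                      % 1: rank deficient
svd(A' * A)                  % one singular value is exactly zero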
For reference, some MATLAB code I've written/tested to do what I imagine you're trying to do:
Warning, this code does NOT check for rank deficiency. Let endo be your endogenous regressors, exo be your exogenous regressors (the constant ones(N, 1) goes here), and instruments be your instruments. Initial setup:
X = [endo, exo]; % N x (L + K)
Z = [exo, instruments]; % N x (K + L2)
Calculate your basic ingredients so you don't need to repeat these expensive operations:
ZprimX = Z' * X;
Zprimy = Z' * y;
ZprimZ = Z' * Z;
Get your initial estimates (you will use these for the weighting matrix):
%%% Method 1 (intuitive):
% stage1_estimate = Z \ endo;
% endo_hat = Z * stage1_estimate;
% b_2sls = [endo_hat, exo] \ y;
%%% Method 2: (slick)
b_2sls = (Z * (ZprimZ \ ZprimX)) \ y;
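To see why the "slick" line gives the same estimator: ZprimZ \ ZprimX is $(Z'Z)^{-1}Z'X$, so Z * (ZprimZ \ ZprimX) is the first-stage fitted matrix $\hat X = P_Z X$ with $P_Z = Z(Z'Z)^{-1}Z'$, and the outer backslash then regresses $y$ on $\hat X$:
$$\hat\beta_{2SLS} = (\hat X'\hat X)^{-1}\hat X' y = (X'P_Z X)^{-1}X'P_Z y$$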
Get residuals:
u_2sls = y - X * b_2sls;
Calculate the weighting matrix inverse:
% Basic version
% W_inv = (u_2sls' * u_2sls / N) * ZprimZ / N;
% Robust version
temp = (u_2sls * ones(1, K + L2)) .* Z;
W_inv = temp' * temp / N;
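In matrix form, the robust version is the sample analogue of $E(u_i^2 z_i z_i')$, i.e. exactly the $Z'\Omega Z$ object from the first section:
$$\widehat W^{-1} = \frac{1}{N}\sum_{i=1}^N u_i^2\, z_i z_i' = \frac{1}{N} Z'\widehat\Omega Z, \qquad \widehat\Omega = \operatorname{diag}(u_1^2,\ldots,u_N^2)$$
which is what temp' * temp / N computes.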
Calculate your estimate b:
%%% Method 1 (intuitive)
% W = inv(W_inv);
% A = ZprimX' * W * ZprimX;
% c = ZprimX' * W * Zprimy; % = (Zprimy' * W * ZprimX)'
% results.b = inv(A) * c;
%%% METHOD 2 (slick):
% This is a way without matrix inversion to do it.
% Solve linear systems rather than invert W_inv
temp = W_inv \ ZprimX; % Solve inv(W_inv) * ZprimX = W * ZprimX
A = ZprimX' * temp;
c = (Zprimy' * temp)';
results.b = A \ c;
results.b_endo = results.b(1:L);
results.b_exo = results.b(L+1:end);
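The snippet above stops at the point estimate. Since your question is about the standard errors, here is a minimal sketch of the robust covariance step under the usual efficient-GMM asymptotics, reusing the objects defined above; this is exactly where a singular or near-singular W_inv will surface as a warning or as Inf/NaN entries:
G = ZprimX / N;                   % sample Jacobian of the moment conditions
V = inv(G' * (W_inv \ G)) / N;    % estimated covariance of results.b
results.se = sqrt(diag(V));       % robust standard errors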
Best Answer
The least squares estimator in the classical linear regression model is a Method of Moments estimator.
The model is
$$\mathbf y = \mathbf X\beta + \mathbf u$$
Instead of minimizing the sum of squared residuals, we can obtain the OLS estimator by noting that under the assumptions of the specific model, it holds that (the "orthogonality condition")
$$E(\mathbf X' \mathbf u)= \mathbf 0$$
$$\implies E(\mathbf X'( \mathbf y - \mathbf X\beta))=\mathbf 0 \implies E(\mathbf X'\mathbf y)=E(\mathbf X'\mathbf X)\beta$$
$$\implies \beta = \left[E(\mathbf X'\mathbf X)\right]^{-1}E(\mathbf X'\mathbf y)$$
So if we knew the true expected values (and our assumptions were correct), we could calculate the true value of the unknown coefficient.
With non-experimental data, we don't. But we know that if our sample is ergodic-stationary (and i.i.d. samples are), then expected values are consistently estimated by their sample analogues, the corresponding sample means. Hence we have an "acceptable" estimator in
$$\hat \beta = \left[(1/n)\mathbf X'\mathbf X\right]^{-1}(1/n)\mathbf X'\mathbf y = (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf y $$
which is the same estimator we will obtain if we minimize the sum of squared residuals.
If you reverse the calculations, noting that the residuals are a function of $\hat \beta$, i.e. $\mathbf {\hat u} = \mathbf {\hat u}(\hat \beta)$, you will find that $\mathbf X' \mathbf {\hat u} (\hat \beta) = \mathbf 0$. Divide by $n$ for this to look like a sample mean.
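A quick numerical illustration of this (the regressors and outcome here are made-up placeholders):
rng(0);
X = [ones(100, 1), randn(100, 2)];   % made-up regressors with a constant
y = randn(100, 1);                   % made-up outcome
b_hat = X \ y;                       % OLS estimate
u_hat = y - X * b_hat;               % residuals
disp(X' * u_hat / 100)               % sample analogue of E(X'u): numerically zero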
So "we choose those estimates that make the sample obey what we assumed the population obeys". And we do that because we accept that the sample is representative of the population, so it should "behave like" the population (as we assumed the latter to behave).
As for GMM, each orthogonality condition is an equation. If you have $m$ equations and $k<m$ unknown coefficients, then the system of equations is "over-identified" and no exact solution exists; there is nothing more to it.
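In that case the standard GMM move is to get the sample moment vector as close to zero as possible in a weighted-norm sense, rather than exactly to zero:
$$\hat\beta_{GMM} = \arg\min_{\beta}\; \bar{\mathbf g}_n(\beta)'\,\mathbf W_n\,\bar{\mathbf g}_n(\beta), \qquad \bar{\mathbf g}_n(\beta) = \frac{1}{n}\sum_{i=1}^n \mathbf z_i(y_i - \mathbf x_i'\beta)$$
for some positive definite weighting matrix $\mathbf W_n$; the efficient choice of $\mathbf W_n$ is a consistent estimate of $\left[E(u_i^2\mathbf z_i\mathbf z_i')\right]^{-1}$, which is exactly the object being inverted in the code in the first part of this page.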
Hansen contrasted this with the original use of the "Method of Moments": if a distribution is characterized by, say, three unknown parameters, the MoM tactic is to estimate, using the sample, the first three moments of the distribution (which appear in equations involving these parameters), and so obtain an exactly identified system of equations. See how this works in this answer, as an example.