You can include a "minimum" number of observations that you think you need to fit your model, and exclude n< this number from cross validation. Obviously, you can't fit a model using just the 1st sample, and you can't really fit a model using the 1st 2 samples. At some reasonable point (5? 10?) you'll have enough observations to fit a valid model, so start at that point.
If your function looked like:
$$a_i=\beta_i\lambda$$
or if you knew $E(x)$ perfectly, all you would need to do is rewrite your moment restrictions to include that restriction. That is:
$$
g_T(b)=E_T\left[\matrix{ y_{i,t} - \beta_{i}\lambda - \beta_{i}x_{i,t}\\
(y_{i,t}-\beta_{i}\lambda-\beta_{i}x_{i,t})x_{i,t}
}
\right]
$$
(actually I'd write this out in terms of individual specific moments if $\beta_i$ is truly an individual level variable, but that's a different point).
You then run this through your favorite optimization routine, with the weight matrix you described. The problem with:
$$a_i=\beta_i [\lambda-E(x)]$$
is that you probably do not know $E(x)$, so you have to account for sampling variation. What you should do is define a parameter $\mu=E(x)$, and rewrite your moments:
$$
g_T(b)=E_T\left[\matrix{ y_{i,t} - \beta_{i}[\lambda-\mu] - \beta_{i}x_{i,t}\\
(y_{i,t}-\beta_{i}[\lambda-\mu]-\beta_{i}x_{i,t})x_{i,t}\\
x_{i,t}-\mu
}
\right]$$
with a corresponding weight matrix. You can derive a weight matrix for this, or (my preferred option) use iterated GMM.
This is a just-identified estimator. You can come up with over-identified estimators for this problem by using first-differences if you want.
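A minimal sketch for a single asset, assuming simulated data with the structure above (all parameter values and the `scipy` routine are my choices, not from the original). Because the system is just-identified, minimizing $g_T(b)' W g_T(b)$ is equivalent to solving $g_T(b)=0$ directly, and the weight matrix is irrelevant at the solution:

```python
import numpy as np
from scipy.optimize import root

def g_T(b, y, x):
    """Stacked sample moment conditions for one asset; b = (beta, lam, mu)."""
    beta, lam, mu = b
    e = y - beta * (lam - mu) - beta * x  # pricing-equation residual
    return np.array([e.mean(),            # E_T[e] = 0
                     (e * x).mean(),       # E_T[e * x] = 0
                     (x - mu).mean()])     # E_T[x - mu] = 0

def fit_gmm(y, x, b0):
    # Just-identified: three moments, three parameters, so solve g_T(b) = 0.
    return root(lambda b: g_T(b, y, x), b0).x
```

With over-identifying moments (e.g. from first-differences) you would instead minimize the quadratic form with an estimated weight matrix, iterating if you like.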
Best Answer
I'd look at this as a least-squares minimization problem, so you're trying to minimize:
$$ \langle \epsilon^2 \rangle_t= \left\langle \Big( Y_t - \sum_i \beta_i X_{i,t-1}\Big)^2 \right\rangle_t = \langle (Y_t - \vec{\beta} \cdot \vec{X}_{t-1})^2 \rangle_t$$
I tend to interpret this type of problem as a Gaussian statistics problem, since the solution only involves first and second moments. The idea is that there is a Gaussian joint distribution $p(y, x_1, x_2, \dots)$ with an arbitrary correlation matrix; you estimate that correlation matrix and then compute the conditional distribution $p(y \vert X)=p(y,X)/p(X)$.
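To make the moment interpretation concrete, here is a minimal sketch (my own illustration, not from the original) computing the conditional-Gaussian / least-squares predictor purely from first and second moments, $\vec{\beta} = \mathrm{Cov}(X)^{-1}\mathrm{Cov}(X, y)$:

```python
import numpy as np

def linear_predictor(X, y):
    """Least-squares coefficients of E[y | X], built only from
    first and second moments:
      beta      = Cov(X)^{-1} Cov(X, y)
      intercept = E[y] - beta . E[X]"""
    Xc = X - X.mean(axis=0)                       # center the regressors
    yc = y - y.mean()                             # center the target
    beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)  # Cov(X)^{-1} Cov(X, y)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept
```

Nothing here requires the data to actually be Gaussian; the Gaussian view just explains why only two moments enter.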
In some contexts, this type of problem may be referred to as a Wiener Filtering problem.