Your "assume also" clause equates two quadratic forms on $\mathbb{R}^n$ (with $\mathrm{y}=(y_1,y_2,\ldots,y_n)$ the variable). Since a quadratic form is completely determined by its values at $1+n+\binom{n+1}{2}$ points in general position, agreement at every point of $\mathbb{R}^n$ is far more than enough to conclude the two forms are identical, whence their coefficients must be equal.
The coefficients of $y_1^2$ are $1/\sigma^2$ and $1/\nu^2$, whence $\sigma=\pm \nu$. We always stipulate that $\sigma$ and $\nu$ are nonnegative, implying $\sigma=\nu$. (The "real" parameter should be considered to be $\sigma^2$ or $1/\sigma^2$ rather than $\sigma$ itself.)
The linear terms in $y_i$ are both proportional to $b_0+b_1 x_i = a_0 + a_1 x_i$. Letting $\mathrm{1} = (1,1,\ldots, 1)$ and $\mathrm{x} = (x_1, x_2, \ldots, x_n)$, we conclude
$$(a_0 - b_0)\mathrm{1} + (a_1 - b_1)\mathrm{x} = \mathrm{0}.$$
Thus either

1) $\mathrm{1}$ and $\mathrm{x}$ are linearly independent, which by definition implies both $a_0 = b_0$ and $a_1 = b_1$; or
2) $\mathrm{1}$ and $\mathrm{x}$ are linearly dependent, which means $x_1 = x_2 = \cdots = x_n = x$, say. In that case:
   - If $x \ne 0$, then $a_0 - b_0 = (a_1 - b_1) x$ determines one of $(a_0, a_1, b_0, b_1)$ in terms of the other three.
   - If $x = 0$, then $a_0 = b_0$, while $a_1$ and $b_1$ may take any values.
In case (1) all parameters are uniquely determined: this is the identifiable model. In case (2) $\sigma = \nu$ is always identifiable, but only certain linear combinations of $(a_0,a_1,b_0,b_1)$ can be identified.
Evidently, linear independence of $\mathrm{x}$ and $\mathrm{1}$ is both necessary and sufficient for identifiability.
This criterion easily generalizes to multiple regression, where the ordinary least squares model is identifiable if and only if the design matrix $X$ (whose columns are formed from $\mathrm{1}, \mathrm{x}$, and any other variables in any order) has full rank: that is, there is no linear dependence among its columns.
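The rank criterion is easy to check numerically. Here is a minimal sketch (the data are made up for illustration) that tests identifiability by comparing the rank of the design matrix $X = [\mathrm{1}, \mathrm{x}]$ to its number of columns:

```python
import numpy as np

# Hypothetical example: two candidate designs for n = 5 observations.
x_varied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # x takes several values
x_const = np.full(5, 3.0)                       # x1 = x2 = ... = xn = 3

def is_identifiable(x):
    """The intercept-and-slope model is identifiable iff the design
    matrix [1, x] has full column rank, i.e. 1 and x are independent."""
    X = np.column_stack([np.ones_like(x), x])
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

print(is_identifiable(x_varied))  # True: 1 and x are independent
print(is_identifiable(x_const))   # False: the columns are proportional
```

The same check works for multiple regression: stack all predictor columns into `X` and compare `np.linalg.matrix_rank(X)` with the number of columns.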
It's not clear whether you want estimates of height for each individual man and woman (more of a classification problem) or to characterize the distribution of heights of each sex. I will assume the latter. You also do not specify what additional information you are using in your model, so I will confine myself to addressing the case where you only have height data (and sex data, in the case of non-US citizens).
I recommend simply fitting a mixture of distributions to the height data from the US only, because the distributions of height in men and women are reasonably different. This estimates the parameters of two distributions that, when summed together, best describe the variation in the data. The parameters of these distributions (mean and variance, since a Gaussian distribution should work fine) give you the information you are after. The R packages mixtools and mixdist let you do this; I'm sure there are many more as well.
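For readers working in Python rather than R, a comparable sketch uses scikit-learn's `GaussianMixture`. The heights below are synthetic stand-ins generated from assumed (not real) population parameters, just to show the workflow:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for a US height sample in cm; the component
# parameters (178/164, sd 7) are illustrative assumptions only.
heights = np.concatenate([
    rng.normal(178, 7, 5000),   # "men"
    rng.normal(164, 7, 5000),   # "women"
]).reshape(-1, 1)

# Fit a two-component Gaussian mixture to the pooled, unlabeled data.
gm = GaussianMixture(n_components=2, random_state=0).fit(heights)
means = sorted(gm.means_.ravel())
print(means)  # the two recovered component means
```

With well-separated components like these, the fitted means should land close to the values used to simulate the data, which is exactly the "parameters of two distributions" output described above.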
This solution may seem odd, because it leaves out all the information you have from outside the US, where you know the sex and height of each individual. But I think it is justified because:
1) We have a very strong prior expectation that men are on average taller than women. Wikipedia's List of average human height worldwide shows not even one country or region where women are taller than men. So the identity of the distribution with the greater mean height is not really in doubt.
2) Integrating more specific information from the non-US data will likely involve assuming that the covariance between sex and height is the same outside the US as inside. But this is not entirely true: the same Wikipedia list indicates that the ratio of male to female heights varies between approximately 1.04 and 1.13.
3) Your international data may be much more complicated to analyse because people in different countries have wide variation in height distributions as well. You may therefore need to consider modelling mixtures of mixtures of distributions. This may also be true in the US, but it is likely to be less of a problem than a dataset that includes the Dutch (mean height: 184 cm) and Indonesians (mean height: 158 cm). And those are country-level averages; subpopulations differ to an even greater degree.
I think the question is whether you want to solve the least-squares problem globally using your $k$ distributed nodes, or to impose the constraint that each node first solve for itself, after which all local solutions are combined into a single one.
If the situation is the former (i.e., you want a distributed algorithm for globally solving least squares), you can find a survey in Distributed Least-Squares Iterative Methods in Networks: A Survey. Some of the methods are intuitively very clear. For example, the iterative least-squares algorithms (e.g., those based on conjugate gradient methods), when translated into actual math, perform vector-vector and matrix-vector products. These operations are known to have efficient distributed implementations, so it is natural that distributed least-squares algorithms built on them exist.
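One especially simple global scheme, sketched below under the assumption that the data rows are partitioned across nodes, is to have each node compute only its local products $X_i^\top X_i$ and $X_i^\top y_i$; summing these across nodes and solving the normal equations centrally recovers the exact global solution. The "nodes" here are simulated in one process:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=300)

# Partition the rows across k = 3 hypothetical nodes.
parts = np.array_split(np.arange(300), 3)

# Each node computes only its local Gram products; only these small
# summaries (not the raw data) need to be communicated and summed.
XtX = sum(X[idx].T @ X[idx] for idx in parts)
Xty = sum(X[idx].T @ y[idx] for idx in parts)

beta_dist = np.linalg.solve(XtX, Xty)               # combined solution
beta_global = np.linalg.lstsq(X, y, rcond=None)[0]  # reference global fit
print(np.allclose(beta_dist, beta_global))  # True: the two fits agree
```

This works because $X^\top X$ and $X^\top y$ decompose as sums over row blocks, so the combination step is exact rather than approximate.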
If the situation is the latter (i.e., you want a system where each node solves a model locally, and the local models are then combined), then again there are two cases, and in neither can I think of a solution that would yield exactly the same result as the one obtained globally:
If you would like to use linear estimators, then hierarchical linear modeling seems a reasonable approach.
You might find that using random forests yields good results. Random forests work by averaging multiple trees, each seeing a "different version" of the data. Your local nodes already do that.
In any case, I don't think there is a simple formula that will solve your problem. If you actually need this, you might consider using a framework that already does distributed machine learning, e.g., Spark MLlib.