Solved – Combining two distributed regression models

Tags: multiple regression, regression

I am facing a problem which I'm not sure has a specific answer – the problem of distributed regression model building: I have $k$ stations, each having the same data structure (that is, all of them have the same columns but, of course, different data) and each able to compute its own regression model. What I'm looking to do is build a unified regression model, which will have the same $\beta$ as if I had combined all the data together and then calculated the regression coefficients.

To simplify, let's say I want to build a regression model of a variable $y$ vs. a predictor matrix $X=(x_{sex}, x_{age})$. The problem is that my data is split into two chunks: $y_1$ vs. $X_1$ and $y_2$ vs. $X_2$.
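To make the setup concrete, here is a small numpy sketch (the data and coefficients are invented for illustration). It shows that naively averaging the two chunk-wise fits generally does not reproduce the pooled fit, which is why the question is non-trivial:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    # Ordinary least squares via the normal equations: beta = (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Two hypothetical chunks with the same columns (say, sex and age) but different rows
X1 = rng.normal(size=(50, 2)); y1 = X1 @ np.array([1.0, -2.0]) + rng.normal(size=50)
X2 = rng.normal(size=(30, 2)); y2 = X2 @ np.array([1.0, -2.0]) + rng.normal(size=30)

beta1 = ols(X1, y1)  # model from station 1
beta2 = ols(X2, y2)  # model from station 2
beta_all = ols(np.vstack([X1, X2]), np.concatenate([y1, y2]))  # pooled fit

# The naive average of the two local fits is generally NOT the pooled solution
print(np.allclose((beta1 + beta2) / 2, beta_all))
```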

What I can do:

  • compute the linear regression of $y_i$ vs. $X_i$

  • use the results of each regression (i.e., play with $\hat{\beta}^{(i)}$, the residuals, and so on)

What I cannot do:

  • combine the datasets

  • see the contents of the $X_i$ matrices or the $y_i$ vectors.

Any ideas? I'll greatly appreciate references to papers.

Best Answer

I think the question is whether you want to globally solve the least-squares problem using your $k$ distributed nodes, or rather impose the constraint that each node first solves for itself, and the solutions are then combined into a single one.

If the situation is the former (i.e., you want a distributed algorithm for globally solving least squares), then you can find a survey in Distributed Least-Squares Iterative Methods in Networks: A Survey. Some of the methods are intuitively very clear. For example, consider the iterative least-squares algorithms (e.g., those based on conjugate gradient methods): when translated into actual math, they perform vector-vector and matrix-vector products. These operations are known to have efficient distributed implementations, so it is natural that distributed least-squares algorithms built on them exist.
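As a minimal illustration of why distributing the matrix products works out, here is a numpy sketch, under the assumption that nodes are allowed to exchange their small $p \times p$ summaries $X_i^\top X_i$ and $X_i^\top y_i$ (never the raw rows): summing the summaries reproduces the global normal equations exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# k nodes, each holding a private chunk (X_i, y_i) with the same p = 3 columns
chunks = [(rng.normal(size=(n, 3)), rng.normal(size=n)) for n in (40, 25, 60)]

# Each node computes only its p x p Gram matrix and length-p vector, never raw data
local_summaries = [(X.T @ X, X.T @ y) for X, y in chunks]

# A coordinator sums the summaries: X'X = sum_i X_i'X_i and X'y = sum_i X_i'y_i
XtX = sum(G for G, _ in local_summaries)
Xty = sum(v for _, v in local_summaries)
beta_dist = np.linalg.solve(XtX, Xty)

# Same beta as fitting on the concatenated data
X_all = np.vstack([X for X, _ in chunks])
y_all = np.concatenate([y for _, y in chunks])
beta_global = np.linalg.solve(X_all.T @ X_all, X_all.T @ y_all)
print(np.allclose(beta_dist, beta_global))  # True
```

Note this requires sharing more than the fitted $\hat{\beta}^{(i)}$, so it belongs to the "former" case, not the "combine finished models" one.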

If the situation is the latter (i.e., you want each node to solve a model locally, and the local models to then be combined), then again there are two cases; in neither of them can I think of a way to obtain exactly the same solution as the global one:

  • If you would like to use linear estimators, then hierarchical linear modeling seems a reasonable approach.

  • You might find that using random forests yields good results. Random forests work by averaging multiple trees, each seeing a "different version" of the data; your local nodes already do that.
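Related to the first option above: if each station can report its coefficient estimates together with their standard errors (both standard regression outputs, so within the asker's constraints), a fixed-effects, inverse-variance pooling in the style of meta-analysis gives a simple combiner. This is a sketch with invented numbers; it ignores the covariance between coefficients and will generally not equal the pooled-data $\beta$ exactly.

```python
import numpy as np

# Hypothetical per-station outputs: coefficient estimates and their standard
# errors, as reported by each station's regression summary -- no raw data needed.
betas = np.array([[1.1, -1.9],    # station 1: beta_sex, beta_age
                  [0.9, -2.2]])   # station 2
ses   = np.array([[0.20, 0.30],
                  [0.25, 0.25]])

# Fixed-effects (inverse-variance) pooling, coefficient by coefficient:
# beta_pooled_j = sum_i w_ij * beta_ij / sum_i w_ij, with w_ij = 1 / se_ij^2
w = 1.0 / ses**2
beta_pooled = (w * betas).sum(axis=0) / w.sum(axis=0)
print(beta_pooled)
```

The more precisely a station estimates a coefficient, the more weight it gets; this is a flat (non-hierarchical) approximation to what a hierarchical linear model would do more carefully.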

In any case, I don't think there is a simple formula that will solve your problem. If you actually need this, you might consider using a framework that already does distributed machine learning, e.g., Spark MLlib.