Solved – Multivariate regression with weighted least squares in python

least squaresmultivariate regressionpythonregression

I have a multivariate regression problem that I need to solve using the weighted least squares method.
In particular, I have a dataset X which is a 2D array. It consists of a number of observations, n, and each observation is represented by one row. Each observation also consists of a number of features, m. So that means each row has m columns. Therefore my dataset X is a n×m array.

Given a test data observation, multivariate regression should produce a function that predicts the response vector y, which is a 2D array as well. This function will consist of m coefficients, i.e. one coefficient/parameter for each of the m features of the test input.

This solution is already available here: Statsmodels' WLS, except that they don't support a 2D response vector yet. In other words, when I fit the data, I have to provide my dataset X, but can only provide a 1D array as the response y.

In addition, I also need a 2D weights vector, similar in dimension to the response vector y.

Is there a Python implementation of WLS multivariate regression where y and the weights can be 2D vectors?

Or if not a direct implementation, can any of the existing packages be used as an implementation somehow, by a small amount of adjustment?

Edit

To make my question clearer, these are the parameters I would give in and the results I would need to get out:

Input:

  • X : a 2D dataset, like 10×3, which is 10 observations with 3 features each.

  • y : which is also a 2D vector, in this case 10×2. In other words, a 2-value response vector for each observation. (I'm doing classification and there are two possible classes).

  • weights: a 2D response vector which is also 10×2, like y.

The 10 above is an arbitrary number of rows. Ultimately, however many observations I have, that's just what the number of rows is going to be, for all vectors above.

Output I need:

  • the coefficients of the regression. Given that my response and weight vectors are 2D, I believe the coefficients would also be a 2D array, probably 3×2 or 2×3.

Best Answer

It's still not entirely clear to me what you want to do, but if your weights are 1d, you can (ab)use sm.WLS to do this.

import numpy as np
import statsmodels.api as sm
np.random.seed(12345)

N = 30

X = np.random.uniform(-20, 20, size=(N,10))
beta = np.random.randn(11)
X = sm.add_constant(X)

weights = np.random.uniform(1, 20, size=(N,))
weights = weights/weights.sum()

y = np.dot(X, beta) + weights*np.random.uniform(-100, 100, size=(N,))

Y = np.c_[y,y,y]

mod = sm.WLS(Y, X, weights=1/weights).fit()

If your weights are not 1d, WLS will indeed break, because it's not designed for this case. You can use a loop over WLS or just roll your own solution depending on what exactly you want to do.

weights = np.random.uniform(1, 20, size=(N,3))
weights = weights/weights.sum(0)
y = np.dot(X, beta)[:,None] + weights*np.random.uniform(-100, 100, size=(N,3))

This is the entirety of the WLS solution for each equation, assuming this is what you want to do

beta_hat = np.array([np.linalg.pinv(1/weights[:,i,None]**.5 * X).dot(y[:,i]) for i in range(3)])