Solved – How to compute the partial derivative of the cost function of mean-regularized multi-task learning

machine learning, regression

Background: This is the cost function of Mean Regularized Multi-Task Learning.
This is a typical linear regression learning model, with the only difference being that multiple tasks are trained at the same time. So X gains an additional 3rd dimension, and W and Y a 2nd dimension.
X is the training data, Y the targets, W the weights; m is the number of tasks (the 3rd dimension), d the number of features, and $n_i$ the number of examples for task $i$.

$X\in R^{n_i\times d \times m}$,
$Y\in R^{n_i\times m }$,
$W\in R^{d \times m}$


Question:
Given the cost function

$$
J =\min_W \frac{1}{2}||XW-Y||_F^2+\lambda\sum_{i=1}^m||W_i-\frac{1}{m}\sum_{s=1}^mW_s||^2_2
$$
What is
$\frac{\partial}{\partial W}J$?

I need to calculate the partial derivatives so they can be used with a steepest-descent (gradient descent) optimization algorithm. I was thinking of calculating the derivative both with respect to a single weight and with respect to the whole matrix. See my answer for my calculations so far.
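For concreteness, here is a minimal NumPy sketch of the cost as I understand it; the data layout (per-task lists `Xs`, `Ys`, since $n_i$ can differ across tasks, and a $d\times m$ matrix `W`) is only an illustrative assumption:

```python
import numpy as np

def cost(Xs, Ys, W, lam):
    """Mean-regularized multi-task cost J (sketch, assumed data layout).

    Xs : list of m arrays, Xs[i] has shape (n_i, d)
    Ys : list of m arrays, Ys[i] has shape (n_i,)
    W  : array of shape (d, m); column i is the weight vector of task i
    lam: regularization strength lambda
    """
    m = W.shape[1]
    W_bar = W.mean(axis=1)  # (1/m) * sum_s W_s
    data_term = 0.5 * sum(np.sum((Xs[i] @ W[:, i] - Ys[i]) ** 2) for i in range(m))
    penalty = lam * sum(np.sum((W[:, i] - W_bar) ** 2) for i in range(m))
    return data_term + penalty
```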

Best Answer

Just some general advice

  • try to limit indexing where possible, and use matrix algebra

  • As the dimension of $Y_i$ varies with $i$, it is best not to store $Y$ as a single matrix. Treat each $i$ separately.

  • Alternatively, you could define $Y$ as one very long vector $Y=(Y_1^T,\dots,Y_m^T)^T$. Similarly, $X$ would have $\sum_{i=1}^{m}n_i$ rows and $d$ columns. But then $W$ needs to be redefined as the direct sum $W=\oplus_{i=1}^{m}W_i$, which now has structural zeros. Too complicated to work with...

  • don't use indices more than once. For example, you use $w_{j,k}$ and also use $j,k$ as summation variables; use $w_{r,s}$ instead.

So I would write your cost function as $$J=\frac{1}{2}\sum_{i=1}^{m}(X_iW_i-Y_i)^T (X_iW_i-Y_i) +\lambda\sum_{i=1}^{m} (W_i-\overline{W})^T (W_i-\overline{W})$$ where $\overline{W}=\frac{1}{m}\sum_{i=1}^{m}W_i$. Now using the chain rule we have $\frac{\partial\, e^Te}{\partial W_r}= 2\left(\frac{\partial e}{\partial W_r}\right)^T e$

So you have $$ \frac{\partial J}{\partial W_r} =X_r^T (X_rW_r-Y_r) +2\lambda \sum_{i=1}^{m} \left(\frac{\partial W_i}{\partial W_r}- \frac{\partial \overline{W} }{\partial W_r}\right)(W_i-\overline{W})$$ $$ =X_r^T (X_rW_r-Y_r) +2\lambda (W_r-\overline{W})$$ since $\frac{\partial W_i}{\partial W_r}=\delta_{ir}I$, $\frac{\partial \overline{W}}{\partial W_r}=\frac{1}{m}I$, and $\sum_{i=1}^{m}(W_i-\overline{W})=0$.

This is not the answer you have.
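For illustration, here is a sketch of one steepest-descent step built directly from this per-task gradient; it reuses the assumed layout from the question's sketch (lists `Xs`, `Ys`, matrix `W`), and the step size `eta` is an arbitrary placeholder:

```python
import numpy as np

def gradient(Xs, Ys, W, lam):
    """Stack the per-task gradients dJ/dW_r into a (d, m) array."""
    m = W.shape[1]
    W_bar = W.mean(axis=1)
    grad = np.empty_like(W)
    for r in range(m):
        residual = Xs[r] @ W[:, r] - Ys[r]                       # X_r W_r - Y_r
        grad[:, r] = Xs[r].T @ residual + 2 * lam * (W[:, r] - W_bar)
    return grad

def descent_step(Xs, Ys, W, lam, eta=1e-3):
    """One steepest-descent update: W <- W - eta * dJ/dW."""
    return W - eta * gradient(Xs, Ys, W, lam)
```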

update

One way you can re-express the equations is by setting $X=\oplus_{i=1}^mX_i$ (which has $\sum_{i=1}^mn_i$ rows and $dm$ columns), $w=(W_1^T,\dots, W_m^T)^T$, and $Y=(Y_1^T,\dots, Y_m^T)^T$. We can also re-express the penalty term as $\sum_{i=1}^m(W_i-\overline {W})^T (W_i-\overline {W})=w^Tw-m\overline {W}^T\overline {W} =w^T (I-m^{-1}G^TG) w$, where $G$ is the $d\times md$ matrix which computes the totals $\sum_{i=1}^mW_i$ from $w$. So the $k$th row of $G$ has ones in columns $k, d+k, 2d+k, \dots, (m-1)d+k$ and zeros everywhere else. We can also write the data term as $\frac{1}{2}(Y-Xw)^T (Y-Xw)$. Hence an explicit solution is given as

$$\hat {w}=\left [X^TX +2\lambda (I-m^{-1} G^TG)\right]^{-1} X^TY $$
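As a sketch, this explicit solution can be assembled and solved directly (same assumed per-task lists as before; the dense block-diagonal $X$ is built only for clarity, so this is fine for small $dm$):

```python
import numpy as np
from scipy.linalg import block_diag

def closed_form(Xs, Ys, lam):
    """Solve [X^T X + 2*lam*(I - G^T G / m)] w = X^T Y and return W as (d, m)."""
    m = len(Xs)
    d = Xs[0].shape[1]
    X = block_diag(*Xs)            # (sum_i n_i, d*m), direct sum of the X_i
    Y = np.concatenate(Ys)         # stacked targets (Y_1^T, ..., Y_m^T)^T
    G = np.tile(np.eye(d), m)      # (d, d*m): row k has ones in columns k, d+k, ...
    M = X.T @ X + 2 * lam * (np.eye(d * m) - G.T @ G / m)
    w = np.linalg.solve(M, X.T @ Y)
    return w.reshape(m, d).T       # unstack (W_1^T, ..., W_m^T)^T into columns
```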

It will probably be more efficient to implement by first using the Woodbury matrix identity, as $X^TX$ is a block-diagonal matrix.
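One way to exploit that structure (a sketch under the same assumed data layout): split the system matrix as $A - \frac{2\lambda}{m}G^TG$ with $A = X^TX + 2\lambda I$ block-diagonal, and apply Woodbury so that only $d\times d$ systems are ever factorized:

```python
import numpy as np

def closed_form_woodbury(Xs, Ys, lam):
    """Same solution as the direct solve, via Woodbury (sketch):

    (A - (2*lam/m) G^T G)^{-1} = A^{-1} + A^{-1} G^T S^{-1} G A^{-1},
    with S = (m / (2*lam)) I - G A^{-1} G^T and A = X^T X + 2*lam*I block-diagonal,
    so every linear solve is only d x d.
    """
    m = len(Xs)
    d = Xs[0].shape[1]
    A_blocks = [X.T @ X + 2 * lam * np.eye(d) for X in Xs]   # diagonal blocks of A
    b_blocks = [X.T @ Y for X, Y in zip(Xs, Ys)]             # blocks of X^T Y
    z_blocks = [np.linalg.solve(A, b)                        # z = A^{-1} X^T Y, block-wise
                for A, b in zip(A_blocks, b_blocks)]
    GAinvGT = sum(np.linalg.inv(A) for A in A_blocks)        # G A^{-1} G^T = sum_i A_i^{-1}
    Gz = sum(z_blocks)                                       # G z = sum_i z_i
    S = (m / (2 * lam)) * np.eye(d) - GAinvGT                # small d x d correction system
    s = np.linalg.solve(S, Gz)
    # w_i = z_i + A_i^{-1} s, stacked as the columns of W
    return np.column_stack([z + np.linalg.solve(A, s)
                            for A, z in zip(A_blocks, z_blocks)])
```

For large $m$ and moderate $d$ this never forms the $dm\times dm$ system matrix explicitly.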