Softmax for MNIST should be able to achieve pretty decent results (>95% accuracy) without any tricks. It can be mini-batch based or just single-sample SGD. For example, a tutorial implementation developed from scratch is available at: https://github.com/2015xli/multilayer-perceptron.
It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but the logic would be essentially the same written in C++.
Without seeing your code, it is hard to tell exactly where the problem in your implementation comes from (although your description of how softmax works is confusing). The potential fixes are as follows.
Firstly, I suspect you did not implement softmax correctly. For one thing, your definition of the cost function is problematic: it does not say how the index $i$ is chosen, and it does not use the label $y$ at all. The cost function should be:
$$
\mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{k-1} \delta_{i=y} \ln P(i\mid x)\,,
$$
where $k$ is the number of classes ($k=10$ for MNIST) and $\delta_{i=y}$ is 1 when $i$ equals the label $y$ of sample $x$ and 0 otherwise.
It looks like the $i$ in your definition refers to the label $y$. If that is the case, your definition is correct, but such an unclear definition is highly likely to result in unclear code (i.e., an incorrect implementation).
For example, your weight-update equation seems correct, but it is subtle to get all the indices right, especially for mini-batch computation. I would suggest using single-sample SGD (i.e., a mini-batch size of 1) to try again. It is much simpler, while giving almost the same accuracy as larger batch sizes. With a single sample, the weight update (with learning rate $\eta$) is simply:
$$
w_{nl}^{(t)} = w_{nl}^{(t-1)} + \eta\, x_l\left(\delta_{n=y} - P(n\mid x)\right)
$$
(note the plus sign: gradient descent moves $w_{nl}$ in the direction $\delta_{n=y} - P(n\mid x)$, which is the negative gradient of the loss above).
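To make this concrete, here is a minimal numpy sketch of the loss and the single-sample update (all names like `W`, `b`, `lr` are my own, not taken from your code, and the learning rate is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, b, x, y, lr=0.1):
    """One single-sample SGD step for softmax regression.

    W: (10, 784) weights, b: (10,) biases,
    x: (784,) input pixels, y: integer label in 0..9.
    """
    p = softmax(W @ x + b)        # P(i|x) for all classes i
    loss = -np.log(p[y])          # cross-entropy for this sample
    d = p.copy()
    d[y] -= 1.0                   # d_n = P(n|x) - delta_{n=y}
    W -= lr * np.outer(d, x)      # i.e., w_nl += lr * x_l * (delta_{n=y} - P(n|x))
    b -= lr * d
    return loss
```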
Secondly, you could try normalizing the input by dividing $x$ by 255, and initializing the weights to small random values. That can help avoid the "collapse" you see, although it would not be the root cause of it.
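For example (again only a sketch; the scaling constant 255 comes from the pixel range, while the 0.01 factor is just a common choice):

```python
import numpy as np

images = np.random.randint(0, 256, size=(60000, 784))  # stand-in for raw MNIST pixels
X = images.astype(np.float32) / 255.0                  # scale [0, 255] down to [0, 1]
W = 0.01 * np.random.randn(10, 784)                    # small random initial weights
b = np.zeros(10)                                       # biases can simply start at zero
```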
Finally, assuming you have everything implemented correctly, you could try adding more layers to improve accuracy, since they introduce more capacity and non-linearity. The tutorial code above uses two layers: layer one is a 784$\rightarrow$785 fully connected layer with a sigmoid activation for non-linearity, and layer two is a 785$\rightarrow$10 softmax classifier. (The hidden-layer size and activation function are design choices and can vary.)
That being said, a single softmax layer should still give you about 85% accuracy, and the training should not collapse. You can try this with the example code by removing the hidden layer.
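If you do keep the hidden layer, the forward pass of the two-layer network described above might look like the following sketch (the hidden size of 785 just follows the description above; the names are mine, not the tutorial's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer MLP: 784 -> 785 sigmoid hidden layer, 785 -> 10 softmax output."""
    h = sigmoid(W1 @ x + b1)      # hidden activations, shape (785,)
    z = W2 @ h + b2               # output logits, shape (10,)
    p = np.exp(z - z.max())       # stabilized softmax
    return h, p / p.sum()         # P(i|x) over the 10 classes
```

Backpropagation through the hidden layer then just multiplies the softmax-layer gradient by the sigmoid derivative $h(1-h)$.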
Feel free to let me know if you have more questions, or send me your code for a check-up.
The difference is that the first formula is the derivative of just the (linear) output of a perceptron, while the second is the derivative of the non-linear activation applied to that output.
When stacking perceptron layers (an MLP, multi-layer perceptron), you have to add some non-linearity to the output of each layer; otherwise the whole process is linear (and can be modeled with a single layer).
So the output of the perceptron (or, more precisely, the input to the next layer) becomes:
$$
o = f(w^\top x + b),
$$
where $f$ is the non-linear activation (e.g., a sigmoid). The derivative then picks up the chain-rule factor $f'(w^\top x + b)$, as in your second formula.
If you are not using a non-linear activation (a single linear layer), the output is:
$$
o = w^\top x + b,
$$
and the derivative is as in your first formula.
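You can check the collapse claim numerically; in this sketch (shapes are arbitrary), two stacked linear layers are reproduced exactly by a single layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2         # stacked linear layers, no activation
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single linear layer
print(np.allclose(two_layers, one_layer))    # True
```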
Best Answer
Just some general advice:

- Try to limit indexing where possible, and use matrix algebra.
- As the dimension of $Y_i$ varies with $i$, it is best not to store $Y$ as a matrix; treat each $i$ separately. Alternatively, you could define $Y$ as one very long vector $Y=(Y_1^T,\dots,Y_m^T)^T$; similarly, $X$ would have $\sum_{i=1}^{m}n_i$ rows and $d$ columns. But then $W$ needs to be redefined as the direct sum $W=\oplus_{i=1}^{m}W_i$, which introduces structural zeros. Too complicated to work with.
- Don't use indices more than once. For example, you use $w_{j,k}$ and also use $j,k$ as summation variables; use $w_{r,s}$ instead.
So I would write your cost function as $$J=\sum_{i=1}^{m}\left[\frac{1}{2}(X_iW_i-Y_i)^T (X_iW_i-Y_i) +\lambda (W_i-\overline{W})^T (W_i-\overline{W})\right],$$ where $\overline{W}=\frac{1}{m}\sum_{i=1}^{m}W_i$. Now, using the chain rule, for an error vector $e$ we have $\frac{\partial\, e^Te}{\partial W_r}= 2\frac{\partial e}{\partial W_r} e$.
So you have $$ \frac{\partial J}{\partial W_r} =X_r^T (X_rW_r-Y_r) +2\lambda \sum_{i=1}^{m} \left(\frac{\partial W_i}{\partial W_r}- \frac{\partial \overline{W} }{\partial W_r}\right)(W_i-\overline{W})$$ $$ =X_r^T (X_rW_r-Y_r) +2\lambda (W_r-\overline{W})\,,$$ since $\frac{\partial W_i}{\partial W_r}=\delta_{ir}I$, $\frac{\partial \overline{W}}{\partial W_r}=\frac{1}{m}I$, and $\sum_{i=1}^{m}(W_i-\overline{W})=0$.
This is not the answer you have.
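A quick finite-difference check (a sketch with arbitrary shapes, treating each $W_i$ as a vector and with a made-up $\lambda$) confirms the closed form above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 4, 3, 0.5
n = [5, 6, 7, 8]                                   # n_i rows per group
X = [rng.standard_normal((n[i], d)) for i in range(m)]
Y = [rng.standard_normal(n[i]) for i in range(m)]
W = [rng.standard_normal(d) for _ in range(m)]

def J(W):
    Wbar = sum(W) / m
    return sum(0.5 * (X[i] @ W[i] - Y[i]) @ (X[i] @ W[i] - Y[i])
               + lam * (W[i] - Wbar) @ (W[i] - Wbar) for i in range(m))

r = 2
Wbar = sum(W) / m
analytic = X[r].T @ (X[r] @ W[r] - Y[r]) + 2 * lam * (W[r] - Wbar)

eps, numeric = 1e-6, np.zeros(d)
for k in range(d):
    Wp = [w.copy() for w in W]
    Wp[r][k] += eps                                # perturb one coordinate of W_r
    numeric[k] = (J(Wp) - J(W)) / eps
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```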
Update
One way you can re-express the equations is by setting $X=\oplus_{i=1}^m X_i$ (which has $\sum_{i=1}^m n_i$ rows and $dm$ columns), $w=(W_1^T,\dots, W_m^T)^T$, and $Y=(Y_1^T,\dots, Y_m^T)^T$. We can also re-express the penalty term as $\sum_{i=1}^m(W_i-\overline {W})^T (W_i-\overline {W})=w^Tw-m\overline {W}^T\overline {W} =w^T (I-m^{-1}G^TG) w$, where $G$ is the $d\times md$ matrix that computes the totals of the $W_i$ from $w$: the $k$th row of $G$ has ones in columns $k, d+k, 2d+k, \dots, (m-1)d+k$ and zeros everywhere else (so $Gw=\sum_{i=1}^m W_i=m\overline{W}$). The data-fit term becomes $\frac{1}{2}(Y-Xw)^T (Y-Xw)$. Hence an explicit solution is given by
$$\hat {w}=\left [X^TX +2\lambda (I-m^{-1} G^TG)\right]^{-1} X^TY $$
It will probably be more efficient to implement this by first applying the Woodbury matrix identity: $X^TX+2\lambda I$ is block-diagonal (hence cheap to invert block by block), and the remaining term $-2\lambda m^{-1} G^TG$ has rank $d$, so Woodbury handles the correction.
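Here is a sketch of the explicit solution (assuming scipy is available for `block_diag`; shapes and $\lambda$ are made up):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
m, d, lam = 4, 3, 0.5
Xi = [rng.standard_normal((5 + i, d)) for i in range(m)]   # varying n_i per group
Yi = [rng.standard_normal(5 + i) for i in range(m)]

X = block_diag(*Xi)                  # sum(n_i) x md block-diagonal design matrix
Y = np.concatenate(Yi)               # stacked response vector
G = np.tile(np.eye(d), m)            # d x md: row k has ones in columns k, d+k, ...
P = np.eye(m * d) - (G.T @ G) / m    # penalty matrix I - G'G/m

w_hat = np.linalg.solve(X.T @ X + 2 * lam * P, X.T @ Y)
W_hat = w_hat.reshape(m, d)          # recover the individual W_i
```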