Computing the gradient of the squared error function

derivatives, gradient descent, vector analysis

For gradient descent, the squared error function is the following:

$$L(x_i, y_i, \beta) = (y_i - x_i \cdot \beta)^2$$

I am looking for the gradient with respect to $\beta$.

According to the chain rule $\frac{d}{dx}f(g(x)) = f'(g(x))\times g'(x)$:

\begin{align*}
\frac{\partial L(x_i, y_i, \beta)}{\partial\beta} &= 2(y_i - x_i \cdot \beta)\cdot(-x_i)\\
&= -2x_i(y_i - x_i \cdot \beta)
\end{align*}

So, in his gradient descent code, why does Joel Grus write:

import numpy as np

def predict(x_i, beta):
    """assumes that the first element of each x_i is 1"""
    return np.dot(x_i, beta)

def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)

def squared_error_gradient(x_i, y_i, beta):
    """the gradient (with respect to beta)
    corresponding to the ith squared error term"""
    return [-2 * x_ij * error(x_i, y_i, beta)
            for x_ij in x_i]

$$\left[\,-2x_{ij}(y_i - x_i \cdot \beta)\ \text{for}\ x_{ij}\ \text{in}\ x_i\,\right]$$

Which I guess is actually:

$$= \sum_{j=1}^{\mathrm{len}(x_i)} -2x_{ij}(y_i - x_i \cdot \beta)$$

Best Answer

There seems to be a bit of confusion about the nature of the objects $x_i$ and $\beta$: they are vectors, e.g. $x_i = (x_i^1, x_i^2, \ldots, x_i^j, \ldots)^T$ and $\beta = (\beta_1, \beta_2, \ldots, \beta_j, \ldots)^T$, where $x_i^j$ denotes the $j$-th component of the vector $x_i$.

So the error for that observation is

$$ L(x_i, y_i, \beta) = (y_i - x_i\cdot \beta)^2 = \left(y_i - x_i^1\beta_1 - x_i^2\beta_2 - \cdots - x_i^j\beta_j - \cdots \right)^2 \tag{1} $$
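To make the dot product concrete, here is a minimal numeric sketch (the values below are made up for illustration, not from the book) showing that $x_i \cdot \beta$ in equation (1) is the same as the written-out sum of componentwise products:

import numpy as np

# Toy values for a single observation (hypothetical numbers; the first
# feature is the constant 1, as assumed by Grus's predict())
x_i = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -1.0, 0.25])
y_i = 1.0

# Loss via the dot product, as in equation (1)
loss_dot = (y_i - np.dot(x_i, beta)) ** 2

# Same loss written out component by component
loss_expanded = (y_i - x_i[0]*beta[0] - x_i[1]*beta[1] - x_i[2]*beta[2]) ** 2

print(loss_dot, loss_expanded)  # both print 3.0625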

Now comes the gradient, which is also a vector; its $j$-th component is

$$ \frac{\partial L}{\partial \beta_j}(x_i, y_i, \beta) = \left(y_i - x_i^1\beta_1 - x_i^2\beta_2 - \cdots - x_i^j\beta_j - \cdots \right) (-2 x_{i}^j) = -2 x_i^j (y_i - x_i\cdot\beta) \tag{2} $$
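Equation (2) can be sanity-checked numerically: the analytic partial derivative should match a central finite difference of the loss. The numbers and the local loss helper below are illustrative assumptions, not code from Grus's book:

import numpy as np

# Same toy observation as above (hypothetical values)
x_i = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -1.0, 0.25])
y_i = 1.0

def loss(beta):
    return (y_i - np.dot(x_i, beta)) ** 2

j, h = 1, 1e-6  # check component j with step size h

# Analytic partial derivative from equation (2)
analytic = -2 * x_i[j] * (y_i - np.dot(x_i, beta))

# Central finite difference: perturb only beta_j
beta_plus, beta_minus = beta.copy(), beta.copy()
beta_plus[j] += h
beta_minus[j] -= h
numeric = (loss(beta_plus) - loss(beta_minus)) / (2 * h)

print(analytic, numeric)  # both approximately -7.0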

The actual gradient is just

$$ \frac{\partial L}{\partial \beta} = \left(\frac{\partial L}{\partial \beta_1}, \frac{\partial L}{\partial \beta_2}, \ldots, \frac{\partial L}{\partial \beta_j}, \ldots\right)^T \tag{3} $$

which is the array in the reference you cite.
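As a follow-up sketch (again with made-up numbers, not the book's code), the list comprehension from the question builds exactly the vector in equation (3); an equivalent vectorized NumPy one-liner makes that explicit:

import numpy as np

x_i = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -1.0, 0.25])
y_i = 1.0

err = y_i - np.dot(x_i, beta)  # scalar residual for observation i

# One partial derivative per component, as in the book's list comprehension
grad_list = [-2 * x_ij * err for x_ij in x_i]

# The same gradient vector, computed in one vectorized step
grad_vec = -2 * err * x_i

print(grad_list)  # [-3.5, -7.0, -10.5]
print(grad_vec)   # [ -3.5  -7.  -10.5]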