Doubt about how this gradient descent cost function is calculated in Octave/MATLAB, and how exactly it works

gradient-descent, machine-learning, MATLAB, matrices, octave

I am following a machine learning course on Coursera and I am doing the following exercise using Octave (MatLab should be the same).

The exercise is about implementing the cost function used by a gradient descent algorithm.

In the course slide, this is the cost function that I have to implement using Octave:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

So J is a function of the theta parameters, represented by the theta vector (shown in the second equation of the slide).

This is the correct MATLAB/Octave implementation for computing J(theta):

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

% Vectorized cost: X*theta is the m x 1 vector of predictions h_theta(x)
J = (1/(2*m))*sum(((X*theta) - y).^2);

% =========================================================================

end

where:

X is an m x 2 matrix (m rows, 2 columns) whose first column is all ones:

X =

1.0000    6.1101
1.0000    5.5277
1.0000    8.5186
......    ......
......    ......
......    ......

y is a column vector with m elements (the same m as the number of rows of X):

y =

   17.59200
    9.13020
   13.66200
   ........
   ........
   ........

Finally, theta is a 2 x 1 column vector initialized to zeros, like this:

theta = zeros(2, 1); % initialize fitting parameters
theta
theta =

   0
   0
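
Just for completeness, this is a minimal sketch of how I call the function using only the three sample rows quoted above (the real exercise loads the full dataset, so the resulting cost will differ):

X = [1.0000 6.1101;
     1.0000 5.5277;
     1.0000 8.5186];
y = [17.5920; 9.1302; 13.6620];
theta = zeros(2, 1);           % initialize fitting parameters

J = computeCost(X, y, theta)   % with theta = [0; 0] this reduces to (1/(2*m))*sum(y.^2)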

Ok, coming back to my working solution:

J = (1/(2*m))*sum(((X*theta) - y).^2)

specifically to this matrix multiplication (the multiplication between the matrix X and the vector theta): I know that it is a valid matrix multiplication because the number of columns of X (2) is equal to the number of rows of theta (2).
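
A quick sanity check of these dimensions in Octave (assuming X, y and theta are in the workspace as shown above):

size(X)           % m x 2
size(theta)       % 2 x 1
size(X * theta)   % m x 1, the same shape as y, so (X*theta) - y is well defined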

My doubt, which is driving me crazy (it is probably a trivial doubt), is related to the previous course slide context:

$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 \qquad (x_0 = 1)$$

As you can see, in the equation used to calculate the $h_\theta(x)$ value, the slide uses the transposed theta vector, whereas the code does not transpose it.

Why?!

I suspect that it depends only on how the theta vector was created. It was built this way:

theta = zeros(2, 1); % initialize fitting parameters

which generates a 2-row, 1-column vector instead of a classic 1-row, 2-column vector, so maybe I do not need to transpose it. But I am absolutely not sure about this.
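
A quick check of the shapes in Octave seems to confirm this:

size(theta)    % 2 x 1: a column vector, so X * theta works without transposing
size(theta')   % 1 x 2: the row shape that the slide's theta^T notation refers to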

Is my intuition correct, or what am I missing?

Best Answer

The way you created the theta vector can be used in both cases. The important difference, and the reason the course slide uses the transpose of $\theta$ while your function does not, is that the operands appear in the opposite order, with you using matrix multiplication and the slide effectively using a dot product.

To see this, first consider what you've done. Here $X$ is an $m \times 2$ matrix, $y$ is an $m \times 1$ matrix and $\theta$ is a $2 \times 1$ matrix. As such, $X \theta$ is the product of an $m \times 2$ matrix and a $2 \times 1$ matrix, resulting in an $m \times 1$ matrix. This is important because it matches the dimensions of $y$, so the expression "X*theta - y" is well defined and yields another $m \times 1$ matrix, which is then further manipulated by squaring each term, summing all of the terms and dividing by $2m$.

As implemented in your function, for each row $x_i$, you are multiplying that row with $\theta$ to get, via matrix multiplication,

$$x_i \times \theta = \left(x_{i0} \; x_{i1}\right) \times {\theta_0 \choose \theta_1} = \left(x_{i0}\theta_0 + x_{i1}\theta_1\right) \tag{1}\label{eq1}$$
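
In Octave terms, \eqref{eq1} for the first training example is simply (an illustration only, using the X and theta from the question):

x_1 = X(1, :);       % 1 x 2 row, e.g. [1.0000 6.1101]
h_1 = x_1 * theta;   % (1 x 2) times (2 x 1) gives the scalar x_10*theta_0 + x_11*theta_1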

In the course slide, note that $\theta^T$ is the transpose, so it's a $1 \times 2$ matrix. Thus, for a row $x_i$, note that $\theta^T x_i$ becomes, if matrix multiplication is used again,

$$\theta^T \times x_i = \left(\theta_0 \; \theta_1\right) \times \left(x_{i0} \; x_{i1}\right) \tag{2}\label{eq2}$$

This doesn't work with matrix multiplication because the number of columns of the left multiplicand, i.e., $2$, doesn't match the number of rows of the right multiplicand, i.e., $1$. However, it does work as a dot product, with the RHS of \eqref{eq2} then instead becoming

$$\left(\theta_0 \; \theta_1\right) \cdot \left(x_{i0} \; x_{i1}\right) = \theta_0 x_{i0} + \theta_1 x_{i1} \tag{3}\label{eq3}$$

This is basically the same as \eqref{eq1} in terms of the resulting value. I believe the slide's $h_{\theta}$ function is written this way because the result of a dot product is defined to be a scalar, while technically the result of the matrix multiplication in \eqref{eq1} is a $1 \times 1$ matrix, which is why I show it wrapped in parentheses. Regardless, in either case, matching the slide's $h_{\theta}$ value requires that $x_{i0} = 1$ for all $i$, as is done and noted in your question text.
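
You can verify this equivalence numerically in Octave (a quick check using the first training example; the variable names are just illustrative):

x_1    = X(1, :)';           % first example as a 2 x 1 column
h_mult = X(1, :) * theta;    % row times column, as in (1)
h_dot  = theta' * x_1;       % theta^T x, as on the slide; equivalently dot(theta, x_1)
% h_mult and h_dot hold the same scalar value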

As you can see, what you're doing is basically equivalent, in terms of the resulting numeric value, to what the slide proposes. However, it's likely simpler and easier to use matrix multiplication in the MATLAB/Octave code, since you can do it in one line and let the environment handle the details of the matrix manipulations behind the scenes.
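
For comparison, here is a sketch of a non-vectorized version that loops over the examples and evaluates $\theta^T x^{(i)}$ one row at a time (computeCostLoop is just an illustrative name, not part of the exercise); it returns the same value as your one-line vectorized form:

function J = computeCostLoop(X, y, theta)
  % Non-vectorized sketch: accumulate the squared error one example at a time
  m = length(y);
  total = 0;
  for i = 1:m
    h_i = theta' * X(i, :)';        % theta^T x^(i), a scalar
    total = total + (h_i - y(i))^2;
  end
  J = total / (2 * m);
end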

If you really want your code to match the slide's function more closely, you could of course define $\theta$ as a $1 \times 2$ matrix instead and use its transpose in the code, although you would need to keep the same order of operations. Alternatively, to match the slide exactly, you would need to use not only $\theta^T$ but also a dot product in the code. One final option is to make the $x_i$ in \eqref{eq2} a $2 \times 1$ column instead of $1 \times 2$, so matrix multiplication would then work. However, I believe this would complicate the rest of the processing, so I wouldn't recommend implementing it that way.
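
For example, the first option might look like this in Octave (a sketch only, assuming the same X, y and m as in the question):

theta_row = zeros(1, 2);                          % theta stored as a 1 x 2 row vector
J = (1/(2*m)) * sum((X * theta_row' - y).^2);     % transpose back to 2 x 1 before multiplying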
