Solved – How to plot logistic decision boundary

logistic, machine learning

I am running logistic regression on a small dataset which looks like this:

[figure: scatter plot of the two-class dataset]

After implementing gradient descent and the cost function, I am getting 100% accuracy in the prediction stage. However, I want to be sure that everything is in order, so I am trying to plot the decision boundary line that separates the two classes.

Below I present plots showing the cost function and the theta parameters. As can be seen, I am currently drawing the decision boundary line incorrectly.

[figure: cost vs. iterations, theta trajectory, and the incorrectly drawn boundary line]

Extracting data

clear all; close all; clc;

alpha = 0.01;
num_iters = 1000;

%% Plotting data
x1 = linspace(0,3,50);
mqtrue = 5;
cqtrue = 30;
dat1 = mqtrue*x1+5*randn(1,50);

x2 = linspace(7,10,50);
dat2 = mqtrue*x2 + (cqtrue + 5*randn(1,50));

x = [x1 x2]'; % X

subplot(2,2,1);
dat = [dat1 dat2]'; % Y

scatter(x1, dat1); hold on;
scatter(x2, dat2, '*'); hold on;
classdata = (dat>40);

Computing Cost, Gradient and plotting

%  Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(x);

% Add intercept term to x and X_test
x = [ones(m, 1) x];

% Initialize fitting parameters
theta = zeros(n + 1, 1);
%initial_theta = [0.2; 0.2];

J_history = zeros(num_iters, 1); 

plot_x = [min(x(:,2))-2,  max(x(:,2))+2]

for iter = 1:num_iters 
% Compute and display initial cost and gradient
    [cost, grad] = logistic_costFunction(theta, x, classdata);
    theta = theta - alpha * grad;
    J_history(iter) = cost;

    fprintf('Iteration #%d - Cost = %d... \r\n',iter, cost);


    subplot(2,2,2);
    hold on; grid on;
    plot(iter, J_history(iter), '.r');  title(sprintf('Plot of cost against number of iterations. Cost is %g',J_history(iter)));
    xlabel('Iterations')
    ylabel('MSE')
    drawnow

    subplot(2,2,3);
    grid on;
    plot3(theta(1), theta(2), J_history(iter),'o')
    title(sprintf('Theta0 = %g, Theta1 = %g', theta(1), theta(2)))
    xlabel('Theta0')
    ylabel('Theta1')
    zlabel('Cost')
    hold on;
    drawnow

    subplot(2,2,1);
    grid on;    
    % Calculate the decision boundary line
    plot_y = theta(2).*plot_x + theta(1);  % <--- Boundary line 
    % Plot, and adjust axes for better viewing
    plot(plot_x, plot_y)
    hold on;
    drawnow

end

fprintf('Cost at final theta: %f\n', cost);
fprintf('Gradient at final theta: \n');
fprintf(' %f \n', grad);

I think the code above implements gradient descent correctly, but I am still unable to plot the boundary line properly. Any suggestions would be appreciated.

logistic_costFunction

function [J, grad] = logistic_costFunction(theta, X, y)

    % Initialize some useful values
    m = length(y); % number of training examples

    grad = zeros(size(theta));

    h = sigmoid(X * theta);
    J = -(1 / m) * sum( (y .* log(h)) + ((1 - y) .* log(1 - h)) );

    for i = 1 : size(theta, 1)
        grad(i) = (1 / m) * sum( (h - y) .* X(:, i) );
    end

end
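The cost function above calls sigmoid, which isn't included in the post; a minimal version, assuming the standard logistic function, would be:

function g = sigmoid(z)
    % Element-wise logistic function, 1 ./ (1 + exp(-z))
    g = 1 ./ (1 + exp(-z));
end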

Best Answer

You have a set of $(x_1,x_2)$ points which belong to two different classes. You label the two classes as $0$ and $1$; these are your target outputs, i.e. your $y$'s. In your implementation you use $y=x_2$, which isn't correct: for example, the cross entropy you use assumes that $y$ is either $0$ or $1$. Since you have a 2D dataset, your boundary line will be of the form $\theta_0+\theta_1x_1+\theta_2x_2=0$, where you use both coordinates. When classifying, you compare the logistic regression output, $h=\sigma(\theta^Tx)$, with some threshold $\tau$, commonly (but not necessarily) chosen as $\tau=1/2$ in balanced datasets like this, and decide class $1$ if $h>\tau$, class $0$ otherwise. This is logistic regression in a nutshell.
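(As a concrete sketch of that decision rule, assuming theta, the intercept-augmented x and classdata from the training loop, plus a sigmoid helper:)

h = sigmoid(x * theta);            % model outputs in (0,1)
pred = (h >= 0.5);                 % threshold tau = 1/2
fprintf('Training accuracy: %.1f%%\n', 100 * mean(pred == classdata));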

Note: In your dataset, the samples can be perfectly classified using only $x_1$ with some threshold, e.g. $x_1>\tau\rightarrow$ class $1$, but assume you don't have this information and want to use all the features for simplicity. This is specific to your dataset. Even in this case, however, your $y$'s will again be $0$ and $1$, and nothing else.
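(A quick sanity check of that claim, assuming x1, x2 and classdata from the question's script; with this synthetic data it should hold for almost any draw of the noise:)

xcoord = [x1 x2]';                 % first coordinate of every sample
tau = 5;                           % any threshold between the two clusters works
fprintf('Separable by x1 alone: %d\n', all((xcoord > tau) == classdata));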

Edit: Here, I'm making some modifications to make your code work. First of all, you train with only the $x_1$ coordinate, so you get estimates for only $\theta_1,\theta_0$. Since your data is 2D, as explained above, you need three parameters, i.e. $\theta_2,\theta_1,\theta_0$, because you are actually seeking a boundary equation in the 2D plane. The reason you see a decrease in your cost is that your data is in fact separable with the $x_1$ coordinate alone. This isn't wrong, but it contradicts your aim of finding a separating boundary line in the x-y plane. If you use only $x_1$, you get a boundary equation of the form $\theta_1x_1+\theta_0=0$, i.e. a vertical line, which is why your plot_y variable oscillates around $0$: you are effectively evaluating the left-hand side of this equation in each iteration. The correct thing to do is to use both $x_1$ and $x_2$ as features and to predict $y$ via estimates of $\theta_0,\theta_1,\theta_2$.

Another practical difficulty here is the scale of your features: $x_2$ has a considerably larger scale than $x_1$, so if you use gradient descent with a constant learning rate, as here, you either need to choose a very small one to suit both, or use a different learning rate for each feature. Better yet, use feature scaling. I've implemented it in your code:

clear all; close all; clc;

alpha = 0.1;
num_iters = 1000;

%% Plotting data
x1 = linspace(0,3,50);
mqtrue = 5;
cqtrue = 30;
dat1 = mqtrue*x1+5*randn(1,50);

x2 = linspace(7,10,50);
dat2 = mqtrue*x2 + (cqtrue + 5*randn(1,50));

x = [x1 x2]'; % X

dat = [dat1 dat2]'; % Y
x = [x dat];

scatter(x1, dat1); hold on;
scatter(x2, dat2, '*'); hold on;
classdata = (dat>40);

%  Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(x);

minX = min(x);
maxX = max(x);

x = (x - repmat(minX,m,1)) ./ repmat(maxX-minX,m,1);
plot_x = [min(x(:,2)),  max(x(:,2))];

% Add intercept term to x and X_test
x = [ones(m, 1) x];

% Initialize fitting parameters
theta = zeros(n + 1, 1);
%initial_theta = [0.2; 0.2];

J_history = zeros(num_iters, 1); 

% alpha = [0.1 0.01 0.001]';
for iter = 1:num_iters 
% Compute and display initial cost and gradient
    [cost, grad] = logistic_costFunction(theta, x, classdata);
    theta = theta - alpha .* grad;
    J_history(iter) = cost;

    fprintf('Iteration #%d - Cost = %d... \r\n',iter, cost);
end


plot_y = -(theta(2)*plot_x + theta(1)) / theta(3);  % <--- Boundary line 
plot((plot_x*(maxX(1)-minX(1))+minX(1)), plot_y*(maxX(2)-minX(2))+minX(2))

fprintf('Cost at final theta: %f\n', cost);
fprintf('Gradient at final theta: \n');
fprintf(' %f \n', grad);

Note that the correct boundary line is: $$x_2=-\frac{\theta_1x_1+\theta_0}{\theta_2}$$ And the output is:

[figure: data points with the fitted decision boundary]
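As an alternative to un-scaling the plotted end points, the learned coefficients can be mapped back to the original coordinates and the same line drawn directly. A minimal sketch, assuming theta, minX, maxX, x1 and x2 as defined in the script above:

r = maxX - minX;                       % per-feature ranges used in the scaling
w1 = theta(2) / r(1);                  % coefficient of x1 in original coordinates
w2 = theta(3) / r(2);                  % coefficient of x2 in original coordinates
w0 = theta(1) - theta(2)*minX(1)/r(1) - theta(3)*minX(2)/r(2);   % intercept
x1_line = linspace(min([x1 x2]), max([x1 x2]), 100);
x2_line = -(w1*x1_line + w0) / w2;     % x2 = -(w1*x1 + w0)/w2
plot(x1_line, x2_line, 'k--');

Both approaches draw the same line; this one just makes the boundary equation explicit in the original units.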