Short answer
In general this is not true. Consider $N-1$ points in $\mathbb{R}^N$ and a line $d$. Place the $N-1$ points however you want on $d$. Now, the intersection of any affine hyperplane of $\mathbb{R}^N$ with $d$ is either empty, a single point, or the whole line, so the hyperplane can only split the points into at most two consecutive runs along $d$. Hence, if $N-1 \ge 3$, not every classification is possible: with three points $p_1, p_2, p_3$ in that order on $d$, the labelling $(+,-,+)$ cannot be realised.
Longer answer
This being said, let's get to the interesting part. It is also always possible to place $N-1$ points in $\mathbb{R}^N$ such that they can be classified in any way you want by an affine hyperplane. In fact, you can place up to $N+1$ points in $\mathbb{R}^N$ and still have that property.
To prove this, we'll explicitly place $N+1$ points in $\mathbb{R}^N$ in a particular way and count the number of ways we can classify those points. If we get $2^{N+1}$ ways of classifying them, it will mean that we can make any classification we want. (Make sure you see why.)
I say that the points
$$ x_1 = (1^1, \dots, 1^N), x_2 = (2^1, \dots, 2^N), \dots, x_{N+1} = ((N+1)^1, \dots, (N+1)^N) $$
will do the job.
Recall that our classification function is $\operatorname{sign}(h(x_i))$, where $h(x_i)$ is the dot product between a weight vector $w=(w_0, w_1, \dots, w_N)$ and our vector $x_i$ augmented with a new coordinate fixed to $1$. That is, writing $\langle \cdot , \cdot \rangle$ for the dot product,
$$
h(x_i) = \langle (w_0, w_1, \dots, w_N), (1, x_{i,1}, \dots, x_{i,N}) \rangle
$$
If we let $P$ be the polynomial in $\mathbb{R}[X]$ given by $P = w_0 + w_1 X + \dots + w_N X^N$, then we find
$$
h(x_i) = P(i)
$$
Now, fix $h(x_1) = P(1) > 0$ (you could also say $P(1) < 0$). Choose any $d \le N$ and $u_1, \dots, u_d \in \{2, 3, \dots, N+1\}$ such that $u_1 < u_2 < \dots < u_d$. We can choose $P$ (that is, choose the weights $w_i$ using polynomial interpolation) such that
$$
\begin{array}{l}
P(1) > 0, \dots, P(u_1 - 1) > 0\\
P(u_1) < 0, \dots, P(u_2 - 1) < 0\\
P(u_2) > 0, \dots, P(u_3 - 1) > 0\\
\vdots\\
(-1)^d P(u_d) > 0, \dots, (-1)^d P(N+1) > 0
\end{array}
$$
We thus have an alternation in our classification over the intervals $[1, u_1), [u_1, u_2), \dots, [u_d, N+1]$.
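If you want to see this interpolation step concretely, here is a small MATLAB/Octave sketch (my own illustration, not part of the original argument): it fixes $N = 4$, chooses an arbitrary sign pattern for the $N+1 = 5$ points $x_i = (i^1, \dots, i^N)$, interpolates a degree-$N$ polynomial through values with those signs using polyfit, and checks that $\operatorname{sign}(\langle w, (1, x_i) \rangle)$ reproduces the pattern:
N = 4;                               % dimension; we place N+1 = 5 points in R^N
signs = [1 -1 -1 1 1];               % any desired labelling of the points x_1, ..., x_5
P = polyfit(1:N+1, signs, N);        % degree-N polynomial with P(i) = signs(i), descending powers
w = fliplr(P);                       % weights (w_0, w_1, ..., w_N) in ascending powers
for i = 1:N+1
    xi = (i .^ (1:N))';              % the point x_i = (i^1, i^2, ..., i^N)
    h  = w * [1; xi];                % h(x_i) = w_0 + w_1*i + ... + w_N*i^N = P(i)
    fprintf('point %d: sign(h) = %+d\n', i, sign(h));
end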
It's now time to count. Two different choices of the sign of $P(1)$, of $d$, or of the $u_i$ give different classifications, so there is no double counting. We had two choices for the sign of $P(1)$, and $\binom{N}{d}$ possibilities for $u_1, \dots, u_d$. Therefore, we can classify our $N+1$ points in (at least)
$$
2\cdot \sum\limits_{i=0}^{N} {N\choose i} = 2^{N+1}
$$
ways. We won!
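As a quick sanity check of the count, take $N = 2$: the three points $x_1 = (1,1)$, $x_2 = (2,4)$, $x_3 = (3,9)$ can be classified in $2\cdot\left[\binom{2}{0} + \binom{2}{1} + \binom{2}{2}\right] = 2 \cdot 4 = 8 = 2^3$ ways, i.e. every possible labelling of the three points is realised by some line in the plane.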
Generalisation
I went through that long argument in order to get this number: $2\cdot \sum\limits_{i=0}^{N} {N\choose i}$. It turns out that $N$ points in $\mathbb{R}^d$ can be chosen so they can be classified in
$$
m(N) := 2\cdot \sum\limits_{i=0}^{d} {N-1\choose i}
$$ ways. Furthermore, no set of $N$ points admits more than $m(N)$ classifications.
You can use the proof above to build $N$ points that can be classified in $m(N)$ ways. I found it harder to prove that $m(N)$ is also a maximum (I'd be interested in a simple proof, if someone has one).
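If you want to play with this formula, here is a small MATLAB/Octave snippet (my own, with $d$ fixed to $3$ purely for illustration) that evaluates $m(N)$; note that $m(N) = 2^N$ as long as $N \le d+1$, which is exactly the regime used in the construction above:
d = 3;                                             % ambient dimension
for N = 1:8
    k = 0:min(d, N-1);                             % binomial terms with i > N-1 vanish
    m = 2 * sum(arrayfun(@(i) nchoosek(N-1, i), k));
    fprintf('N = %d: m(N) = %d\n', N, m);
end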
Note: I think I got carried away.
You have a set of $(x_1,x_2)$ points which belong to two different classes. You'll enumerate the two classes as $0$ and $1$, which will be your target outputs, i.e. the $y$'s. In your implementation you use $y=x_2$, which isn't correct: for example, the cross entropy you use assumes that $y$ is either $0$ or $1$. Since you have a 2D dataset, your boundary line will be of the form $\theta_0+\theta_1x_1+\theta_2x_2=0$, which uses both coordinates. When classifying, you compare your logistic regression output, $h=\sigma(\theta^Tx)$, with some threshold $\tau$, commonly (but not necessarily) chosen as $\tau=1/2$ in balanced datasets like this (equivalently, compare $\theta^Tx$ with $0$), and decide class $1$ for $h>\tau$, class $0$ otherwise. This is logistic regression in a nutshell.
Note: In your dataset, the samples can be perfectly classified via $x_1$ alone, with some threshold, e.g. $x_1>\tau\rightarrow$ class $1$, but assume you don't have this information and you want to use all the features for simplicity. This is specific to your dataset. However, even in this case, your $y$'s will again be $0$ and $1$, and nothing else.
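As a tiny illustration of the decision rule just described (the numbers below are made up for illustration, not fitted to your data):
sigmoid = @(z) 1 ./ (1 + exp(-z));      % logistic function
theta = [-4; 1; 0.5];                   % illustrative parameters (theta_0, theta_1, theta_2)
x_aug = [1; 2; 3];                      % a point (x_1, x_2) = (2, 3), augmented with a leading 1
h = sigmoid(theta' * x_aug);            % logistic regression output, a value in (0, 1)
y_hat = double(h > 0.5)                 % predicted class: 1 if h exceeds the threshold 1/2, else 0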
Edit: Here, I'm making some modifications to make your code work. First of all, you do the training with only the $x_1$ coordinate, so you get estimates for only $\theta_1$ and $\theta_0$. Since your data is 2D, as I've explained above, you need three parameters, i.e. $\theta_2,\theta_1,\theta_0$, because you are actually looking for a boundary equation in the 2D plane. The reason you see a decrease in your cost is that your data is in fact separable using the $x_1$ coordinate alone. This isn't wrong, but it contradicts your aim of finding a separating boundary line in the x-y plane. If you use only $x_1$, you'll get a boundary equation in the form of a vertical line, $\theta_1x_1+\theta_0=0$, which is why your plot_y variable is oscillating around $0$: you seem to calculate the RHS estimate of this equation in each iteration. The correct thing to do is to use both $x_1$ and $x_2$ as your features and predict $y$ via the estimates of $\theta_2,\theta_1,\theta_0$.
Another practical difficulty here is the scale of your features. $x_2$ has a considerably larger scale than $x_1$, and if you use gradient descent with a constant learning rate, as here, you need to choose a very small one to suit both features, or use different learning rates for each feature. Better, use feature scaling. I've implemented it in your code:
clear all; close all; clc;
alpha = 0.1;
num_iters = 1000;
%% Plotting data
x1 = linspace(0,3,50);
mqtrue = 5;
cqtrue = 30;
dat1 = mqtrue*x1+5*randn(1,50);
x2 = linspace(7,10,50);
dat2 = mqtrue*x2 + (cqtrue + 5*randn(1,50));
x = [x1 x2]'; % X
dat = [dat1 dat2]'; % Y
x = [x dat];
scatter(x1, dat1); hold on;
scatter(x2, dat2, '*'); hold on;
classdata = (dat>40);
% Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(x);
minX = min(x);
maxX = max(x);
x = (x - repmat(minX,m,1)) ./ repmat(maxX-minX,m,1);
plot_x = [min(x(:,2)), max(x(:,2))];
% Add intercept term to x and X_test
x = [ones(m, 1) x];
% Initialize fitting parameters
theta = zeros(n + 1, 1);
%initial_theta = [0.2; 0.2];
J_history = zeros(num_iters, 1);
% alpha = [0.1 0.01 0.001]';
for iter = 1:num_iters
% Compute and display initial cost and gradient
[cost, grad] = logistic_costFunction(theta, x, classdata);
theta = theta - alpha .* grad;
J_history(iter) = cost;
fprintf('Iteration #%d - Cost = %f... \r\n', iter, cost);
end
plot_y = -(theta(2)*plot_x + theta(1)) / theta(3); % <--- Boundary line
plot((plot_x*(maxX(1)-minX(1))+minX(1)), plot_y*(maxX(2)-minX(2))+minX(2))
fprintf('Cost at final theta: %f\n', cost);
fprintf('Gradient at final theta: \n');
fprintf(' %f \n', grad);
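The script calls logistic_costFunction, which isn't shown here; your own version may differ, but a minimal sketch of such a function (standard cross-entropy cost and its gradient, saved as logistic_costFunction.m) could look like this:
function [cost, grad] = logistic_costFunction(theta, X, y)
% Cross-entropy cost and gradient for logistic regression.
% X is m-by-(n+1) with a leading column of ones, y holds labels in {0,1}.
m = length(y);
h = 1 ./ (1 + exp(-X * theta));                        % sigmoid of the linear scores
cost = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)); % average cross-entropy
grad = (1/m) * (X' * (h - y));                         % gradient with respect to theta
end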
Note that the correct boundary line is:
$$x_2=-\frac{\theta_1x_1+\theta_0}{\theta_2}$$
And the output is a scatter plot of the two classes with the fitted boundary line drawn between them.
Best Answer
Comments to the question suggest the following interpretation: given two disjoint finite sets of points $A$ and $B$ in Euclidean space $E^n$, is there always a polynomial function that is positive on $A$ and negative on $B$?
The answer is yes, by construction.
Let $|\cdot|$ be the usual Euclidean distance. Its square is a quadratic polynomial. Specifically, using any orthogonal coordinate system write $\mathbf{x}=(x_1,\ldots, x_n)$ and $\mathbf{y}=(y_1,\ldots, y_n).$ We have
$$|\mathbf{x}-\mathbf{y}|^2 = \sum_{i=1}^n (x_i-y_i)^2,$$
which explicitly is a quadratic polynomial function of the coordinates.
Define $$f_{A,B}(\mathbf x)=\left[\sum_{\mathbf y\in A}\frac{1}{|\mathbf x-\mathbf y|^2}-\sum_{\mathbf y\in B}\frac{1}{|\mathbf x-\mathbf y|^2}\right]\prod_{\mathbf y\in A\cup B}|\mathbf x-\mathbf y|^2.$$
Notice how $f_{A,B}$ is defined as a product. The product on the right-hand side clears the denominators of the fractions in the bracket, showing that $f_{A,B}$ is actually defined everywhere on $E^n$ and is a polynomial function.
The function in the left term of the product has poles (explodes to $\pm \infty$) precisely at the data points $\mathbf x \in A\cup B.$ At the points of $A$ its values diverge to $+\infty$ and at the points of $B$ its values diverge to $-\infty.$ Because the product at the right is non-negative, we see that in a sufficiently small neighborhood of $A$ $f_{A,B}$ is always positive and in a sufficiently small neighborhood of $B$ $f_{A,B}$ is always negative. Thus $f_{A,B}$ does its job of separating $A$ from $B,$ QED.
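Here is a rough MATLAB/Octave sketch of this construction (mine; the answer's own R script is referenced further below). It implements $f_{A,B}$ with the denominators cleared, i.e. $f_{A,B}(\mathbf x) = \sum_{\mathbf y \in A}\prod_{\mathbf z \ne \mathbf y}|\mathbf x-\mathbf z|^2 - \sum_{\mathbf y \in B}\prod_{\mathbf z \ne \mathbf y}|\mathbf x-\mathbf z|^2$, the polynomial form that is defined at the data points themselves; save it as f_AB.m and call it with one point per row in A and B:
function v = f_AB(x, A, B)
% Evaluate the separating polynomial f_{A,B} at the row vector x.
% A and B are matrices containing the two point sets, one point per row.
P  = [A; B];                               % all data points
d2 = sum(bsxfun(@minus, P, x).^2, 2);      % squared distances |x - z|^2 to every data point z
m  = numel(d2);
nA = size(A, 1);
term = @(j) prod(d2([1:j-1, j+1:m]));      % product over all data points except the j-th
v = sum(arrayfun(term, 1:nA)) - sum(arrayfun(term, nA+1:m));
end
At a point of $A$ every term but one vanishes and the remaining term is a product of positive squared distances, so f_AB returns a positive value there; by the same argument it returns a negative value at every point of $B$.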
Here is an illustration showing the contour $f_{A,B}=0$ for $80$ randomly selected points in the plane $E^2.$ Of these, $43$ were randomly selected to form the subset $A$ (drawn as blue triangles) and others form the subset $B,$ drawn as red circles. You can see this construction works because all blue triangles fall within the gray (positive) region where $f_{A,B}\gt 0$ and all the red circles fall within the interior of its complement where $f_{A,B}\lt 0.$
To see more examples, modify and run the R script that produced the figure. Its function f, defined at the outset, implements the construction of $f_{A,B}.$