Short answer
In general this is not true. Consider $N-1$ points in $\mathbb{R}^N$ and a line $d$. Place the $N-1$ points however you want on $d$. Now, the intersection of any affine hyperplane of $\mathbb{R}^N$ with $d$ is either empty, a single point, or the whole line, so the hyperplane can only split the points into at most two consecutive runs along $d$. Hence, if $N-1 \ge 3$, not every classification is possible: with three points $p_1, p_2, p_3$ in that order on $d$, the labelling $(+,-,+)$ cannot be realised.
Longer answer
This being said, let's get to the interesting part. It is also always possible to place $N-1$ points in $\mathbb{R}^N$ such that they can be classified in any way you want by an affine hyperplane. In fact, you can place up to $N+1$ points in $\mathbb{R}^N$ and still have that property.
To prove this, we'll explicitly place $N+1$ points in $\mathbb{R}^N$ in a particular way and count the number of ways we can classify those points. If we get $2^{N+1}$ ways of classifying them, it will mean that we can make any classification we want. (Make sure you see why.)
I say that the points
$$ x_1 = (1^1, \dots, 1^N), x_2 = (2^1, \dots, 2^N), \dots, x_{N+1} = ((N+1)^1, \dots, (N+1)^N) $$
will do the job.
Recall that our classification function is $\operatorname{sign}(h(x_i))$, where $h(x_i)$ is the dot product between a weight vector $w=(w_0, w_1, \dots, w_N)$ and our vector $x_i$ augmented with a new coordinate fixed to $1$. That is, writing $\langle \cdot , \cdot \rangle$ for the dot product,
$$
h(x_i) = \langle (w_0, w_1, \dots, w_N), (1, x_{i,1}, \dots, x_{i,N}) \rangle
$$
If we let $P$ be the polynomial in $\mathbb{R}[X]$ given by $P = w_0 + w_1 X + \dots + w_N X^N$, then we find
$$
h(x_i) = P(i)
$$
Now, fix $h(x_1) = P(1) > 0$ (you could also say $P(1) < 0$). Choose any $d \le N$ and $u_1, \dots, u_d \in \{2, 3, \dots, N+1\}$ such that $u_1 < u_2 < \dots < u_d$. We can choose $P$ (that is, choose the weights $w_i$ using polynomial interpolation) such that
$$
\begin{array}{l}
P(1) > 0, \dots, P(u_1 - 1) > 0\\
P(u_1) < 0, \dots, P(u_2 - 1) < 0\\
P(u_2) > 0, \dots, P(u_3 - 1) > 0\\
\vdots\\
(-1)^d P(u_d) > 0, \dots, (-1)^d P(N+1) > 0
\end{array}
$$
We thus have an alternation in our classification over the intervals $[1, u_1), [u_1, u_2), \dots, [u_d, N+1]$.
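If you want to see this interpolation step concretely, here is a small MATLAB/Octave sketch (my own illustration, not part of the original argument): it fixes $N = 4$, chooses an arbitrary sign pattern for the $N+1 = 5$ points $x_i = (i^1, \dots, i^N)$, interpolates a degree-$N$ polynomial through values with those signs using polyfit, and checks that $\operatorname{sign}(\langle w, (1, x_i) \rangle)$ reproduces the pattern:
N = 4;                               % dimension; we place N+1 = 5 points in R^N
signs = [1 -1 -1 1 1];               % any desired labelling of the points x_1, ..., x_5
P = polyfit(1:N+1, signs, N);        % degree-N polynomial with P(i) = signs(i), descending powers
w = fliplr(P);                       % weights (w_0, w_1, ..., w_N) in ascending powers
for i = 1:N+1
    xi = (i .^ (1:N))';              % the point x_i = (i^1, i^2, ..., i^N)
    h  = w * [1; xi];                % h(x_i) = w_0 + w_1*i + ... + w_N*i^N = P(i)
    fprintf('point %d: sign(h) = %+d\n', i, sign(h));
end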
It's now time to count. Two different choices of the sign of $P(1)$, of $d$, or of the $u_i$ give different classifications, so there is no double counting. We had two choices for the sign of $P(1)$, and $\binom{N}{d}$ possibilities for $u_1, \dots, u_d$. Therefore, we can classify our $N+1$ points in (at least)
$$
2\cdot \sum\limits_{i=0}^{N} {N\choose i} = 2^{N+1}
$$
ways. We won!
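As a quick sanity check of the count, take $N = 2$: the three points $x_1 = (1,1)$, $x_2 = (2,4)$, $x_3 = (3,9)$ can be classified in $2\cdot\left[\binom{2}{0} + \binom{2}{1} + \binom{2}{2}\right] = 2 \cdot 4 = 8 = 2^3$ ways, i.e. every possible labelling of the three points is realised by some line in the plane.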
Generalisation
I went through that long argument in order to get this number: $2\cdot \sum\limits_{i=0}^{N} {N\choose i}$. It turns out that $N$ points in $\mathbb{R}^d$ can be chosen so they can be classified in
$$
m(N) := 2\cdot \sum\limits_{i=0}^{d} {N-1\choose i}
$$ ways. Furthermore, no set of $N$ points admits more than $m(N)$ classifications.
You can use the proof above to build $N$ points that can be classified in $m(N)$ ways. I found it harder to prove that $m(N)$ is also a maximum (I'd be interested in a simple proof, if someone has one).
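If you want to play with this formula, here is a small MATLAB/Octave snippet (my own, with $d$ fixed to $3$ purely for illustration) that evaluates $m(N)$; note that $m(N) = 2^N$ as long as $N \le d+1$, which is exactly the regime used in the construction above:
d = 3;                                             % ambient dimension
for N = 1:8
    k = 0:min(d, N-1);                             % binomial terms with i > N-1 vanish
    m = 2 * sum(arrayfun(@(i) nchoosek(N-1, i), k));
    fprintf('N = %d: m(N) = %d\n', N, m);
end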
Note: I think I got carried away.
You have a set of $(x_1,x_2)$ points which belong to two different classes. You'll enumerate the two classes as $0$ and $1$, which will be your target outputs, i.e. the $y$'s. In your implementation you use $y=x_2$, which isn't correct: for example, the cross entropy you use assumes that $y$ is either $0$ or $1$. Since you have a 2D dataset, your boundary line will be of the form $\theta_0+\theta_1x_1+\theta_2x_2=0$, which uses both coordinates. When classifying, you compare your logistic regression output, $h=\sigma(\theta^Tx)$, with some threshold $\tau$, commonly (but not necessarily) chosen as $\tau=1/2$ in balanced datasets like this (equivalently, compare $\theta^Tx$ with $0$), and decide class $1$ for $h>\tau$, class $0$ otherwise. This is logistic regression in a nutshell.
Note: In your dataset, the samples can be perfectly classified via $x_1$ alone, with some threshold, e.g. $x_1>\tau\rightarrow$ class $1$, but assume you don't have this information and you want to use all the features for simplicity. This is specific to your dataset. However, even in this case, your $y$'s will again be $0$ and $1$, and nothing else.
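As a tiny illustration of the decision rule just described (the numbers below are made up for illustration, not fitted to your data):
sigmoid = @(z) 1 ./ (1 + exp(-z));      % logistic function
theta = [-4; 1; 0.5];                   % illustrative parameters (theta_0, theta_1, theta_2)
x_aug = [1; 2; 3];                      % a point (x_1, x_2) = (2, 3), augmented with a leading 1
h = sigmoid(theta' * x_aug);            % logistic regression output, a value in (0, 1)
y_hat = double(h > 0.5)                 % predicted class: 1 if h exceeds the threshold 1/2, else 0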
Edit: Here, I'm making some modifications to make your code work. First of all, you do the training with only the $x_1$ coordinate, so you get estimates for only $\theta_1$ and $\theta_0$. Since your data is 2D, as I've explained above, you need three parameters, i.e. $\theta_2,\theta_1,\theta_0$, because you are actually looking for a boundary equation in the 2D plane. The reason you see a decrease in your cost is that your data is in fact separable using the $x_1$ coordinate alone. This isn't wrong, but it contradicts your aim of finding a separating boundary line in the x-y plane. If you use only $x_1$, you'll get a boundary equation in the form of a vertical line, $\theta_1x_1+\theta_0=0$, which is why your plot_y variable is oscillating around $0$: you seem to calculate the RHS estimate of this equation in each iteration. The correct thing to do is to use both $x_1$ and $x_2$ as your features and predict $y$ via the estimates of $\theta_2,\theta_1,\theta_0$.
Another practical difficulty here is the scale of your features. $x_2$ has a considerably larger scale than $x_1$, and if you use gradient descent with a constant learning rate, as here, you need to choose a very small one to suit both features, or use different learning rates for each feature. Better, use feature scaling. I've implemented it in your code:
clear all; close all; clc;
alpha = 0.1;
num_iters = 1000;
%% Plotting data
x1 = linspace(0,3,50);
mqtrue = 5;
cqtrue = 30;
dat1 = mqtrue*x1+5*randn(1,50);
x2 = linspace(7,10,50);
dat2 = mqtrue*x2 + (cqtrue + 5*randn(1,50));
x = [x1 x2]'; % X
dat = [dat1 dat2]'; % Y
x = [x dat];
scatter(x1, dat1); hold on;
scatter(x2, dat2, '*'); hold on;
classdata = (dat>40);
% Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(x);
minX = min(x);
maxX = max(x);
x = (x - repmat(minX,m,1)) ./ repmat(maxX-minX,m,1);
plot_x = [min(x(:,2)), max(x(:,2))];
% Add intercept term to x and X_test
x = [ones(m, 1) x];
% Initialize fitting parameters
theta = zeros(n + 1, 1);
%initial_theta = [0.2; 0.2];
J_history = zeros(num_iters, 1);
% alpha = [0.1 0.01 0.001]';
for iter = 1:num_iters
% Compute and display initial cost and gradient
[cost, grad] = logistic_costFunction(theta, x, classdata);
theta = theta - alpha .* grad;
J_history(iter) = cost;
fprintf('Iteration #%d - Cost = %f... \r\n', iter, cost);
end
plot_y = -(theta(2)*plot_x + theta(1)) / theta(3); % <--- Boundary line
plot((plot_x*(maxX(1)-minX(1))+minX(1)), plot_y*(maxX(2)-minX(2))+minX(2))
fprintf('Cost at final theta: %f\n', cost);
fprintf('Gradient at final theta: \n');
fprintf(' %f \n', grad);
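The script calls logistic_costFunction, which isn't shown here; your own version may differ, but a minimal sketch of such a function (standard cross-entropy cost and its gradient, saved as logistic_costFunction.m) could look like this:
function [cost, grad] = logistic_costFunction(theta, X, y)
% Cross-entropy cost and gradient for logistic regression.
% X is m-by-(n+1) with a leading column of ones, y holds labels in {0,1}.
m = length(y);
h = 1 ./ (1 + exp(-X * theta));                        % sigmoid of the linear scores
cost = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)); % average cross-entropy
grad = (1/m) * (X' * (h - y));                         % gradient with respect to theta
end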
Note that the correct boundary line is:
$$x_2=-\frac{\theta_1x_1+\theta_0}{\theta_2}$$
And the output is a scatter plot of the two classes with the fitted boundary line drawn between them.
Best Answer
Comments to the question suggest the following interpretation: given two disjoint finite sets of points $A$ and $B$ in Euclidean space $E^n$, is there always a polynomial function that is positive on $A$ and negative on $B$?
The answer is yes, by construction.
Let $|\cdot|$ be the usual Euclidean distance. Its square is a quadratic polynomial. Specifically, using any orthogonal coordinate system write $\mathbf{x}=(x_1,\ldots, x_n)$ and $\mathbf{y}=(y_1,\ldots, y_n).$ We have
$$|\mathbf{x}-\mathbf{y}|^2 = \sum_{i=1}^n (x_i-y_i)^2,$$
which explicitly is a quadratic polynomial function of the coordinates.
Define $$f_{A,B}(\mathbf x)=\left[\sum_{\mathbf y\in A}\frac{1}{|\mathbf x-\mathbf y|^2}-\sum_{\mathbf y\in B}\frac{1}{|\mathbf x-\mathbf y|^2}\right]\prod_{\mathbf y\in A\cup B}|\mathbf x-\mathbf y|^2.$$
Notice how $f_{A,B}$ is defined as a product. The product on the right-hand side clears the denominators of the fractions in the bracket, showing that $f_{A,B}$ is actually defined everywhere on $E^n$ and is a polynomial function.
The function in the left term of the product has poles (explodes to $\pm \infty$) precisely at the data points $\mathbf x \in A\cup B.$ At the points of $A$ its values diverge to $+\infty$ and at the points of $B$ its values diverge to $-\infty.$ Because the product at the right is non-negative, we see that in a sufficiently small neighborhood of $A$ $f_{A,B}$ is always positive and in a sufficiently small neighborhood of $B$ $f_{A,B}$ is always negative. Thus $f_{A,B}$ does its job of separating $A$ from $B,$ QED.
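Here is a rough MATLAB/Octave sketch of this construction (mine; the answer's own R script is referenced further below). It implements $f_{A,B}$ with the denominators cleared, i.e. $f_{A,B}(\mathbf x) = \sum_{\mathbf y \in A}\prod_{\mathbf z \ne \mathbf y}|\mathbf x-\mathbf z|^2 - \sum_{\mathbf y \in B}\prod_{\mathbf z \ne \mathbf y}|\mathbf x-\mathbf z|^2$, the polynomial form that is defined at the data points themselves; save it as f_AB.m and call it with one point per row in A and B:
function v = f_AB(x, A, B)
% Evaluate the separating polynomial f_{A,B} at the row vector x.
% A and B are matrices containing the two point sets, one point per row.
P  = [A; B];                               % all data points
d2 = sum(bsxfun(@minus, P, x).^2, 2);      % squared distances |x - z|^2 to every data point z
m  = numel(d2);
nA = size(A, 1);
term = @(j) prod(d2([1:j-1, j+1:m]));      % product over all data points except the j-th
v = sum(arrayfun(term, 1:nA)) - sum(arrayfun(term, nA+1:m));
end
At a point of $A$ every term but one vanishes and the remaining term is a product of positive squared distances, so f_AB returns a positive value there; by the same argument it returns a negative value at every point of $B$.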
Here is an illustration showing the contour $f_{A,B}=0$ for $80$ randomly selected points in the plane $E^2.$ Of these, $43$ were randomly selected to form the subset $A$ (drawn as blue triangles) and others form the subset $B,$ drawn as red circles. You can see this construction works because all blue triangles fall within the gray (positive) region where $f_{A,B}\gt 0$ and all the red circles fall within the interior of its complement where $f_{A,B}\lt 0.$
To see more examples, modify and run the R script that produced the figure. Its function f, defined at the outset, implements the construction of $f_{A,B}.$