Regression – Determining if Any Data Can Be Learned Using Polynomial Logistic Regression in Machine Learning

classification, logistic, machine learning, polynomial, regression

We know that polynomials can approximate any continuous function on a compact set (the Weierstrass approximation theorem). As @DemetriPananos noted, logistic regression seeks to estimate the coefficients of a model, and any cutoff is imposed post facto. But suppose there's a best possible decision boundary for our data. By "best possible" I mean a decision boundary that perfectly separates the two classes.

Assume, for the sake of simplicity, that no data points from the positive class coincide with data points from the negative class (as @Sycorax suggested). For example, consider this plot:

[Figure: a scatter plot of two classes of points separated by a blue curve]

The blue line perfectly separates the two classes, but the blue line itself is not the graph of a function.

  1. If we were to increase the degree of the polynomial in our logistic regression, can we be sure that such a perfect decision boundary will be found for any data that can be perfectly separated?

  2. If the answer to my first question is "yes", how can it be proved (or shown)?

Best Answer

Comments to the question suggest the following interpretation:

Given any two non-overlapping finite collections of points $A$ and $B$ in a Euclidean space $E^n,$ does there always exist a polynomial function $f_{A,B}:E^n\to\mathbb R$ that perfectly separates the collections? That is, $f_{A,B}$ has positive values on all points of $A$ and negative values on all points of $B.$

The answer is yes, by construction.

Let $|\ |$ be the usual Euclidean distance. Its square is a quadratic polynomial. Specifically, using any orthogonal coordinate system write $\mathbf{x}=(x_1,\ldots, x_n)$ and $\mathbf{y}=(y_1,\ldots, y_n).$ We have

$$|\mathbf{x}-\mathbf{y}|^2 = \sum_{i=1}^n (x_i-y_i)^2,$$

which explicitly is a quadratic polynomial function of the coordinates.
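
Written out in full, this is

$$|\mathbf{x}-\mathbf{y}|^2=\sum_{i=1}^n x_i^2-2\sum_{i=1}^n x_iy_i+\sum_{i=1}^n y_i^2,$$

and for each fixed $\mathbf y$ the coefficients are constants, so the right hand side is a quadratic polynomial in $x_1,\ldots,x_n.$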

Define $$f_{A,B}(\mathbf x)=\left[\sum_{\mathbf y\in A}\frac{1}{|\mathbf x-\mathbf y|^2}-\sum_{\mathbf y\in B}\frac{1}{|\mathbf x-\mathbf y|^2}\right]\prod_{\mathbf y\in A\cup B}|\mathbf x-\mathbf y|^2.$$

Notice how $f_{A,B}$ is defined as a product. The factor on the right clears the denominators of the fractions in the bracketed factor on the left, showing that $f_{A,B}$ is in fact defined everywhere on $E^n$ and is a polynomial function.
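
Indeed, multiplying out term by term yields the explicit polynomial form

$$f_{A,B}(\mathbf x)=\sum_{\mathbf y_0\in A\cup B}\epsilon(\mathbf y_0)\prod_{\mathbf y\in A\cup B,\ \mathbf y\ne\mathbf y_0}|\mathbf x-\mathbf y|^2,$$

where $\epsilon(\mathbf y_0)=+1$ for $\mathbf y_0\in A$ and $\epsilon(\mathbf y_0)=-1$ for $\mathbf y_0\in B.$ Each term is a product of $|A|+|B|-1$ quadratics, so $f_{A,B}$ has degree $2(|A|+|B|-1).$ In particular, at a data point $\mathbf y_0$ all terms but one vanish, leaving $f_{A,B}(\mathbf y_0)=\epsilon(\mathbf y_0)\prod_{\mathbf y\ne\mathbf y_0}|\mathbf y_0-\mathbf y|^2,$ which is nonzero and carries the required sign. (The function f in the script below handles this case separately to avoid dividing by zero.)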

The bracketed factor on the left has poles (it diverges to $\pm \infty$) precisely at the data points $\mathbf x \in A\cup B$: at the points of $A$ its values diverge to $+\infty$ and at the points of $B$ they diverge to $-\infty.$ Because the product on the right is non-negative, in a sufficiently small neighborhood of each point of $A$ the function $f_{A,B}$ is positive, and in a sufficiently small neighborhood of each point of $B$ it is negative. Since $f_{A,B}$ is also positive at the points of $A$ themselves and negative at the points of $B$ (as the evaluation above shows directly), $f_{A,B}$ does its job of separating $A$ from $B,$ QED.
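
As a quick sanity check, take singletons $A=\{\mathbf a\}$ and $B=\{\mathbf b\}.$ The construction gives

$$f_{A,B}(\mathbf x)=\left[\frac{1}{|\mathbf x-\mathbf a|^2}-\frac{1}{|\mathbf x-\mathbf b|^2}\right]|\mathbf x-\mathbf a|^2\,|\mathbf x-\mathbf b|^2=|\mathbf x-\mathbf b|^2-|\mathbf x-\mathbf a|^2,$$

which is positive exactly where $\mathbf x$ is closer to $\mathbf a$ than to $\mathbf b$: its zero set is the perpendicular bisector of the segment joining $\mathbf a$ and $\mathbf b,$ the familiar linear separator of two points.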

Here is an illustration showing the contour $f_{A,B}=0$ for $80$ randomly selected points in the plane $E^2.$ Of these, $43$ were randomly selected to form the subset $A$ (drawn as blue triangles) and the others form the subset $B,$ drawn as red circles. You can see the construction works because all blue triangles fall within the gray (positive) region where $f_{A,B}\gt 0$ and all red circles fall within the interior of its complement, where $f_{A,B}\lt 0.$

[Figure: the zero contour of $f_{A,B}$ for the 80 random points; gray shading marks the region where $f_{A,B}>0$]

To see more examples, modify and run the R script below, which produced the figure. Its function f, defined at the outset, implements the construction of $f_{A,B}.$

#
# The columns of `A` hold all of the data points -- both classes, unlike the
# set A in the text.  The values of `I` are +/-1, indicating the subset each
# column belongs to: +1 for the text's A, -1 for B.
#
f <- function(x, A, I) {
  d2 <- colSums((A-x)^2)
  j <- d2 == 0           # At most one point, assuming all points in `A` are unique
  if (sum(j) > 0)        # Avoids division by zero
    return(prod(d2[!j]) * prod(I[j])) 
  sum(I / d2) * prod(d2)
}
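#
# Quick illustration on a tiny one-dimensional example (a sketch; the rest of
# the script does not depend on it): with A = {0} and B = {1}, the construction
# reduces to f(x) = |x-1|^2 - |x-0|^2 = 1 - 2x, positive for x < 1/2 and
# negative for x > 1/2.
#
stopifnot(f(0.2, matrix(c(0, 1), 1), c(1, -1)) > 0,
          f(0.8, matrix(c(0, 1), 1), c(1, -1)) < 0)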
#
# Create random points and a random binary classification of them.
#
# set.seed(17)
d <- 2   # Dimensions                 
n <- 80  # total number of points
p <- 1/2 # Expected Fraction in `A`
A <- matrix(runif(d*n), d)
I <- sample(c(-1,1), ncol(A), replace=TRUE, prob=c(1-p, p))
#
# Check `f` by applying it to the data points and confirming it gives the
# correct signs.
#
I. <- sign(apply(A, 2, f, A=A, I=I))
if (!isTRUE(all.equal(I, I.))) stop("f does not work...")
#
# For plotting, compute values of `f` along a slice through the space.
#
slice <- rep(1/2, d-2) # Choose which slice to plot
X <- Y <- seq(-0.2, 1.2, length.out=201)
Z <- matrix(NA_real_, length(X), length(Y))
for (i in seq_along(X)) for (j in seq_along(Y)) 
    Z[i, j] <- f(c(X[i], Y[j], slice), A, I)
#
# Display a 2D plot: gray where f > 0, white where f < 0.
#
image(X, Y, sign(Z), col=c("White", "Gray"), xaxt="n", yaxt="n", asp=1, bty="n",
      main="Polynomial separator of random points")
contour(X, Y, Z, levels=0, labels="", lwd=2, labcex=0.001, add=TRUE)
points(t(A), pch=ifelse(I==1, 17, 19),   # Triangles for the text's A (I == 1)...
       col=ifelse(I==1, "Blue", "Red"))  # ...in blue; red circles for B
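
To tie this back to the question about logistic regression, here is a minimal sketch, separate from the script above and using only base R. When two classes really are separated by a low-degree polynomial boundary (here, points inside versus outside a circle), supplying the corresponding monomials as features lets glm find a perfectly separating boundary. Because the classes are perfectly separated, the maximum-likelihood coefficients diverge and glm warns that fitted probabilities numerically 0 or 1 occurred; the fitted boundary nevertheless typically classifies every training point correctly.

#
# Sketch: degree-2 logistic regression on data separable by a circle.
#
set.seed(17)
U <- matrix(runif(2*200, -1, 1), ncol=2)  # Random points in a square
U <- U[abs(rowSums(U^2) - 1/2) > 0.05, ]  # Keep a margin around the true boundary
cls <- as.integer(rowSums(U^2) < 1/2)     # 1 inside the circle, 0 outside
x <- U[, 1]; y <- U[, 2]
fit <- glm(cls ~ x + y + I(x^2) + I(y^2) + I(x*y), family=binomial)
# glm warns about fitted probabilities of 0 or 1: the signature of separation.
mean((fitted(fit) > 1/2) == cls)          # Typically 1: all points classified correctly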