[Math] In linear regression, we have 0 training error if data dimension is high, but are there similar results for other supervised learning problems?

machine learning, pr.probability, probability distributions, st.statistics

I tried posting this question on Cross Validated (the stack exchange for statistics) but didn't get an answer, so posting here:

Let's consider a supervised learning problem with training data $\{(x_1,y_1), \dots, (x_n,y_n)\} \subset \mathbb{R}^p \times \mathbb{R}$, where the $x_i \sim x$ are iid observations/samples and the $y_i \sim y$ are iid response variables. $y$ can be either a continuous (regression) or a discrete (classification) random variable. To simplify things, you can treat the $x_i, y_i$'s below as individual inputs and outputs, as opposed to random vectors/variables.

We know that if the learning problem at hand is linear regression, then $p \ge n-1$ is sufficient (for data in general position) to guarantee interpolation, i.e. a hyperplane in $\mathbb{R}^{p+1}$ passing through (and not merely near) all the points $\{(x_1,y_1), \dots, (x_n,y_n)\} \subset \mathbb{R}^p \times \mathbb{R}$, thereby giving us exactly zero training error (and not a small, positive training error).
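For concreteness, here is a minimal numerical sketch of that claim (my own illustration, not part of the argument; it assumes generic random data, an intercept column, and numpy's least-squares routine):

```python
# Minimal sketch of the interpolation claim: with an intercept and
# p >= n - 1, the least-squares fit passes through every training point
# for generic data. Random data are used purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 20
p = n - 1                                      # p >= n - 1

X = rng.standard_normal((n, p))                # generic covariates
y = rng.standard_normal(n)                     # arbitrary responses
A = np.column_stack([np.ones(n), X])           # intercept column: n x (p + 1)

beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
print(np.max(np.abs(A @ beta - y)))            # ~1e-14: zero training error
```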

My question is: is there such a lower bound on the data dimension, a lower bound that's a function of the sample size $n$, that ensures zero training error when the supervised learning problem at hand is not linear regression but, say, a classification problem? To be more specific, assume that we're solving a logistic regression problem (or replace it by your favorite classification algorithm) with $n$ samples of dimension $p$. Now, irrespective of the distribution of the covariates/features, can we come up with a positive-integer-valued function $f$ so that $p \ge f(n)$ guarantees perfect classification, i.e. zero training error (and not a small, positive training error)?

To be even more specific, let's consider logistic regression, where, given $\{(x_1,y_1), \dots, (x_n,y_n)\} \subset \mathbb{R}^p \times \{0,1\}$, one assumes $$y_i \mid x_i \sim \mathrm{Ber}(h_{\theta}(x_i)), \quad h_{\theta}(x_i):= \sigma(\theta^{T}x_i), \quad \sigma(z):= \frac{1}{1+e^{-z}},$$
and then finds the optimal parameter $\theta^{*}$ of the model by
$$\theta^{*}:= \operatorname{arg\,max}_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n}\bigl(y_i\ln(h_{\theta}(x_i)) + (1-y_i)\ln(1 - h_{\theta}(x_i))\bigr).$$
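In code, the objective above looks roughly like this (again just my own sketch, with plain gradient ascent standing in for whatever solver one would actually use; the step size and iteration count are arbitrary):

```python
# Sketch of the logistic-regression log-likelihood and a plain
# gradient-ascent fit (illustrative only).
import numpy as np

def log_likelihood(theta, X, y):
    """H(theta) = sum_i y_i*ln(sigma(theta^T x_i)) + (1-y_i)*ln(1-sigma(theta^T x_i))."""
    z = X @ theta
    # log(sigma(z)) = -log(1+exp(-z)); log(1-sigma(z)) = -log(1+exp(z))
    return np.sum(y * (-np.logaddexp(0.0, -z)) + (1 - y) * (-np.logaddexp(0.0, z)))

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Maximize H(theta) by gradient ascent; the gradient is X^T (y - sigma(X theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * X.T @ (y - p_hat)
    return theta
```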

Is there a guarantee, just as in linear regression, that when $p \ge f(n)$ for a certain positive-integer-valued function $f$, the training error is always zero, i.e. ${\theta^{*}}^{T}x_i>0$ when $y_i =1$ and ${\theta^{*}}^{T}x_i<0$ when $y_i =0$, irrespective of the distribution of the $x_i$?

P.S. I understand that when $p$ is large enough, perhaps just $p=n+1$, there exists $\theta_1\in \mathbb{R}^p$ so that ${\theta_1}^{T}x_i>0$ when $y_i =1$ and ${\theta_1}^{T}x_i<0$ when $y_i =0$, but why does the same have to be true for $\theta^{*}$?
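To make the P.S. concrete (under the assumption that the $x_i$ are linearly independent, which generically holds once $p \ge n$): one can take $\theta_1$ to be any solution of the $n$ linear equations
$$\theta_1^{T}x_i = 2y_i-1,\qquad i=1,\dots,n,$$
which exists because the system has full row rank; then $\theta_1^{T}x_i = 1>0$ when $y_i=1$ and $\theta_1^{T}x_i=-1<0$ when $y_i=0$.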

The same question goes for other types of regression problems. I know my question is broad, so any links that go over the mathematical details would be greatly appreciated!

Best Answer

$\newcommand\th\theta\newcommand\R{\mathbb R}$In your logistic regression model, there is no function $f$ such that the condition $p\ge f(n)$ guarantees zero training error.

Indeed, let us say that a point $x_i$ in your data is red if $y_i=1$ and blue if $y_i=0$. Let us say that $\th\in\R^p$ separates the red and blue points (that is, has zero training error) if $\th^Tx_i>0$ whenever $x_i$ is red and $\th^Tx_i<0$ whenever $x_i$ is blue.

Then, for any natural $p$, zero training error cannot be attained by any $\th$ if, e.g., (i) one of the $x_i$'s is $0$, or (ii) there are a red point and a blue point of the form $u$ and $au$ for some real $a\ge0$ and some $u\in\R^p$, or (iii) there are two red points $u$ and $v$ and a blue point of the form $au+bv$ for some real $a,b\ge0$.
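Indeed, in case (iii), for instance, any $\th$ with $\th^Tu>0$ and $\th^Tv>0$ satisfies
$$\th^T(au+bv)=a\,\th^Tu+b\,\th^Tv\ge0,$$
so the blue point $au+bv$ cannot have $\th^T(au+bv)<0$; cases (i) and (ii) are handled similarly.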

On the other hand, if the training data $\{(x_1,y_1),\dots,(x_n,y_n)\}$ admits some $\th_*\in\R^p$ that separates the red and blue points (that is, has zero training error), then your formula $$\th^*:= \operatorname{arg\,max}_{\th\in\R^p}\sum_{i=1}^n\bigl(y_i\ln h_{\th}(x_i)+(1-y_i)\ln(1-h_{\th}(x_i))\bigr)$$ makes no sense, because then the supremum of $$H(\th):=\sum_{i=1}^n\bigl(y_i\ln h_{\th}(x_i)+(1-y_i)\ln(1-h_{\th}(x_i))\bigr)$$ over all $\th\in\R^p$ is not attained. Rather, this supremum (equal to $0$) is "attained" only in the limit $\th=t\th_*$, $t\to\infty$, where, as above, $\th_*\in\R^p$ separates the red and blue points (that is, has zero training error).
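A quick numerical illustration of this last point (my own sketch; random data with $p\ge n$, so a separating $\th_*$ generically exists and can be found by solving $\th_*^Tx_i=2y_i-1$ via the pseudoinverse): along the ray $t\th_*$, the log-likelihood $H$ climbs toward its supremum $0$ but stays strictly negative at every finite $t$.

```python
# For linearly separable data, H(t * theta_star) increases toward 0 as t
# grows, so the supremum of H is not attained at any finite theta.
# (Sketch only; random data, p >= n so the data are generically separable.)
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 12
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)

# A separating direction: solve theta^T x_i = 2*y_i - 1 via the pseudoinverse.
theta_star = np.linalg.pinv(X) @ (2 * y - 1)

def H(theta):
    z = X @ theta
    return np.sum(y * (-np.logaddexp(0.0, -z)) + (1 - y) * (-np.logaddexp(0.0, z)))

for t in [1, 5, 10, 30]:
    print(t, H(t * theta_star))   # climbs toward 0 but stays strictly negative
```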