Solved – What are the implications of the curse of dimensionality for ordinary least squares linear regression

high-dimensional · least-squares · regression

My understanding is that the curse of dimensionality implies that we need an exponential amount of data with respect to the number of features we include in our model. Is this correct?

If so, what does "we need" mean? Does it imply that we need at least that many data points to ensure we don't make a mistake? To negate the effects of the dimensionality? To ensure we've hit a global optimum? Something else?

Most important question(s) to me:

What specifically are the implications of the curse of dimensionality for ordinary least squares linear regression?

If we are performing an OLS linear regression with $p$ covariates, do we need $2^p$ data points?

I've read about rules of thumb for how many data points you need for OLS regression given the number of covariates in the model, and I know the answer depends heavily on the properties of the data, but I'm trying to get a better understanding of how the curse of dimensionality plays a role in, or affects, this.

Best Answer

Edit: As @Richard Hardy pointed out, the linear model under squared loss and ordinary least squares (OLS) are different things. I revised my answer to discuss the linear regression model only, where we are trying to check whether the curse of dimensionality (CoD) is present when solving the following optimization problem: $$ \min_\beta \|X\beta-y\|_2^2. $$
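As a concrete illustration of this optimization problem, here is a minimal sketch in numpy (the dimensions and "true" coefficients below are made up for the example):

```python
import numpy as np

# Hypothetical small example: n = 100 data points, p = 5 features.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)               # assumed "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)    # noisy linear response

# Solve min_beta ||X beta - y||_2^2 via least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to beta_true
```

Note that the model has only $p = 5$ parameters here, regardless of how many rows $n$ we collect.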

In most cases, the linear regression model will not suffer from the CoD. This is because the number of parameters in OLS does NOT grow exponentially with the number of features / independent variables / columns (unless we include all "interaction" terms for all features, as mentioned in a comment).

Suppose we have a data matrix $X$ that is $n \times p$, i.e., we have $n$ data points and $p$ features. In a "machine learning context" it is possible for $n$ to be on the scale of millions and $p$ on the scale of thousands to millions. The linear model even works for $p \gg n$ once we add regularization.
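To see why regularization helps when $p \gg n$: the OLS normal equations are underdetermined, but adding a ridge penalty $\lambda\|\beta\|_2^2$ makes $X^\top X + \lambda I$ invertible, so the solution is unique. A sketch with hypothetical dimensions:

```python
import numpy as np

# Sketch: p >> n, so plain OLS has infinitely many solutions,
# but ridge regularization yields a unique one.
rng = np.random.default_rng(1)
n, p, lam = 20, 200, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ridge estimator: beta = (X^T X + lam I)^{-1} X^T y.
# X^T X alone is 200x200 with rank at most 20, hence singular;
# adding lam * I makes the system solvable.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)  # (200,)
```

The same estimator can be computed through the $n \times n$ "dual" system $X^\top (XX^\top + \lambda I)^{-1} y$, which is cheaper when $p \gg n$.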

To summarize

  • For the linear model, the number of parameters is the same as the number of features (assuming no intercept).

  • The CoD will happen when we have the number of parameters growing exponentially with the number of features. Here is an example: let us assume we have $p$ discrete (binary) random variables. The joint distribution table has $2^p$ rows. In this case, CoD will happen.
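The contrast between the two bullet points can be made concrete by comparing how the parameter counts grow with $p$:

```python
# Linear model: one coefficient per feature (no intercept), so growth is linear.
# Full joint distribution over p binary variables: 2**p table entries, so
# growth is exponential -- this is where the CoD bites.
for p in [5, 10, 20]:
    linear_params = p
    joint_table_rows = 2 ** p
    print(f"p={p:2d}  linear model: {linear_params:2d} params   "
          f"joint table: {joint_table_rows} rows")
```

Already at $p = 20$ the joint table has over a million rows, while the linear model still has only 20 parameters.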
