Solved – What are the implications of the curse of dimensionality for ordinary least squares linear regression

high-dimensional · least-squares · regression

My understanding is that the curse of dimensionality implies that we need an exponential amount of data with respect to the number of features we include in our model. Is this correct?

If so, what does "we need" mean? Does it imply that we need at least that many data points to ensure we don't make a mistake? To negate the effects of the dimensionality? To ensure we've hit a global optimum? Something else?

Most important question(s) to me:

What specifically are the implications of the curse of dimensionality for ordinary least squares linear regression?

If we are performing an OLS linear regression with $p$ covariates, do we need $2^p$ data points?

I've read about rules of thumb for how many data points you need for OLS regression given the number of covariates in the model, and I know the answer depends heavily on the properties of the data, but I'm trying to get a better understanding of how the curse of dimensionality plays a role in, or affects, this.

Best Answer

Edit: As @Richard Hardy pointed out, the linear model under squared loss and ordinary least squares (OLS) are different things. I revised my answer to discuss the linear regression model only, where we are trying to check whether the curse of dimensionality (CoD) is present when solving the following optimization problem: $$ \min_\beta \|X\beta-y\|_2^2. $$
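As a concrete illustration of this optimization problem, here is a minimal sketch in numpy (the dimensions and "true" coefficients below are made up for the example):

```python
import numpy as np

# Hypothetical small example: n = 100 data points, p = 5 features.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)               # assumed "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)    # noisy linear response

# Solve min_beta ||X beta - y||_2^2 via least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to beta_true
```

Note that the model has only $p = 5$ parameters here, regardless of how many rows $n$ we collect.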

In most cases, the linear regression model will not suffer from the CoD. This is because the number of parameters in OLS does NOT grow exponentially with the number of features / independent variables / columns (unless we include all "interaction" terms for all features, as mentioned in a comment).

Suppose we have a data matrix $X$ that is $n \times p$, i.e., we have $n$ data points and $p$ features. In a "machine learning context" it is possible for $n$ to be on the scale of millions and $p$ on the scale of thousands to millions. The linear model even works for $p \gg n$ once we add regularization.
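To see why regularization helps when $p \gg n$: the OLS normal equations are underdetermined, but adding a ridge penalty $\lambda\|\beta\|_2^2$ makes $X^\top X + \lambda I$ invertible, so the solution is unique. A sketch with hypothetical dimensions:

```python
import numpy as np

# Sketch: p >> n, so plain OLS has infinitely many solutions,
# but ridge regularization yields a unique one.
rng = np.random.default_rng(1)
n, p, lam = 20, 200, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ridge estimator: beta = (X^T X + lam I)^{-1} X^T y.
# X^T X alone is 200x200 with rank at most 20, hence singular;
# adding lam * I makes the system solvable.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)  # (200,)
```

The same estimator can be computed through the $n \times n$ "dual" system $X^\top (XX^\top + \lambda I)^{-1} y$, which is cheaper when $p \gg n$.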

To summarize

  • For the linear model, the number of parameters is the same as the number of features (assuming no intercept).

  • The CoD will happen when we have the number of parameters growing exponentially with the number of features. Here is an example: let us assume we have $p$ discrete (binary) random variables. The joint distribution table has $2^p$ rows. In this case, CoD will happen.
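The contrast between the two bullet points can be made concrete by comparing how the parameter counts grow with $p$:

```python
# Linear model: one coefficient per feature (no intercept), so growth is linear.
# Full joint distribution over p binary variables: 2**p table entries, so
# growth is exponential -- this is where the CoD bites.
for p in [5, 10, 20]:
    linear_params = p
    joint_table_rows = 2 ** p
    print(f"p={p:2d}  linear model: {linear_params:2d} params   "
          f"joint table: {joint_table_rows} rows")
```

Already at $p = 20$ the joint table has over a million rows, while the linear model still has only 20 parameters.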
