Scortchi and Peter Flom have both correctly pointed out that you didn't fit the model you specified. However, there's no coefficient on $\sin(x)$ in that model, so if you actually want to fit $y_i = \alpha + \sin(x_i) + \epsilon_i$ you should not regress on $\sin(x)$: in that model it's an offset, not a regressor.
The correct way to specify the model
$$y_i = \alpha + \sin(x_i) + \epsilon_i$$
in R is:
model <- lm(y ~ 1, offset = sin(x), data = df)
which produces the residuals vs. fitted plot:

[plot: residuals vs. fitted values]

or, as a residuals vs. $x$ plot:

[plot: residuals vs. x]
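Since the original data aren't reproduced here, the following is a minimal self-contained sketch: it simulates data consistent with the model (the names df, x, y and the simulation settings are assumptions, not the questioner's data), fits the offset model, and draws both diagnostic plots.

## simulate from y = alpha + sin(x) + e, with alpha = 0
set.seed(1)
n <- 1000
df <- data.frame(x = runif(n, 0, 2 * pi))
df$y <- 0 + sin(df$x) + rnorm(n)

model <- lm(y ~ 1, offset = sin(x), data = df)

## the two diagnostic plots discussed above
plot(fitted(model), residuals(model), xlab = "fitted values", ylab = "residuals")
plot(df$x, residuals(model), xlab = "x", ylab = "residuals")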
Alternatively, one could fit
model2 <- lm(y - sin(x) ~ 1, data = df)
which gives the same estimate of $\alpha$. The residuals vs. fitted plot is of no use in this case (because the offset was brought into the model by modifying $y$, every fitted value is simply $\hat{\alpha}$), but the residuals vs. $x$ plot is identical to the second plot above.
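Continuing with the simulated df from the sketch above (so this is illustrative output rather than the questioner's), both claims are easy to verify:

all.equal(coef(model), coef(model2))  # TRUE: the same alpha-hat
length(unique(fitted(model2)))        # 1: every fitted value is alpha-hat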
Gung is right to suggest in comments that it often makes sense to fit the offset variable as a regressor anyway (for example, to check that a coefficient of 1 on the offset term is reasonable); this is the model that Scortchi and Peter Flom were discussing in comments.
Here's how you do that:
model3 <- lm(y ~ sin(x), data = df)
If we look at the summary, via summary(model3), we get:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.000717   0.032050   0.022    0.982    
sin(x)      1.069593   0.044947  23.797   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9921 on 998 degrees of freedom
Multiple R-squared:  0.362,    Adjusted R-squared:  0.3614
F-statistic: 566.3 on 1 and 998 DF,  p-value: < 2.2e-16
which has coefficients close to what we'd expect.
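To turn that eyeball check into a formal one, here's a small sketch of a Wald test that the $\sin(x)$ coefficient equals 1, using only numbers from the summary above:

b  <- coef(summary(model3))["sin(x)", "Estimate"]    # 1.069593
se <- coef(summary(model3))["sin(x)", "Std. Error"]  # 0.044947
tstat <- (b - 1) / se                                # about 1.55
2 * pt(-abs(tstat), df.residual(model3))             # two-sided p, about 0.12

so a coefficient of 1 on $\sin(x)$ is quite consistent with these data.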
Finally, you might do this:
model4 <- lm(y ~ sin(x), offset = sin(x), data = df)
but its only effect is to reduce the fitted coefficient of $\sin(x)$ by 1, so we can extract the same information from model3's output.
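A one-line confirmation of that relationship (again on the simulated data, so purely illustrative):

all.equal(coef(model4)[["sin(x)"]] + 1, coef(model3)[["sin(x)"]])  # TRUE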
Transform your dependent variable $\sigma_i^2$ with a logarithm and fit the model
$$\log \sigma_i^2 = \alpha + \beta x_i + \epsilon_i$$
Get an estimate of the variance $\eta^2$ of the residuals as
$$\hat{\eta}^2 = \frac{1}{N}\sum_{i=1}^N \left(\log \sigma_i^2 - \hat{\alpha} - \hat{\beta} x_i\right)^2$$
Finally, use the estimator
$$\hat{\sigma}_i^2 = \exp\left(\hat{\alpha} + \hat{\beta} x_i + \frac{\hat{\eta}^2}{2}\right)$$
The reason there is a $\frac{1}{2}\hat{\eta}^2$ term is that if $\epsilon$ is normally distributed with mean 0 and variance $\eta^2$, then $e^\epsilon$ has expectation $e^{\eta^2/2}$.
Since $\hat{\sigma}_i^2$ is an exponential, it will always be positive.
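A minimal sketch of this recipe in R, on simulated data (the names s2 and x, and the simulation settings, are placeholders rather than anything from the question):

## simulated observed variances with a log-linear mean
set.seed(1)
x  <- runif(200)
s2 <- exp(0.5 + 2 * x + rnorm(200, sd = 0.3))

fit   <- lm(log(s2) ~ x)              # log sigma_i^2 = alpha + beta x_i + eps_i
eta2  <- mean(residuals(fit)^2)       # eta-hat^2, as defined above
s2hat <- exp(fitted(fit) + eta2 / 2)  # bias-corrected and strictly positive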
Best Answer
Then don't fit a model that doesn't obey such an obvious requirement...
... like, you know, OLS.
Or rather, since population variances are usually not $1$, it should probably be approximately $\sigma^2$ times a chi-square -- so why not model it as, say, a Gamma random variable (the distribution of a multiple of a chi-square)?
So why not use a GLM for this problem? All your fitted values are then guaranteed not to go negative. See the example here (though if you fit a straight-line model, predicted values can, and indeed eventually must, still go negative outside the range of the data).
If you fit a model for the mean such that the mean remains positive (a log link, say, rather than the identity link), then out-of-sample predictions will obey the positivity restriction too.
If you're modelling variances, the identity link usually won't make sense anyway. Choose one of the others, and the model's fitted and predicted values will stay positive.
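For concreteness, a sketch of such a Gamma GLM with a log link, on placeholder data (s2 for the observed variances and x for the predictor are assumptions, as is the simulation):

## placeholder data: positive "variances" with a log-linear mean
set.seed(1)
x  <- runif(200)
s2 <- exp(0.5 + 2 * x + rnorm(200, sd = 0.3))

gfit <- glm(s2 ~ x, family = Gamma(link = "log"))
## response-scale predictions are exp(linear predictor), hence positive,
## even far outside the observed range of x
predict(gfit, newdata = data.frame(x = c(-5, 5)), type = "response")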