Solved – What to do when a linear regression gives negative estimates which are not possible

least squares, multiple regression, regression

I am using linear regression to estimate values that in reality are always non-negative. The predictor variables are also non-negative. For instance, regressing salary on years of education and age. All variables in this case are always non-negative.

Due to the negative intercept, my model (determined with OLS) results in some negative predictions (when the value of the predictor variable is low with respect to the range of all values).

This topic has already been covered here, and I am also aware that forcing the intercept to 0 is discouraged, so it seems that I have to accept this model as the one to use. However, my question is about the accepted norms and rules for evaluating such a model.
Are there any particular rules here? Specifically:

If I get a negative estimate can I just round it to 0?
If the observed value is 100, and the predicted value is -300, and I know that the minimum possible value is 0, is the error 400 or 100? For instance, when calculating the ME and RMSE.
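
To make the two candidate conventions concrete, here is a minimal Python sketch that computes ME and RMSE once with the raw predictions and once with predictions clipped at 0. Only the pair (observed 100, predicted -300) comes from the question; the other values and the helper names are made up for illustration. Note that the two sets of numbers describe two different predictors (the raw linear model and the clipped one), so they are not interchangeable.

```python
import math

def mean_error(observed, predicted):
    """Mean error (observed minus predicted), averaged over all cases."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error over all cases."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# First pair is the example from the question; the rest are hypothetical.
observed = [100.0, 250.0, 900.0]
raw_pred = [-300.0, 200.0, 950.0]

# Clipping: replace every negative prediction by the known minimum, 0.
clipped_pred = [max(0.0, p) for p in raw_pred]

me_raw, rmse_raw = mean_error(observed, raw_pred), rmse(observed, raw_pred)
me_clip, rmse_clip = mean_error(observed, clipped_pred), rmse(observed, clipped_pred)

print("error on first case, raw:    ", observed[0] - raw_pred[0])      # 400.0
print("error on first case, clipped:", observed[0] - clipped_pred[0])  # 100.0
print("ME   raw / clipped:", me_raw, me_clip)
print("RMSE raw / clipped:", rmse_raw, rmse_clip)
```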

If it is relevant to the discussion: I have used both simple linear regression and multiple linear regression. Both result in several negative values.

Edit:

Here is the example of the samples with the fit:

The coefficients of the linear regression are 0.0010(x) and -540 (intercept).

Here is what happens when I use log for the X:

Is linear regression suitable here?

Best Answer

You haven't given context, but you have linked to a post that offers one solution. I will assume that that solution is not applicable here.

Then another solution is to not use linear regression (simple or multiple), since it does not solve the problem you have.

First, though, let's use your example of income as a function of age and education. Here, negative predicted values are reasonable because you are probably not interested in the income of newborn babies. In that case, taking log(income) is also reasonable, unless some people in your data set have no income.

But suppose that's not it. Then you can use a regression method that respects bounds on the dependent variable. One such method is beta regression, which requires a DV that is between 0 and 1 - so you could scale your DV to be between 0 and 1 and then use beta regression.
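
The scaling step could look like the following Python sketch. It assumes the DV has known lower and upper bounds (`lower` and `upper` are hypothetical parameters, as are the function names and data). Because beta regression needs values strictly inside (0, 1), the sketch also applies a commonly used "squeeze" adjustment that pulls exact 0s and 1s slightly inward. It only prepares the data; fitting the beta regression itself is left to a statistics package.

```python
def scale_to_open_unit_interval(values, lower, upper):
    """Min-max scale to [0, 1], then squeeze into the open interval (0, 1)."""
    n = len(values)
    scaled = [(v - lower) / (upper - lower) for v in values]
    # Squeeze: (y * (n - 1) + 0.5) / n maps 0 -> 0.5/n and 1 -> 1 - 0.5/n.
    return [(s * (n - 1) + 0.5) / n for s in scaled]

def back_to_original_scale(values01, lower, upper, n):
    """Invert the squeeze and the min-max scaling (n = original sample size)."""
    unsqueezed = [(v * n - 0.5) / (n - 1) for v in values01]
    return [lower + u * (upper - lower) for u in unsqueezed]

# Hypothetical DV bounded between 0 and 1000.
dv = [0.0, 120.0, 540.0, 1000.0]
dv01 = scale_to_open_unit_interval(dv, lower=0.0, upper=1000.0)
restored = back_to_original_scale(dv01, lower=0.0, upper=1000.0, n=len(dv))

print(dv01)
print(restored)
```

Predictions from a model fitted on the scaled DV would be mapped back with the same inverse function.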

But I would really urge you to add your actual variables to the question.

Related Solutions

Solved – getting rid of negative predictions in linear regression

As @Nick Cox points out in a comment, if you want your predicted values to always be positive, you don't want linear regression. If the dependent variable is a count (and maybe even if it is not) you could use Poisson regression or negative binomial regression. If it is bounded, you can transform it to 0-1 and then use beta regression. There are other options too.

Or it might be that you want to transform your dependent variable. If your DV is always positive then you can take the log. Then the predicted values, back-transformed to the raw scale, would never be negative.
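
A minimal Python sketch of the log-transform idea, using made-up strictly positive data and a hand-written simple OLS helper (`fit_simple_ols` is a name chosen here, not a library function). It fits a straight line to the raw DV and to log(DV), then back-transforms the second fit with `exp`, which cannot produce a negative value.

```python
import math

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: strictly positive DV that grows faster than linearly.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.0, 5.0, 9.0, 20.0, 60.0, 150.0, 400.0, 1000.0]

# Straight line on the raw scale: the intercept is negative here.
a, b = fit_simple_ols(x, y)
linear_pred = [a + b * xi for xi in x]

# Straight line on the log scale, back-transformed with exp().
log_a, log_b = fit_simple_ols(x, [math.log(yi) for yi in y])
log_pred = [math.exp(log_a + log_b * xi) for xi in x]

print("smallest raw-scale prediction: ", min(linear_pred))
print("smallest back-transformed pred:", min(log_pred))
```

One caveat: the back-transformed value estimates something closer to the median than the mean of the DV, so the two fits are not predicting exactly the same quantity.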

Regression Models – Identifying Inaccuracies in OLS Linear Regression Parameters

Initially, before the massive edits, your question was asking about the definition of bias. Quoting my other answer:

Let $X_1,\dots,X_n$ be your sample of independent and identically distributed random variables from distribution $F$. You are interested in estimating an unknown but fixed quantity $\theta$, using an estimator $g$ that is a function of $X_1,\dots,X_n$. Since $g$ is a function of random variables, the estimate

$$ \hat\theta_n = g(X_1,\dots,X_n)$$

is also a random variable. We define bias as

$$ \mathrm{bias}(\hat\theta_n) = \mathbb{E}_\theta(\hat\theta_n) - \theta $$

The estimator is unbiased when $\mathbb{E}_\theta(\hat\theta_n) = \theta$.
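
The definition can be checked numerically by approximating the expectation with an average over many simulated samples. The Python sketch below does this for an example chosen here (not taken from the question): two estimators of the variance of a normal distribution, one dividing by $n$ and one by $n-1$. The sample size, true variance, seed and number of replications are arbitrary assumptions.

```python
import random

def variance_estimate(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((v - mean) ** 2 for v in sample) / (n - ddof)

random.seed(0)
true_var = 4.0          # theta: variance of N(0, sd=2)
n, reps = 5, 20000      # small samples make the bias easy to see

est_div_n, est_div_n_minus_1 = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    est_div_n.append(variance_estimate(sample, ddof=0))
    est_div_n_minus_1.append(variance_estimate(sample, ddof=1))

# bias = E(theta_hat) - theta, with E approximated by the Monte Carlo mean
bias_div_n = sum(est_div_n) / reps - true_var
bias_div_n_minus_1 = sum(est_div_n_minus_1) / reps - true_var

print("bias when dividing by n:    ", bias_div_n)          # theory: -true_var / n = -0.8
print("bias when dividing by n - 1:", bias_div_n_minus_1)  # theory: 0
```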

This is the definition of bias in statistics (it is the one meant in the bias-variance tradeoff). As you and others noted, people use the term "bias" for many different things: in machine learning there are sampling bias and bias nodes in neural networks, while outside statistics there are cognitive biases, the bias in electrical engineering that you mentioned, and so on. However, if you are looking for some deeper philosophical connection between those concepts, then I'm afraid that you are looking too far.

Regarding "bias" shown on your examples

TL;DR: The models you compare may not illustrate what you wanted to show and may be misleading. They illustrate omitted-variable bias, rather than some kind of OLS bias in general.

Your first example is a textbook example of a linear regression model

$$ y_i \sim \mathcal{N}(\alpha + \beta x_i, \;\sigma) $$

where $Y$ is a random variable and $X$ is fixed. In your second example you use

$$ x_i \sim \mathcal{N}(z_i, \;\sigma) \\ y_i \sim \mathcal{N}(z_i, \;\sigma) $$

so $X$ and $Y$ are both random variables that are conditionally independent given $Z$. You want to model the relationship between $Y$ and $X$. You seem to expect to see a slope equal to unity, as if $Y$ depended on $X$, which is not true by design of your example. To convince yourself, take a closer look at your model. Below I simulate data similar to yours, with the difference that $Z$ is uniformly distributed, since that seems more realistic to me than using a deterministic variable (it will also make things easier later on), so the model becomes

$$ z_i \sim \mathcal{U}(0, 10) \\ x_i \sim \mathcal{N}(z_i, \;\sigma) \\ y_i \sim \mathcal{N}(z_i, \;\sigma) $$

On the plot below you can see the simulated data. The first panel shows $X$ vs $Z$; the second $Y$ vs $Z$; the third $X$ vs $Y$ with the fitted regression line; and the final panel $X$ vs the residuals from that regression model (a pattern similar to yours). The dependence of $X$ and $Y$ on $Z$ is obvious, while the dependence of $Y$ on $X$ is illusory given the variable $Z$ that they both depend on. We call this omitted-variable bias.

This will be even more clear if we look at the regression results:

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7371 -0.9900  0.0036  0.9293  4.1523 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5842     0.1199   4.872 1.49e-06 ***
x             0.8827     0.0206  42.856  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.393 on 498 degrees of freedom
Multiple R-squared:  0.7867,    Adjusted R-squared:  0.7863 
F-statistic:  1837 on 1 and 498 DF,  p-value: < 2.2e-16

and compare them to the results of the model that includes $Z$:

Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5871 -0.7032 -0.0118  0.6028  3.1817 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03394    0.09146   0.371    0.711    
x           -0.01049    0.04532  -0.232    0.817    
z            1.00824    0.04825  20.895   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.018 on 497 degrees of freedom
Multiple R-squared:  0.8864,    Adjusted R-squared:  0.886 
F-statistic:  1940 on 2 and 497 DF,  p-value: < 2.2e-16

In the first case we see a strong and significant slope for $X$ and $R^2 = 0.79$ (nice!). Notice, however, what happens if we add $Z$ to our model: the slope for $X$ diminishes almost to zero and becomes insignificant, while the slope for $Z$ is large and significant, and $R^2$ increases to $0.89$. This shows us that it was $Z$ that "caused" the relationship between $X$ and $Y$, since controlling for it "takes out" all of $X$'s influence.
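
The comparison of the two fits can be reproduced approximately with the Python sketch below. It is not the code that produced the output above: the seed, the choice $\sigma = 1$ and the small `ols` helper (normal equations solved by Gauss-Jordan elimination) are assumptions made here, so the numbers will differ slightly from the R output while showing the same pattern.

```python
import random

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented system [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
n, sigma = 500, 1.0                                  # sigma = 1 is an assumption
z = [random.uniform(0.0, 10.0) for _ in range(n)]    # z_i ~ U(0, 10)
x = [random.gauss(zi, sigma) for zi in z]            # x_i ~ N(z_i, sigma)
y = [random.gauss(zi, sigma) for zi in z]            # y_i ~ N(z_i, sigma)

coef_x_only = ols([x], y)       # y ~ x
coef_x_and_z = ols([x, z], y)   # y ~ x + z

print("y ~ x     (intercept, x):   ", coef_x_only)
print("y ~ x + z (intercept, x, z):", coef_x_and_z)
```

With $Z$ omitted, the slope for $X$ should come out close to 0.9; with $Z$ included, it should be near zero while the slope for $Z$ is near one.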

Moreover, notice that, intentionally or not, you have chosen parameters for $Z$ that make its influence harder to notice at first sight. If you used, for example, $\mathcal{U}(0,1)$, then the residual pattern would be much more striking.

Basically, similar things will happen no matter what $Z$ is, since the effect is caused by the fact that both $X$ and $Y$ depend on $Z$. Below you can see plots from a similar model, where $Z$ is normally distributed $\mathcal{N}(0,1)$. The $R^2$ increase for this model is from $0.26$ to $0.52$ when controlling for $Z$.
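
The size of these effects can be worked out from the model itself. As a sketch, assume that the noise in $X$ and in $Y$ is independent of $Z$ and of each other, with common variance $\sigma^2$. Then $\mathrm{Cov}(X,Y) = \mathrm{Var}(Z)$ and $\mathrm{Var}(X) = \mathrm{Var}(Y) = \mathrm{Var}(Z) + \sigma^2$, so in the population

$$ \beta_{Y \sim X} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{\mathrm{Var}(Z)}{\mathrm{Var}(Z)+\sigma^2}, \qquad R^2_{Y \sim X} = \left(\frac{\mathrm{Var}(Z)}{\mathrm{Var}(Z)+\sigma^2}\right)^2, \qquad R^2_{Y \sim X + Z} = \frac{\mathrm{Var}(Z)}{\mathrm{Var}(Z)+\sigma^2} $$

Taking $\sigma = 1$ (an assumption, though one consistent with the residual standard error reported for the model that includes $Z$), $Z \sim \mathcal{U}(0,10)$ has $\mathrm{Var}(Z) = 100/12 \approx 8.33$, which gives a slope of about $0.89$ and $R^2$ values of about $0.80$ and $0.89$. For $Z \sim \mathcal{N}(0,1)$ the slope is $0.5$ and the $R^2$ values are $0.25$ and $0.5$. Both sets are close to the estimates reported here. The slope for $X$ is below one whenever $\sigma^2 > 0$, even though $X$ has no effect on $Y$ at all.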

In each case $Y$ depended on $Z$ and its relationship with $X$ was illusory, caused by the fact that they both depend on $Z$. This is an important problem in statistics, but it is not caused by any pitfalls of OLS regression, or by our inability to measure bias, but by using a misspecified model that does not consider some important variable.

Coca-Cola adverts do not cause snow to fall and do not make people give each other presents; those things just happen together at Christmas. It would be wrong to model snowfall as predicted by the screenings of Coca-Cola adverts while ignoring the fact that both happen in December.

Sidenote: I guess that what you might have been thinking of is random design regression (or random regression; e.g. Hsu et al., 2011, An analysis of random design linear regression), but I do not think that the example you provided is relevant for discussing it.
