Solved – Bias parameter in machine learning linear regression

machine-learning, regression

I am studying a linear regression example for machine learning. It makes the following definition:

As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\mathbf{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. The output of linear regression is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be

$$\hat{y} = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} \in \mathbb{R}^n$ is a vector of parameters.

Parameters are values that control the behaviour of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $\mathbf{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the value of that feature increases the value of our prediction $\hat{y}$.

It then says the following:

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter — an intercept term $b$. In this model

$$\hat{y} = \mathbf{w}^T \mathbf{x} + b,$$

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\mathbf{x}$ with an extra entry that is always set to $1$. The weight corresponding to the extra $1$ entry plays the role of the bias parameter.

The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm’s expected estimate of a quantity is not equal to the true quantity.
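The augmentation trick in the quoted passage is easy to check numerically. A minimal sketch in NumPy (the vectors and values are my own illustrative choices):

```python
import numpy as np

# Affine model: y_hat = w^T x + b
w = np.array([2.0, -1.0, 0.5])
b = 4.0
x = np.array([1.0, 3.0, 2.0])

affine = w @ x + b

# Equivalent linear model: augment x with a constant 1 entry
# and fold b into the weight vector as its last component.
w_aug = np.append(w, b)    # weights [2, -1, 0.5, 4]
x_aug = np.append(x, 1.0)  # features [1, 3, 2, 1]
linear = w_aug @ x_aug

print(affine, linear)  # 4.0 4.0
```

The weight attached to the constant $1$ entry is exactly the bias $b$, so the two formulations always agree.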

This is the part that I am interested in:

This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input.

Can someone please elaborate on this? How is the transformation biased towards being $b$ "in the absence of any input"?

Thank you.

Best Answer

That seems like really confusing terminology, but what it means is that, irrespective of the input $x$, the model's predictions are shifted by the constant $b$. If $x = 0$ for all observations, the output of the regression would be exactly $b$ in each case.
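To make this concrete, here is a small sketch in NumPy (the weights and bias are illustrative choices of mine): feeding the zero vector into an affine model returns exactly $b$.

```python
import numpy as np

w = np.array([3.0, -2.0])
b = 100.0

def predict(x):
    # Affine model: prediction is w^T x + b
    return w @ x + b

# With no input signal (x = 0), the prediction is biased toward b.
print(predict(np.zeros(2)))  # 100.0
```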

Bias here refers to a global offset not explained by the predictor variable. Consider the equation of a line:

$$ y = mx + c $$

Here $m$ is the slope and $c$ is the intercept. If we omit the constant intercept $c$, then $m$, in addition to explaining the relationship between $x$ and $y$, must also account for the overall difference in scale, irrespective of the value of $x$.

To demonstrate, if we fit a really simple linear model in R with a constant difference between the variables (a difference in scale), then ignoring the intercept causes us to incorrectly estimate the relationship between $x$ and $y$ (the slope). Note that `rnorm` is not seeded, so the no-intercept slope below will vary from run to run; the point is that it lands nowhere near the true slope of 3.

x <- rnorm(100)
y <- (3*x) + 100
lm(y ~ x)
#> 
#> Call:
#> lm(formula = y ~ x)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         100            3
lm(y ~ 0 + x)
#> 
#> Call:
#> lm(formula = y ~ 0 + x)
#> 
#> Coefficients:
#>      x  
#> -5.505
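The same effect can be reproduced outside R. A sketch using the closed-form least-squares estimates (the $x$ values are my own deterministic choice, so the numbers are reproducible):

```python
import numpy as np

# Deliberately uncentered x with a constant offset of 100 in y.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 5.0])
y = 3.0 * x + 100.0

# Least squares with an intercept column: recovers slope 3, intercept 100.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

# Least squares through the origin: slope = sum(x*y) / sum(x*x),
# which must also soak up the constant offset of 100.
slope_no_intercept = (x @ y) / (x @ x)

print(slope, intercept, slope_no_intercept)
```

With the intercept the slope is estimated correctly; without it, the slope is badly distorted because it alone must absorb the offset, just as in the R output above.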