You ask "isn't assuming that $X$ is random similar to assuming that there is measurement error in $X$", in the sense that it biases $\hat{\beta}$?
The answer is no. Here's a little R simulation you can play with to help convince yourself:
get_df <- function(n_obs=10^3, true_beta=c(5, -1, 10, 5)) {
  stopifnot(length(true_beta) == 4)  # Coefficients on constant, x1, x2, x3
  df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs), constant=1)
  df$x3 <- 2*df$x1 + df$x2 + rnorm(n_obs)  # x3 is correlated with x1 and x2
  df$y <- as.matrix(df[, c("constant", "x1", "x2", "x3")]) %*% true_beta + df$epsilon
  df$x1_noisy <- df$x1 + rnorm(n_obs, sd=5)  # x1 observed with measurement error
  return(df)
}
get_beta_hat <- function(df, formula=y ~ 1 + x1 + x2 + x3) {
  fit <- lm(formula, data=df)
  return(coefficients(fit))
}
set.seed(543299)
beta_hat <- t(replicate(1000, get_beta_hat(get_df())))
colMeans(beta_hat) # Very close to true values of (5, -1, 10, 5), even with stochastic X
apply(beta_hat, MARGIN=2, FUN=sd)
beta_hat_noisy_x1 <- t(replicate(1000, get_beta_hat(get_df(), y ~ 1 + x1_noisy + x2 + x3)))
colMeans(beta_hat_noisy_x1) # I got (5, -0.01, 10.4, 4.6)
apply(beta_hat_noisy_x1, MARGIN=2, FUN=sd)
The code simulates data where $$Y \equiv X\,\beta + \epsilon,$$ with $X$ and $\epsilon$ (and therefore $Y$) random. Since $$\mathbb{E}\left[\epsilon \mid X\right] = 0,$$ we have $\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta$, i.e. our estimates of $\beta$ are unbiased conditional on any realization of $X$. That implies that they are unbiased unconditionally: $$\mathbb{E}\left[\hat{\beta}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\beta} \mid X\right]\right] = \beta.$$
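Spelled out with the usual OLS algebra, $\hat{\beta} = (X^\top X)^{-1} X^\top Y = \beta + (X^\top X)^{-1} X^\top \epsilon$, so $$\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta + (X^\top X)^{-1} X^\top\, \mathbb{E}\left[\epsilon \mid X\right] = \beta.$$ Randomness in $X$ shows up in the variance of $\hat{\beta}$, not in its expectation.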
The issue with measurement error -- the reason it's different from randomness in $X$ -- is that measurement error does not enter the definition of $Y$. In other words, if you regress $Y$ on noisy $X$, you are fitting the wrong model; the true model regresses $Y$ on non-noisy (but possibly random) $X$.
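To see the contrast in the simplest case (one regressor, classical errors-in-variables with noise $u$ independent of everything else): if the truth is $Y = \beta_0 + \beta_1 X + \epsilon$ but you regress $Y$ on $X^* = X + u$, then $$\operatorname{plim}\,\hat{\beta}_1 = \beta_1\,\frac{\operatorname{Var}(X)}{\operatorname{Var}(X) + \operatorname{Var}(u)},$$ i.e. the coefficient is attenuated toward zero. With several correlated regressors, as in the simulation, the other coefficients get distorted as well, which is exactly what the x1_noisy run above shows.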
I'm not very well versed in Python, but if I interpret your code correctly then X.T.dot(gradient) takes the gradient variable you computed on the previous line and premultiplies it by $X^T$. That doesn't seem right, since gradient includes the "L2-gradient", which shouldn't be multiplied by $X^T$ at all. Only the residual (y-self.output(X)) should be in that product; you want to add the L2-gradient afterwards, and then multiply the result by eta.
Also, the L2-gradient shouldn't sum over the $w$'s (remember that the gradient is vector-valued, since you're differentiating a scalar-valued function with respect to a vector). Together these errors probably explain the confusing results you're getting, although with a gradient that wrong I would have expected the output to be outright garbage rather than the OLS solution, so there may be something subtle going on that I'm not seeing.
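To make the order of operations concrete, here is a minimal sketch of one update step, assuming a ridge-style cost of roughly $\frac{1}{2}\lVert y - Xw\rVert^2 + \frac{\lambda}{2}\lVert w\rVert^2$ (adjust the constants to match your setup); the names w, eta and lam are illustrative, not taken from your class:

import numpy as np

def ridge_gd_step(w, X, y, eta, lam):
    # w: 1-D float array of parameters, X: (n, p) design matrix, y: (n,) targets
    residual = y - X.dot(w)          # shape (n,): only this part gets premultiplied by X^T
    grad_loss = -X.T.dot(residual)   # shape (p,): gradient of the squared-error term
    grad_l2 = lam * w                # shape (p,): the L2 gradient is a vector, no sum over the w's
    return w - eta * (grad_loss + grad_l2)   # add the two pieces first, then scale by eta

# e.g. start from w = np.zeros(X.shape[1]) and repeat w = ridge_gd_step(w, X, y, 1e-3, 0.1) until convergence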
(Note that your mathematical expressions for the gradient aren't quite right either, but there you omitted the premultiplication of the residuals by $X^T$, and you again have a sum over the $w$'s in the L2-gradient that shouldn't be there.)
Going forward, I would recommend first checking your implementation of each individual part of the cost function and gradient, and satisfying yourself that they are correct, before running gradient descent.
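A cheap way to do that check is to compare your analytic gradient against central finite differences; cost and grad below stand for whatever scalar cost and gradient functions you implement (this is just a sketch):

import numpy as np

def max_gradient_error(cost, grad, w, eps=1e-6):
    # w: 1-D float array of parameters at which to check the gradient
    numeric = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        # central-difference approximation to the i-th partial derivative
        numeric[i] = (cost(w + step) - cost(w - step)) / (2 * eps)
    return np.max(np.abs(numeric - grad(w)))

If the returned discrepancy is not many orders of magnitude smaller than the scale of the gradient itself, the analytic gradient is wrong.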
@Alecos gave a very thorough mathematical answer.
More intuitively, suppose you regress weight on height. You measure weight in pounds and height in inches. Then you are told you should have measured weight in kilos and height in centimeters. Many of the numbers in the regression results will change, but the meaning will stay exactly the same.
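To make that concrete: if the fitted line in the original units is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and you rescale both variables, $y' = a\,y$ and $x' = b\,x$ (here $a \approx 0.454$ for pounds to kilos and $b = 2.54$ for inches to centimeters), refitting gives $$\hat{\beta}_1' = \frac{a}{b}\,\hat{\beta}_1, \qquad \hat{\beta}_0' = a\,\hat{\beta}_0,$$ while $R^2$, the $t$-statistics and $p$-values, and the fitted weights (expressed in the new units) are unchanged. The numbers move only because the units moved.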
So, in answer to your question: No. The fact you stated doesn't mean anything; it is an automatic consequence of what you did. Whether you should standardize your variables is a good question, but the changes you noted give no help in answering it.