You ask "isn't assuming that $X$ is random similar to assuming that there is measurement error in $X$", in the sense that it biases $\hat{\beta}$?
The answer is no. Here's a little R simulation you can play with to help convince yourself:
get_df <- function(n_obs=10^3, true_beta=c(5, -1, 10, 5)) {
  stopifnot(length(true_beta) == 4)  # Coefficients on constant, x1, x2, x3
  df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs), constant=1)
  df$x3 <- 2*df$x1 + df$x2 + rnorm(n_obs)  # x3 is correlated with x1 and x2
  df$y <- as.matrix(df[, c("constant", "x1", "x2", "x3")]) %*% true_beta + df$epsilon
  df$x1_noisy <- df$x1 + rnorm(n_obs, sd=5)  # x1 observed with measurement error
  return(df)
}
get_beta_hat <- function(df, formula=y ~ 1 + x1 + x2 + x3) {
  fit <- lm(formula, data=df)
  return(coefficients(fit))
}
set.seed(543299)
beta_hat <- t(replicate(1000, get_beta_hat(get_df())))
colMeans(beta_hat) # Very close to true values of (5, -1, 10, 5), even with stochastic X
apply(beta_hat, MARGIN=2, FUN=sd)
beta_hat_noisy_x1 <- t(replicate(1000, get_beta_hat(get_df(), y ~ 1 + x1_noisy + x2 + x3)))
colMeans(beta_hat_noisy_x1) # I got (5, -0.01, 10.4, 4.6)
apply(beta_hat_noisy_x1, MARGIN=2, FUN=sd)
The code simulates data where $$Y \equiv X\,\beta + \epsilon,$$ with $X$ and $\epsilon$ (and therefore $Y$) random. Since $$\mathbb{E}\left[\epsilon \mid X\right] = 0,$$ we have $\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta$, i.e. our estimates of $\beta$ are unbiased conditional on any realization of $X$. That implies that they are unbiased unconditionally: $$\mathbb{E}\left[\hat{\beta}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\beta} \mid X\right]\right] = \beta.$$
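Spelled out with the usual OLS algebra, $\hat{\beta} = (X^\top X)^{-1} X^\top Y = \beta + (X^\top X)^{-1} X^\top \epsilon$, so $$\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta + (X^\top X)^{-1} X^\top\, \mathbb{E}\left[\epsilon \mid X\right] = \beta.$$ Randomness in $X$ shows up in the variance of $\hat{\beta}$, not in its expectation.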
The issue with measurement error -- the reason it's different from randomness in $X$ -- is that measurement error does not enter the definition of $Y$. In other words, if you regress $Y$ on noisy $X$, you are fitting the wrong model; the true model regresses $Y$ on non-noisy (but possibly random) $X$.
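To see the contrast in the simplest case (one regressor, classical errors-in-variables with noise $u$ independent of everything else): if the truth is $Y = \beta_0 + \beta_1 X + \epsilon$ but you regress $Y$ on $X^* = X + u$, then $$\operatorname{plim}\,\hat{\beta}_1 = \beta_1\,\frac{\operatorname{Var}(X)}{\operatorname{Var}(X) + \operatorname{Var}(u)},$$ i.e. the coefficient is attenuated toward zero. With several correlated regressors, as in the simulation, the other coefficients get distorted as well, which is exactly what the x1_noisy run above shows.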
I'm not very well versed in Python, but if I interpret your code correctly then X.T.dot(gradient) takes the gradient variable you computed on the previous line and premultiplies it by $X^T$. That doesn't seem right, since gradient includes the "L2-gradient", which shouldn't be multiplied by $X^T$ at all. Only the residual (y-self.output(X)) should be in that product; you want to add the L2-gradient afterwards, and then multiply the result by eta.
Also, the L2-gradient shouldn't sum over the $w$'s (remember that the gradient is vector-valued, since you're differentiating a scalar-valued function with respect to a vector). Together these errors probably explain the confusing results you're getting, although with a gradient that wrong I would have expected the output to be outright garbage rather than the OLS solution, so there may be something subtle going on that I'm not seeing.
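To make the order of operations concrete, here is a minimal sketch of one update step, assuming a ridge-style cost of roughly $\frac{1}{2}\lVert y - Xw\rVert^2 + \frac{\lambda}{2}\lVert w\rVert^2$ (adjust the constants to match your setup); the names w, eta and lam are illustrative, not taken from your class:

import numpy as np

def ridge_gd_step(w, X, y, eta, lam):
    # w: 1-D float array of parameters, X: (n, p) design matrix, y: (n,) targets
    residual = y - X.dot(w)          # shape (n,): only this part gets premultiplied by X^T
    grad_loss = -X.T.dot(residual)   # shape (p,): gradient of the squared-error term
    grad_l2 = lam * w                # shape (p,): the L2 gradient is a vector, no sum over the w's
    return w - eta * (grad_loss + grad_l2)   # add the two pieces first, then scale by eta

# e.g. start from w = np.zeros(X.shape[1]) and repeat w = ridge_gd_step(w, X, y, 1e-3, 0.1) until convergence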
(Note that your mathematical expressions for the gradient aren't quite right either, but there you omitted the premultiplication of the residuals by $X^T$, and you again have a sum over the $w$'s in the L2-gradient that shouldn't be there.)
Going forward, I would recommend first checking your implementation of each individual part of the cost function and gradient, and satisfying yourself that they are correct, before running gradient descent.
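A cheap way to do that check is to compare your analytic gradient against central finite differences; cost and grad below stand for whatever scalar cost and gradient functions you implement (this is just a sketch):

import numpy as np

def max_gradient_error(cost, grad, w, eps=1e-6):
    # w: 1-D float array of parameters at which to check the gradient
    numeric = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        # central-difference approximation to the i-th partial derivative
        numeric[i] = (cost(w + step) - cost(w - step)) / (2 * eps)
    return np.max(np.abs(numeric - grad(w)))

If the returned discrepancy is not many orders of magnitude smaller than the scale of the gradient itself, the analytic gradient is wrong.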
@Alecos gave a very thorough mathematical answer.
More intuitively, suppose you regress weight on height. You measure weight in pounds and height in inches. Then you are told you should have measured weight in kilos and height in centimeters. Many of the numbers in the regression results will change, but the meaning will stay exactly the same.
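To make that concrete: if the fitted line in the original units is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and you rescale both variables, $y' = a\,y$ and $x' = b\,x$ (here $a \approx 0.454$ for pounds to kilos and $b = 2.54$ for inches to centimeters), refitting gives $$\hat{\beta}_1' = \frac{a}{b}\,\hat{\beta}_1, \qquad \hat{\beta}_0' = a\,\hat{\beta}_0,$$ while $R^2$, the $t$-statistics and $p$-values, and the fitted weights (expressed in the new units) are unchanged. The numbers move only because the units moved.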
So, in answer to your question: No. The fact you stated doesn't mean anything; it is an automatic consequence of what you did. Whether you should standardize your variables is a good question, but the changes you noted give no help in answering it.