The wikipedia definition is a fine definition that you can use for your paper if you need one but I think you're missing something.
The $\epsilon$ is random error, which is synonymous with noise. In practice, the random error can be Gaussian distributed, in which case it is Gaussian noise, but it could take on other distributions. If the distribution of $\epsilon$ happens to be Gaussian then you've met one of the theoretical assumptions of the model and things like interval estimation are better justified. If it's not Gaussian then, like Glen_b said, you still have that it's best linear unbiased.
Theoretically, the random error (noise) is supposed to be Gaussian distributed but the outcome could be anything. So, in order to answer your question you'd need to state whether you want to know the distribution of your particular noise or what the distribution of the noise should be. For the former you'd need data.
do I ALWAYS get a tighter confidence interval if I include more variables in my model?
Yes, you do (EDIT: ...basically. Subject to some caveats. See comments below). Here's why: adding more variables reduces the SSE and thereby the variance of the model, on which your confidence and prediction intervals depend. This even happens (to a lesser extent) when the variables you are adding are completely independent of the response:
a=rnorm(100)
b=rnorm(100)
c=rnorm(100)
d=rnorm(100)
e=rnorm(100)
summary(lm(a~b))$sigma # 0.9634881
summary(lm(a~b+c))$sigma # 0.961776
summary(lm(a~b+c+d))$sigma # 0.9640104 (Went up a smidgen)
summary(lm(a~b+c+d+e))$sigma # 0.9588491 (and down we go again...)
But this does not mean you have a better model. In fact, this is how overfitting happens.
Consider the following example: let's say we draw a sample from a quadratic function with noise added.
A first order model will fit this poorly and have very high bias.
A second order model fits well, which is not surprising since this is how the data was generated in the first place.
But let's say we don't know that's how the data was generated, so we fit increasingly higher order models.
As the complexity of the model increases, we're able to capture more of the fluctuations in the data, effectively fitting our model to the noise, to patterns that aren't really there. With enough complexity, we can build a model that will go through each point in our data nearly exactly.
As you can see, as the order of the model increases, so does the fit. We can see this quantitatively by plotting the training error:
But if we draw more points from our generating function, we will observe the test error diverges rapidly.
The moral of the story is to be wary of overfitting your model. Don't just rely on metrics like adjusted-R2, consider validating your model against held out data (a "test" set) or evaluating your model using techniques like cross validation.
For posterity, here's the code for this tutorial:
set.seed(123)
xv = seq(-5,15,length.out=1e4)
X=sample(xv,20)
gen=function(v){v^2 + 7*rnorm(length(v))}
Y=gen(X)
df = data.frame(x=X,y=Y)
plot(X,Y)
lines(xv,xv^2, col="blue") # true model
legend("topleft", "True Model", lty=1, col="blue")
build_formula = function(N){
paste('y~poly(x,',N,',raw=T)')
}
deg=c(1,2,10,20)
formulas = sapply(deg[2:4], build_formula)
formulas = c('y~x', formulas)
pred = lapply(formulas
,function(f){
predict(
lm(f, data=df)
,newdata=list(x=xv))
})
# Progressively add fit lines to the plot
lapply(1:length(pred), function(i){
plot(df, main=paste(deg[i],"-Degree"))
lapply(1:i,function(n){
lines(xv,pred[[n]], col=n)
})
})
# let's actually generate models from poly 1:20 to calculate MSE
deg=seq(1,20)
formulas = sapply(deg, build_formula)
pred.train = lapply(formulas
,function(f){
predict(
lm(f, data=df)
,newdata=list(x=df$x))
})
pred.test = lapply(formulas
,function(f){
predict(
lm(f, data=df)
,newdata=list(x=xv))
})
rmse.train = unlist(lapply(pred.train,function(P){
regr.eval(df$y,P, stats="rmse")
}))
yv=gen(xv)
rmse.test = unlist(lapply(pred.test,function(P){
regr.eval(yv,P, stats="rmse")
}))
plot(rmse.train, type='l', col='blue'
, main="Training Error"
,xlab="Model Complexity")
plot(rmse.test, type='l', col='red'
, main="Train vs. Test Error"
,xlab="Model Complexity")
lines(rmse.train, type='l', col='blue')
legend("topleft", c("Test","Train"), lty=c(1,1), col=c("red","blue"))
Best Answer
If I understand your question correctly, the issue is that X1, X2, and X3 are all highly correlated. That's a problem with multicollinearity among your predictors rather than non-independence in your data (grouping).
There are a number of solutions for this. The simplest solution is to drop redundant variables, if you're okay with that. If X1, X2, and X3 are all highly correlated, then a model that just includes X1 and X4 might be fine. If for some reason you don't want to drop any variables, you can use principal components analysis to separate them into orthogonal components, or use another type of model that handles multicollinearity well like ridge regression. Here's a relevant answer with some useful links: https://stats.stackexchange.com/a/124232/131407