Solved – Question about the error term in a simple linear regression

econometrics, linear model, regression

Suppose we have a linear regression model $Y_{it} = \beta_0 + \beta_1 X_{it} + \epsilon_{it}$. In the literature it is often assumed that $\epsilon_{it} \sim N(0,\sigma^2)$. This assumption makes sense if we have a large data set, due to the central limit theorem. My question is that in certain situations I feel a normally distributed error term is the wrong assumption. Suppose $Y_{it}$ is a bounded variable, such as the age of a person or the exam score of a student. If $\epsilon_{it} \sim N(0,\sigma^2)$ in these situations where $Y_{it}$ is bounded, is it not possible for the error term to force $Y_{it}$ out of its bounds? For example, suppose $Y_{it}$ represents a person's age; if the error term is normally distributed, then a random draw could occur that makes it possible for a person to live, say, 1000 years.
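To make the concern concrete, here is a minimal simulation sketch. The coefficients, the error standard deviation, and the [0, 100] score range are all illustrative assumptions, not taken from any real study; the point is only that normal errors place positive probability outside any fixed bounds.

```python
# Minimal sketch: an "exam score" nominally bounded in [0, 100], generated
# with normally distributed errors. All numbers (beta_0 = 40, beta_1 = 5,
# sigma = 15) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
x = rng.uniform(0, 10, size=n)      # e.g. hours studied
eps = rng.normal(0, 15, size=n)     # N(0, sigma^2) errors
y = 40 + 5 * x + eps                # "exam score", nominally in [0, 100]

print("share of simulated scores below 0:  ", np.mean(y < 0))
print("share of simulated scores above 100:", np.mean(y > 100))
```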

Hence, how do we fix this issue with the error term when the dependent variable on the left side of the linear equation is bounded? We could choose another bounded distribution for the error term, such as the uniform distribution over the bounds of $Y_{it}$. However, this would not be realistic, since it would imply that all values of the error term are equally likely. I am interested to hear people's thoughts about this problem.

Edit: From reading all the great answers and comments below, here is what I have to say. Would it be practical to impose a distribution with a bounded domain on $\epsilon_{it}$? For example, a triangular density over the particular domain that $Y_{it}$ lies in. Would imposing these types of distributions, which have a bounded domain and resemble the normal distribution, have any disadvantages?
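As a rough sketch of that idea, one can draw errors from a symmetric triangular density on $[-a, a]$, which is strictly bounded but still concentrates mass near zero, unlike the uniform. The half-width $a = 20$ below is an arbitrary illustrative choice, and the normal draws are variance-matched only to make the comparison fair (the variance of this triangular density is $a^2/6$).

```python
# Sketch: symmetric triangular errors on [-a, a] versus variance-matched
# normal errors. The triangular draws never exceed the bound; the normal
# draws occasionally do. a = 20 is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)

a = 20.0
eps_tri = rng.triangular(left=-a, mode=0.0, right=a, size=100_000)
eps_norm = rng.normal(0.0, a / np.sqrt(6), size=100_000)  # var(tri) = a^2 / 6

print("max |triangular error|:", np.abs(eps_tri).max())   # never exceeds a
print("max |normal error|:    ", np.abs(eps_norm).max())  # can go far beyond a
print("variances:", eps_tri.var(), eps_norm.var())
```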

Best Answer

It seems that you're confused about the relation of the sample size to the application of the CLT. The distribution of $\epsilon_{it}$ has nothing to do with the sample size. I'm assuming that the subscript $i$ refers to the subject (a person) and the subscript $t$ refers to the time of the observation.

In a simple linear regression we don't make many assumptions about $\epsilon$ in order to estimate $\beta_i$. The errors don't have to be normal, and with increasing sample size they will not tend to become normal.

The CLT is applied in two different ways:

  • When the sample size increases, the distribution of the estimate of $\beta_i$, often denoted $\hat{\beta}_i$, tends to become normal, i.e. approximately $\hat{\beta}_i\sim\mathcal{N}(\beta_i,\sigma_\beta^2)$, where $\sigma_\beta^2$ is a function of $\sigma^2$. Again, we do not require $\epsilon_{it}\sim\mathcal{N}(0,\sigma^2)$; we only need $\mathrm{var}[\epsilon_{it}]=\sigma^2$ for this. This is one of the large-sample properties of linear regression (see the simulation sketch after this list).
  • Often, when we deal with physical experiments, one can argue that there are many sources of error; when they all add up, they make $\epsilon_{it}$, the noise on a single observation, approximately normally distributed. This is not related to the sample size; it is simply many sources of error influencing a single observation. In this case we often make the reasonable assumption that $\epsilon_{it}\sim\mathcal{N}(0,\sigma^2)$.
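Here is a minimal simulation sketch of the first point: even with decidedly non-normal (skewed, centered exponential) errors, the sampling distribution of the OLS slope estimate looks approximately normal once the sample size is moderately large. The sample size, number of replications, and coefficients are illustrative assumptions.

```python
# Sketch: OLS slope estimates under skewed (centered exponential) errors.
# The histogram of slope estimates is approximately normal even though the
# errors themselves are not. All numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 1.0, 2.0
n, n_reps = 200, 5_000
slopes = np.empty(n_reps)

for r in range(n_reps):
    x = rng.uniform(0, 1, size=n)
    eps = rng.exponential(scale=1.0, size=n) - 1.0   # skewed errors, mean 0
    y = beta0 + beta1 * x + eps
    # OLS slope via the usual closed form cov(x, y) / var(x)
    slopes[r] = np.cov(x, y, bias=True)[0, 1] / x.var()

print("mean of slope estimates:", slopes.mean())   # close to beta1 = 2
print("skewness of slope estimates (small despite skewed errors):",
      ((slopes - slopes.mean()) ** 3).mean() / slopes.std() ** 3)
```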