First, categorizing continuous variables is generally a bad idea; Royston, Altman, and Sauerbrei wrote a good article on why dichotomizing is bad, and the same arguments apply to splitting into more categories. Altman also wrote an article specifically on categorizing variables, but only the abstract is freely available, and I have not read the whole article.
Second, the assumptions of linear regression are not that the dependent variable is normally distributed, but that the residuals from the model are. So, before you can tell whether your model violates the assumptions, you need to fit it and look at the residuals.
Third, if the residuals are not normally distributed, you have several choices:
- Multinomial logistic regression with the four categories you list (a sketch follows this list)
- Ordinal logistic regression with "mixed" excluded
- Looking at each category separately
- Some sort of robust regression
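If a concrete starting point helps, here is a minimal sketch of the first option using statsmodels; the column names and data below are synthetic placeholders (not from your question), and the ordinal and robust variants follow the same fit-and-inspect pattern.

```python
# Minimal sketch of the multinomial option, assuming statsmodels is available;
# the column names and data are made-up placeholders, not from the question.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "score": rng.normal(size=300),
    "category": rng.choice(["none", "type_a", "type_b", "mixed"], size=300),
})

X = sm.add_constant(df[["score"]])        # predictor(s) plus an intercept
y = pd.Categorical(df["category"]).codes  # four categories coded 0..3
fit = sm.MNLogit(y, X).fit(disp=False)    # multinomial logistic regression
print(fit.summary())
```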
Before doing any of these, my impulse would be to look at the variables graphically, with density plots and possibly quantile normal plots.
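As an illustration, here is a minimal sketch of those graphical checks applied to the residuals of a fitted linear model (again with made-up data, just so the snippet runs):

```python
# Density plot and quantile-normal (Q-Q) plot of residuals from a linear model.
# The data are synthetic, with deliberately heavy-tailed errors.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.standard_t(df=3, size=200)      # heavy-tailed errors on purpose

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
grid = np.linspace(resid.min(), resid.max(), 200)
axes[0].plot(grid, stats.gaussian_kde(resid)(grid))  # kernel density of residuals
axes[0].set_title("Residual density")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])    # quantile-normal plot
axes[1].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()
```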
This may be better appreciated by expressing the result of the CLT in terms of sums of iid random variables. We have
$$\sqrt{n} \frac{ \bar{X} -\mu}{\sigma} \sim N(0, 1) \quad \text{asymptotically}$$
Multiply the quotient by $\frac{\sigma}{\sqrt{n}}$ and use the fact that $Var(cX) = c^2 Var(X)$ to get
$$\bar{X}-\mu \sim N\left(0, \frac{\sigma^2}{n} \right)$$
Now add $\mu$ and use the fact that $\mathbb{E}\left[X+\mu\right] = \mathbb{E}[X] + \mu$ (adding a constant just shifts the mean) to obtain
$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \sim N\left(\mu, \frac{\sigma^2}{n} \right)$$
Lastly, multiply by $n$ and use the above two results to see that
$$\sum_{i=1}^n X_i \sim N \left(n \mu, n\sigma^2 \right) $$
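If you want to see this numerically, here is a quick simulation: sums of iid exponential draws (just one example of a non-normal distribution), standardized by $n\mu$ and $\sqrt{n}\sigma$, look standard normal for moderately large $n$.

```python
# Numerical check of the last display: standardized sums of iid Exponential(1)
# draws should be approximately N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 500, 10_000
mu, sigma = 1.0, 1.0                        # mean and sd of an Exponential(1)

sums = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
z = (sums - n * mu) / (np.sqrt(n) * sigma)  # should be ~ N(0, 1)

print(stats.kstest(z, "norm"))              # small statistic => close to normal
print(z.mean(), z.std())                    # roughly 0 and 1
```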
And what does this have to do with Wooldridge's statement? Well, if the error is the sum of many iid random variables then it will be approximately normally distributed, as just seen. But there is an issue here, namely that the unobserved factors will not necessarily be identically distributed and they might not even be independent!
Nevertheless, the CLT has been successfully extended to independent but non-identically distributed random variables, and even to cases of mild dependence, under some additional regularity conditions. These are essentially conditions that guarantee that no single term in the sum exerts disproportionate influence on the asymptotic distribution; see also the Wikipedia page on the CLT. You do not need to know these results, of course; Wooldridge's aim is merely to provide intuition.
Hope this helps.
I think you've basically hit the nail on the head in the question, but I'll see if I can add something anyway. I'm going to answer this in a bit of a roundabout way ...
The field of Robust Statistics examines the question of what to do when the Gaussian assumption fails (in the sense that there are outliers), and offers alternatives such as Huber's robust loss.
These ideas have been applied in ML too; for example, Mika et al. (2001), A Mathematical Programming Approach to the Kernel Fisher Algorithm, describe how Huber's robust loss can be used with KFDA (along with other loss functions). Of course this is a classification loss, but KFDA is closely related to the Relevance Vector Machine (see section 4 of the Mika paper).
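To make the idea of a robust loss concrete (this is not the KFDA setup from the paper, just a toy regression comparison using scikit-learn), note how a Huber-loss fit is far less affected by a few gross outliers than ordinary least squares:

```python
# Toy illustration: the Huber loss downweights outliers relative to the squared
# loss. Data are synthetic, with a handful of gross outliers added.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.5 * X.ravel() + rng.normal(scale=0.5, size=200)
y[:10] += 25                                   # a few gross outliers

ols = LinearRegression().fit(X, y)             # squared loss
huber = HuberRegressor().fit(X, y)             # Huber loss

print("OLS slope:  ", ols.coef_[0])            # dragged towards the outliers
print("Huber slope:", huber.coef_[0])          # stays close to the true 1.5
```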
As implied in the question, there is a close connection between loss functions and Bayesian error models (see here for a discussion).
However it tends to be the case that as soon as you start incorporating "funky" loss functions, optimisation becomes tough (note that this happens in the Bayesian world too). So in many cases people resort to standard loss functions that are easy to optimise, and instead do extra pre-processing to ensure that the data conforms to the model.
The other point you mention is that the CLT only applies to samples that are IID. This is true, but then the assumptions (and the accompanying analysis) of most algorithms are the same. When you start looking at non-IID data, things get a lot trickier. One example is temporal dependence, where the typical approach is to assume that the dependence only spans a certain window, so that samples outside this window can be treated as approximately IID (see for example this brilliant but tough paper, Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes), after which the usual analysis can be applied.
So, yes, it comes down partly to convenience and partly to the fact that, in the real world, most errors do look (roughly) Gaussian. One should of course always be careful when looking at a new problem to make sure the assumptions aren't violated.