It would be difficult to be clearer than what has been said in the other posts. Nevertheless I will try to say something to the point that addresses the different assumptions needed for OLS and various other estimation techniques to be appropriate to use.
OLS estimation: This is applied in both simple linear and multiple regression, where the common assumptions are (1) the model is linear in the coefficients of the predictors with an additive random error term, and (2) the random error terms are (a) normally distributed with mean 0 and (b) have a variance that doesn't change as the values of the predictor covariates (i.e. the IVs) change. Note also that in this framework, which applies in both simple and multiple regression, the covariates are assumed to be known without any uncertainty in their given values. OLS can be used when either A) only (1) holds together with 2(b), or B) both (1) and (2) hold.
If B) can be assumed OLS has some nice properties that make it attractive to use.
(I) MINIMUM VARIANCE AMONG UNBIASED ESTIMATORS
(II) MAXIMUM LIKELIHOOD
(III) CONSISTENCY, ASYMPTOTIC NORMALITY AND EFFICIENCY UNDER CERTAIN REGULARITY CONDITIONS
Under B) OLS can be used for both estimation and predictions and both confidence and prediction intervals can be generated for the fitted values and predictions.
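For concreteness, here is a minimal sketch of fitting under B) with Python's statsmodels and pulling out both kinds of intervals; the data and variable names are simulated/invented for illustration, not taken from any particular example above.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data satisfying (1) and (2): linear mean, i.i.d. normal errors
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

X = sm.add_constant(x)            # design matrix with an intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)                 # point estimates of intercept and slope
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for the coefficients

# Confidence intervals for the mean response and prediction intervals
# for a new observation, at a few new covariate values
X_new = sm.add_constant(np.array([2.0, 5.0, 8.0]))
pred = fit.get_prediction(X_new)
print(pred.summary_frame(alpha=0.05))  # mean_ci_* and obs_ci_* columns
```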
If only A) holds we still have property (I) but not (II) or (III). If your objective is to fit the model and you don't need confidence or prediction intervals for the response given the covariates, and you don't need confidence intervals for the regression parameters, then OLS can be used under A). But you cannot test for significance of the coefficients in the model using the t tests that are often used, nor can you apply the F test for overall model fit or the one for equality of variances. The Gauss-Markov theorem still tells you that property (I) holds. However, in case A), since (II) and (III) no longer hold, other more robust estimation procedures may be better than least squares even though they are not unbiased. This is particularly true when the error distribution is heavy-tailed and you see outliers in the data. The least squares estimates are very sensitive to outliers.
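As one illustration of such a robust alternative (not the only choice), an M-estimator such as Huber regression in statsmodels downweights large residuals. The data below are simulated with heavy-tailed errors and a few gross outliers purely for demonstration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.8 * x + rng.standard_t(df=2, size=60)   # heavy-tailed errors
y[:3] += 15                                          # a few gross outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # robust M-estimation

print("OLS:  ", ols_fit.params)    # pulled toward the outliers
print("Huber:", huber_fit.params)  # much less affected
```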
What else can go wrong with using OLS?
Error variances not homogeneous means a weighted least squares method may be preferable to OLS.
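For instance, if the error variance is known (or well estimated) up to proportionality, weighted least squares weights each observation by the inverse of its error variance. A sketch with made-up heteroscedastic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=80)
sigma = 0.3 * x                        # error SD grows with x (heteroscedastic)
y = 2.0 + 0.5 * x + rng.normal(0, sigma)

X = sm.add_constant(x)
wls_fit = sm.WLS(y, X, weights=1.0 / sigma**2).fit()   # weights = inverse variances
print(wls_fit.params)
```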
High degree of collinearity among predictors means that either some predictors should be removed or another estimation procedure such as ridge regression should be used. The OLS estimated coefficients can be highly unstable when there is a high degree of multicollinearity.
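A quick illustrative comparison of ordinary and ridge regression with scikit-learn, using simulated, nearly collinear predictors (all numbers are invented; the penalty `alpha=1.0` is just a placeholder, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

print(LinearRegression().fit(X, y).coef_)   # can be wildly large and offsetting
print(Ridge(alpha=1.0).fit(X, y).coef_)     # shrunken, more stable coefficients
```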
If the covariates are observed with error (e.g. measurement error), then the model assumption that the covariates are given without error is violated. This is bad for OLS because the criterion minimizes the residuals in the direction of the response variable, assuming there is no error to worry about in the direction of the covariates. This is called the errors-in-variables problem, and a solution that takes account of these errors in the covariate directions will do better. Errors-in-variables (aka Deming) regression minimizes the sum of squared deviations in a direction that takes account of the ratio of these error variances.
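A small sketch of Deming regression for a single covariate, using the standard closed-form slope when the ratio of error variances is assumed known (taken as 1 here, the orthogonal-regression case); the data are simulated for illustration:

```python
import numpy as np

def deming_fit(x, y, delta=1.0):
    """Deming regression; delta = Var(error in y) / Var(error in x), assumed known."""
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.mean((x - x_bar) ** 2)
    s_yy = np.mean((y - y_bar) ** 2)
    s_xy = np.mean((x - x_bar) * (y - y_bar))
    slope = (s_yy - delta * s_xx +
             np.sqrt((s_yy - delta * s_xx) ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    return slope, y_bar - slope * x_bar     # (slope, intercept)

# Both x and y observed with error around the true line y = x
rng = np.random.default_rng(4)
x_true = rng.uniform(0, 10, size=200)
x_obs = x_true + rng.normal(0, 1, size=200)
y_obs = x_true + rng.normal(0, 1, size=200)

print(deming_fit(x_obs, y_obs, delta=1.0))   # close to slope 1, intercept 0
print(np.polyfit(x_obs, y_obs, 1))           # OLS slope attenuated toward 0
```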
This is a little complicated because many assumptions are involved in these models, and your objectives play a role in deciding which assumptions are crucial for a given analysis. But if you focus on the assumptions one at a time and see the consequences of violating each, it might be less confusing.
Building hierarchical models is all about comparing groups. The power of the model is that you can treat the information about a particular group as evidence about how that group compares to the aggregate behavior at a particular level, so if you don't have a lot of information about a single group, that group gets pushed towards the mean for the level. Here's an example:
Let's say we wanted to build a linear model describing student literacy (perhaps as a function of grade-level and socioeconomic status) for a region. What's the best way to go about this? One naive way would be to just treat all the students in the region as one big group and calculate an OLS model for literacy rates at each grade level. There's nothing exactly wrong with this, but let's say that for a particular student, we know that they attend an especially good school out in the burbs. Is it really fair to apply the county-wide average literacy for their grade to this student? Of course not, their literacy will probably be higher than average because of our observation about their school. So as an alternative, we could develop a separate model for each school. This is great for big schools, but again: what about those small private schools? If we only have 15 kids in a class, we're probably not going to have a very accurate model.
Hierarchical models allow us to do both simultaneously. At one level, we calculate the literacy rate for the entire region. At another level, we calculate the school-specific literacy rates. The less information we have about a particular school, the more closely it will approximate the across-school mean. This also allows us to step up the model to consider other school districts, and maybe even go a level higher to compare literacy between states or even consider differences between countries. Anything going on all the way up at the country level won't have a huge impact all the way down at the county level because there are so many levels in between, but information is information and we should allow it the opportunity to influence our results, especially where we have very little data.
So if we have very little data on a particular school, but we know how schools in that country, state, and county generally behave, we can make some informed inferences about that school and treat new information as evidence weighed against the beliefs informed by the larger groups (the higher levels in the hierarchy).
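If it helps to see the pooling concretely, here is one possible sketch using a random-intercept model in statsmodels; the columns `literacy`, `grade`, and `school` and all of the numbers are invented for illustration, and a fully Bayesian hierarchical model (e.g. in PyMC) would express the same idea more directly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fake data: students nested in schools, literacy depends on grade plus a school effect
rng = np.random.default_rng(5)
n_schools, n_per_school = 20, 30
school = np.repeat(np.arange(n_schools), n_per_school)
grade = rng.integers(1, 9, size=school.size)
school_effect = rng.normal(0, 5, size=n_schools)[school]
literacy = 50 + 4 * grade + school_effect + rng.normal(0, 8, size=school.size)
df = pd.DataFrame({"literacy": literacy, "grade": grade, "school": school})

# Random intercept per school: schools with little data are shrunk toward the overall fit
model = smf.mixedlm("literacy ~ grade", df, groups=df["school"])
result = model.fit()
print(result.summary())
print(result.random_effects)   # per-school deviations from the overall intercept
```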
OLS, conditional expectation and linear projection are all related. It helps to distinguish between the unknown data generating process (the model) and procedures to estimate the parameters of that model.
Let this be the model/data-generating process, where $f$ is some unknown function:
$y_i = f(x_i, \theta) + \epsilon_i$, $\quad E[x_i\epsilon_i]=0$
We could use OLS, and regress $y_i$ on vector $x_i$. The OLS estimator is defined to be the vector $b$ that minimises the sample sum of squares $(y-Xb)^T(y-Xb)$ ( $y$ is $n \times 1$, $X$ is $n \times k$ ).
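As a sketch (with invented data and dimensions), $b$ can be computed directly by solving the normal equations:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 200, 3
X = rng.normal(size=(n, k))
theta = np.array([1.0, -2.0, 0.5])
y = X @ theta + rng.normal(size=n)

# b minimises (y - Xb)'(y - Xb); equivalently, solve the normal equations X'X b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically preferable route
print(b, b_lstsq)
```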
As the sample size $n$ gets larger, $b$ will converge to something (in probability). Whether it converges to $\theta$, though, depends on what the true model/dgp actually is, i.e. on $f$.
Suppose $f$ really is linear. Then $y_i = x_i^T\theta +\epsilon_i$ and $E[y_i|x_i]=x_i^T\theta$ and $b$ converges to $\theta$.
What if $f$ isn't linear? $b$ still converges to something, the thing it always converges to: the linear projection coefficient. What is a linear projection? It is the population equivalent of the OLS estimator: the vector $\beta$ that minimises $E[(y_i - x_i^T\beta)^2]$. Regardless of what the true relation between $y$ and $x$ is, this vector exists and OLS converges to it.
In the special case where the conditional expectation is linear, $\theta$ and $\beta$ are the same, and OLS recovers the conditional expectation function for you as the sample grows. If that function is not linear, OLS recovers just the linear projection coefficient for you, which could still be useful, because it is the mean square error minimising linear approximation of the conditional expectation function.
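A small simulation (made up for illustration) of that last point: with a quadratic conditional mean, the OLS fit does not recover $f$, but with a large sample it lands on the population linear projection coefficients, which can be worked out by hand in this case.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000                                  # large n: b should be near the projection
x = rng.uniform(0, 1, size=n)
y = x ** 2 + rng.normal(0, 0.1, size=n)      # conditional expectation is NOT linear

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Population linear projection of y on (1, x) for x ~ Uniform(0, 1), E[y|x] = x^2:
# slope = Cov(x, x^2) / Var(x) = (1/12) / (1/12) = 1,  intercept = 1/3 - 1*(1/2) = -1/6
print(b)                                     # approximately [-0.1667, 1.0]
```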