If we go for a simple answer, this excerpt from Wooldridge's book (page 533) is very appropriate:
... both heteroskedasticity and nonnormality result in the Tobit estimator $\hat{\beta}$ being inconsistent for $\beta$. This inconsistency occurs because the derived density of $y$ given $x$ hinges crucially on $y^*|x\sim\mathrm{Normal}(x\beta,\sigma^2)$. This nonrobustness of the Tobit estimator shows that data censoring can be very costly: in the absence of censoring ($y=y^*$) $\beta$ could be consistently estimated under $E(u|x)=0$ [or even $E(x'u)=0$].
The notation in this excerpt comes from the Tobit model:
\begin{align}
y^{*}&=x\beta+u, \quad u|x\sim N(0,\sigma^2)\\
y&=\max(y^*,0)
\end{align}
where $y$ and $x$ are observed.
To sum up, the difference between least squares and Tobit regression is the inherent assumption of normality in the latter.
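To see numerically how costly censoring can be, here is a minimal numpy sketch (mine, not from the quoted text; $\beta$, $\sigma$, and the sample size are arbitrary choices) that generates data from the Tobit model above and compares OLS on the latent $y^*$ with OLS on the censored $y$:

```python
# Simulate the Tobit data-generating process above and compare OLS on the
# latent y* with OLS on the censored y (illustrative sketch; beta, sigma,
# and the sample size are arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma = 10_000, 1.0, 1.0

x = rng.normal(size=n)
u = rng.normal(scale=sigma, size=n)       # u | x ~ N(0, sigma^2)
y_star = beta * x + u                     # latent outcome y*
y = np.maximum(y_star, 0.0)               # observed outcome, censored at 0

X = np.column_stack([np.ones(n), x])      # design with intercept
slope_latent = np.linalg.lstsq(X, y_star, rcond=None)[0][1]   # ~ beta
slope_censored = np.linalg.lstsq(X, y, rcond=None)[0][1]      # attenuated

print(f"OLS on y*: {slope_latent:.3f}   OLS on censored y: {slope_censored:.3f}")
```

The slope from OLS on the censored $y$ is pulled well below the true $\beta$, which is why a likelihood-based (and hence distribution-dependent) estimator like Tobit is needed in the first place.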
Also, I have always thought that Amemiya's original article was quite nice in laying out the theoretical foundations of Tobit regression.
The answer to both 1 and 2 is no, but care is needed in interpreting the existence theorem.
Variance of Ridge Estimator
Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard, equations 4.2-4.5, the risk (in terms of the expected squared $L^2$ norm of the error) is
$$
\begin{align*}
E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\
& = \gamma_1 (k) + \gamma_2(k) \\
& = R(k)
\end{align*}
$$
where as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ is the total variance of $\hat{\beta^*}$ (the sum of the variances of its components), while $\gamma_2$ is the squared length of its bias.
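As a sanity check on this decomposition (my own sketch, not from Hoerl & Kennard; the design, $\beta$, $\sigma$, and $k$ below are arbitrary), one can simulate many draws of the error and compare the empirical risk of the ridge estimator with $\gamma_1(k) + \gamma_2(k)$:

```python
# Monte Carlo check that E[(b - beta)'(b - beta)] = gamma_1(k) + gamma_2(k)
# for the ridge estimator b = (X'X + kI)^{-1} X'y (sketch; X, beta, sigma,
# and k are arbitrary choices).
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, k = 200, 3, 1.0, 0.5
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])

XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)                    # eigenvalues lambda_1..lambda_p
A = np.linalg.inv(XtX + k * np.eye(p))           # (X'X + kI)^{-1}

gamma1 = sigma**2 * np.sum(lam / (lam + k)**2)   # total variance term
gamma2 = k**2 * beta @ A @ A @ beta              # squared bias term

reps, sq_err = 10_000, 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b = A @ X.T @ y                              # ridge estimate
    sq_err += np.sum((b - beta) ** 2)

print(f"theory: {gamma1 + gamma2:.4f}   simulation: {sq_err / reps:.4f}")
```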
Supposing $X^T X = \mathbf{I}_p$, then
$$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$
Then
$$R^\prime (k) = 2\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3}$$ is the derivative of the risk with respect to $k$.
Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
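For what it is worth (this step is mine, not spelled out above), in this orthonormal case the stationarity condition can be solved explicitly. Setting the numerator of $R^\prime(k)$ to zero gives
$$k(1+k)\beta^T \beta - \left(p\sigma^2 + k^2 \beta^T \beta\right) = k\,\beta^T \beta - p\sigma^2 = 0 \quad\Longleftrightarrow\quad k^* = \frac{p\sigma^2}{\beta^T \beta},$$
and since $R^\prime(k)<0$ for $k<k^*$ and $R^\prime(k)>0$ for $k>k^*$, this $k^*$ is the unique minimizer of the risk.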
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$, and that as the condition number of $X^T X$ increases, $\lim_{k \rightarrow 0^+} R^\prime (k)$ approaches $- \infty$.
Comment
There appears to be a paradox here, in that if $p=1$ and $X$ is constant, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that a risk-minimizing value of $k$ exists for fixed $\beta^T \beta$. But for any $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show admissibility of the ridge estimate.
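To make that explosion concrete (again my own arithmetic, using the orthonormal-case risk above): for any fixed $k>0$,
$$R(k) = \frac{p\sigma^2 + k^2 \beta^T \beta}{(1+k)^2} \;\ge\; \frac{k^2}{(1+k)^2}\,\beta^T \beta \;\longrightarrow\; \infty \quad \text{as } \beta^T \beta \to \infty,$$
while $R(0) = p\sigma^2$ does not depend on $\beta$, so no single $k>0$ can dominate $k=0$ uniformly in $\beta$.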
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly-singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and that the invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
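To illustrate that last point (my own sketch; the correlation level, coefficient scale, and sample sizes are arbitrary, and scikit-learn's `RidgeCV` is just one convenient implementation), a cross-validated ridge fit typically beats OLS out of sample when the predictors are strongly correlated and prediction is all you care about:

```python
# Out-of-sample comparison of OLS and ridge when the predictors are strongly
# correlated (sketch; sample sizes, correlation, and coefficient scale are
# arbitrary, and the penalty is picked by leave-one-out cross-validation).
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

rng = np.random.default_rng(2)
n_train, n_test, p, rho, sigma = 50, 10_000, 10, 0.95, 1.0

cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # equicorrelated predictors
L = np.linalg.cholesky(cov)
beta = 0.3 * rng.normal(size=p)                      # modest beta'beta

def draw(m):
    X = rng.normal(size=(m, p)) @ L.T
    return X, X @ beta + rng.normal(scale=sigma, size=m)

X_tr, y_tr = draw(n_train)
X_te, y_te = draw(n_test)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = RidgeCV(alphas=np.logspace(-2, 3, 30)).fit(X_tr, y_tr)

for name, model in [("OLS", ols), ("ridge", ridge)]:
    mse = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"{name:>5} test MSE: {mse:.3f}")
```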
Best Answer
It would be difficult to be clearer than what has been said in the other posts. Nevertheless, I will try to say something that addresses the different assumptions needed for OLS and various other estimation techniques to be appropriate.
OLS estimation: This is applied in both simple and multiple linear regression, where the common assumptions are (1) the model is linear in the coefficients of the predictors with an additive random error term, and (2) the random error terms are (a) normally distributed with mean 0 and (b) have a variance that does not change as the values of the predictor covariates (i.e. the IVs) change. Note also that in this framework, which applies to both simple and multiple regression, the covariates are assumed to be known without any uncertainty in their given values. OLS can be used when either A) only (1) and 2(b) hold, or B) both (1) and (2) hold.
If B) can be assumed, OLS has some nice properties that make it attractive to use: (I) MINIMUM VARIANCE AMONG UNBIASED ESTIMATORS, (II) MAXIMUM LIKELIHOOD, (III) CONSISTENCY, ASYMPTOTIC NORMALITY, AND EFFICIENCY UNDER CERTAIN REGULARITY CONDITIONS.
Under B), OLS can be used for both estimation and prediction, and both confidence and prediction intervals can be generated for fitted values and for new observations.
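Here is a minimal sketch of case B) in practice (mine, not the original poster's; the simulated data and the 95% level are arbitrary), using statsmodels to produce both a confidence interval for the mean response and a prediction interval for a new observation:

```python
# Fit OLS under assumptions (1) and (2) and produce a confidence interval for
# the mean response and a prediction interval for a new observation
# (minimal sketch; the simulated data and the 95% level are arbitrary).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)   # normal, constant-variance errors

fit = sm.OLS(y, sm.add_constant(x)).fit()

X_new = np.column_stack([np.ones(3), [2.0, 5.0, 8.0]])   # new covariate values
frame = fit.get_prediction(X_new).summary_frame(alpha=0.05)
# mean_ci_*: confidence interval for E[y|x];  obs_ci_*: prediction interval
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```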
If only A) holds, we still have property (I) but not (II) or (III). If your objective is just to fit the model, and you need neither confidence nor prediction intervals for the response given the covariates, nor confidence intervals for the regression parameters, then OLS can be used under A). But you cannot test for significance of the coefficients using the usual t tests, nor can you apply the F test for overall model fit or the one for equality of variances. The Gauss-Markov theorem still gives you property (I). However, in case A), since (II) and (III) no longer hold, other more robust estimation procedures may be better than least squares even though they are not unbiased. This is particularly true when the error distribution is heavy-tailed and you see outliers in the data; the least squares estimates are very sensitive to outliers.
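A minimal sketch of that point (mine; the heavy-tailed errors and the contamination are arbitrary), comparing OLS with a Huber M-estimator from statsmodels:

```python
# Compare OLS with a Huber M-estimator when the errors are heavy-tailed and
# the data contain outliers (sketch; the data and contamination are arbitrary).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=100)   # heavy-tailed errors
y[:5] += 40                                           # a few gross outliers

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:  ", round(ols.params[1], 3))
print("Huber slope:", round(huber.params[1], 3))
```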
What else can go wrong with using OLS?
If the error variances are not homogeneous, a weighted least squares method may be preferable to OLS.
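A minimal sketch of that (mine; the variance function below, with the error standard deviation proportional to $x$, is an arbitrary example), using weights inversely proportional to the error variance:

```python
# Weighted least squares when the error variance grows with the covariate
# (sketch; the variance function sd = 0.5 * x is an arbitrary example).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)        # Var(u | x) = (0.5 x)^2

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / (0.5 * x) ** 2).fit()   # weights = 1 / Var(u | x)

# Both are unbiased here, but WLS gives a smaller standard error for the slope
print("OLS slope SE:", round(ols.bse[1], 4), "  WLS slope SE:", round(wls.bse[1], 4))
```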
A high degree of collinearity among the predictors means that either some predictors should be removed or another estimation procedure, such as ridge regression, should be used. The OLS coefficient estimates can be highly unstable when there is a high degree of multicollinearity.
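A small sketch of that instability (mine; the near-duplicate predictor and the ridge penalty $k=1$ are arbitrary choices), comparing how much the OLS and ridge coefficients on $x_1$ vary across repeated samples:

```python
# Instability of OLS coefficients under near-collinearity, compared with ridge
# (sketch; the near-duplicate predictor and the penalty k = 1 are arbitrary).
import numpy as np

rng = np.random.default_rng(6)
n, reps, k = 50, 2_000, 1.0
beta = np.array([1.0, 1.0])

ols_b1, ridge_b1 = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)              # x2 is almost a copy of x1
    X = np.column_stack([x1, x2])
    y = X @ beta + rng.normal(size=n)
    ols_b1.append(np.linalg.solve(X.T @ X, X.T @ y)[0])
    ridge_b1.append(np.linalg.solve(X.T @ X + k * np.eye(2), X.T @ y)[0])

print("sd of OLS   coefficient on x1:", round(float(np.std(ols_b1)), 2))
print("sd of ridge coefficient on x1:", round(float(np.std(ridge_b1)), 2))
```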
If the covariates are observed with error (e.g. measurement error), then the model assumption that the covariates are given without error is violated. This is bad for OLS because the criterion minimizes the residuals in the direction of the response variable, assuming there is no error to worry about in the direction of the covariates. This is called the errors-in-variables problem, and a solution that takes account of these errors in the covariate direction will do better. Errors-in-variables (aka Deming) regression minimizes the sum of squared deviations in a direction that takes account of the ratio of these error variances.
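A minimal numpy sketch of Deming regression for a single covariate (mine; it assumes the ratio $\delta$ of the error variance in $y$ to the error variance in $x$ is known, and uses the standard closed-form slope):

```python
# Errors-in-variables (Deming) regression for a single covariate (sketch;
# delta = Var(error in y) / Var(error in x) is assumed known here, and the
# slope uses the standard closed-form formula).
import numpy as np

def deming(x, y, delta=1.0):
    xbar, ybar = x.mean(), y.mean()
    sxx = np.mean((x - xbar) ** 2)
    syy = np.mean((y - ybar) ** 2)
    sxy = np.mean((x - xbar) * (y - ybar))
    slope = (syy - delta * sxx
             + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    return ybar - slope * xbar, slope            # (intercept, slope)

rng = np.random.default_rng(7)
x_true = rng.uniform(0, 10, size=500)
x_obs = x_true + rng.normal(scale=1.0, size=500)  # covariate measured with error
y = 1.0 + 2.0 * x_true + rng.normal(scale=1.0, size=500)

ols_slope = np.polyfit(x_obs, y, 1)[0]            # attenuated toward zero
dem_slope = deming(x_obs, y, delta=1.0)[1]        # close to the true slope 2
print(f"OLS slope: {ols_slope:.2f}   Deming slope: {dem_slope:.2f}")
```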
This is all a little complicated because many assumptions are involved in these models, and your objectives play a role in deciding which assumptions are crucial for a given analysis. But if you focus on the properties one at a time and consider the consequences of violating each assumption, it may be less confusing.