I don't use ridge regression all that much, so I'll focus on (A) and (C):

(A) While the Lasso is traditionally motivated by the p > n scenario, it is mathematically well-defined when n > p, i.e. the solution exists and is unique assuming your design matrix is sufficiently well-behaved. All the same formulas and error bounds continue to hold when n > p. All the algorithms (at least that I know of) that produce Lasso estimates should also work when n > p.

Most of the time if n > p (especially if p is small) you probably want to think carefully about whether or not the Lasso is your best option. As usual, it is problem dependent. That being said, in some situations the Lasso may be appropriate when n > p: For example, if you have 10,000 predictors and 15,000 observations, it's likely that you will still want some kind of regularization to trim down the number of predictors and kill some of the noise. The Lasso may be helpful here.

(B) Ridge regression can be used in the p > n situation to alleviate singularity issues in the design matrix. This may be useful if sparsity / feature selection is not important. Moreover, ridge regression has a very nice closed form solution that is easily interpreted, and this can be helpful in practice. In essence, you add a positive term to the main diagonal, which improves the regularity of the sample covariance (specifically, it removes vanishing eigenvalues as long as enough regularization is applied).

(I'll leave it to the experts to address this one more thoughtfully.)
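To make the diagonal-shift point in (B) concrete, here is a minimal sketch in plain Python. It assumes a toy design with n = 1 observation and p = 2 predictors; the numbers and the helper names (`gram`, `eigenvalues_sym_2x2`, `ridge_2d`) are made up for illustration and do not come from any library. It shows that $X^TX$ has a zero eigenvalue when p > n, that adding $\lambda I$ shifts every eigenvalue up by $\lambda$, and that the ridge estimate $(X^TX + \lambda I)^{-1}X^Ty$ is then well-defined.

```python
import math

def gram(X):
    """Return X^T X for a design matrix X given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

def eigenvalues_sym_2x2(A):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix, via the closed form."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def ridge_2d(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y for exactly two predictors."""
    G = gram(X)
    a, b, d = G[0][0] + lam, G[0][1], G[1][1] + lam
    det = a * d - b * b
    if det == 0.0:
        raise ZeroDivisionError("X^T X + lam * I is singular; use lam > 0")
    # Right-hand side X^T y.
    c0 = sum(row[0] * yi for row, yi in zip(X, y))
    c1 = sum(row[1] * yi for row, yi in zip(X, y))
    # Solve the 2x2 system with the explicit inverse.
    return [(d * c0 - b * c1) / det, (a * c1 - b * c0) / det]

X = [[1.0, 2.0]]  # toy data: n = 1, p = 2, so X^T X cannot have full rank
y = [3.0]
lam = 0.5         # illustrative penalty, not a recommended value

G = gram(X)
G_ridge = [[G[0][0] + lam, G[0][1]], [G[1][0], G[1][1] + lam]]
eig_plain = eigenvalues_sym_2x2(G)        # smallest eigenvalue is 0
eig_ridge = eigenvalues_sym_2x2(G_ridge)  # every eigenvalue shifted up by lam
beta_ridge = ridge_2d(X, y, lam)

print("eigenvalues of X^T X:          ", eig_plain)
print("eigenvalues of X^T X + lam * I:", eig_ridge)
print("ridge estimate:                ", beta_ridge)
```

In exact arithmetic any $\lambda > 0$ makes the matrix invertible, since the eigenvalues of $X^TX$ are non-negative. In floating point a very small $\lambda$ can still leave the system badly conditioned.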

(C) Soft thresholding and the Lasso are closely related, but not identical. One interpretation of soft thresholding is as the special case of Lasso regression when the predictors are orthogonal, which is of course a restrictive assumption.

Another interpretation of soft thresholding is as the one-at-a-time update in coordinate descent algorithms for the Lasso. I recommend the paper "Pathwise Coordinate Optimization" by Friedman et al. for an introduction to these concepts. For a slightly more recent and more general treatment, there is the excellent paper "SparseNet: Coordinate Descent With Nonconvex Penalties" by Mazumder et al.
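Both interpretations in (C) can be sketched in a few lines of plain Python. This is an illustration, not the implementation from either paper: it assumes the objective is scaled as $\tfrac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$ (other references divide the squared error by n, which rescales $\lambda$), and the function names and toy data are made up. Each coordinate update is a soft-thresholding step, and when the columns of $X$ are orthonormal the result matches soft thresholding of $x_j^Ty$ directly.

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows; every column is assumed to be non-zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual,
            # i.e. the residual with beta_j's own contribution added back.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Toy design with orthonormal columns (illustrative numbers).
X = [[0.6, -0.8],
     [0.8, 0.6],
     [0.0, 0.0]]
y = [3.0, 1.0, 1.0]
lam = 1.0

beta_cd = lasso_coordinate_descent(X, y, lam)
# With orthonormal columns the Lasso solution is soft thresholding of x_j^T y.
beta_closed = [
    soft_threshold(sum(X[i][j] * y[i] for i in range(len(y))), lam)
    for j in range(len(X[0]))
]
print(beta_cd, beta_closed)
```

With correlated columns the closed form no longer applies, but the same loop still works: it simply needs more than one sweep to converge.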

OLS, conditional expectation and linear projection are all related. It helps to distinguish between the unknown data generating process (the model) and procedures to estimate the parameters of that model.

Let the following be the model (the data generating process), where $f$ is some unknown function:

$y_i = f(x_i, \theta) +\epsilon_i$, $E[x_i\epsilon_i]=0$

We could use OLS, and regress $y_i$ on vector $x_i$. The OLS estimator is defined to be the vector $b$ that minimises the sample sum of squares $(y-Xb)^T(y-Xb)$ ( $y$ is $n \times 1$, $X$ is $n \times k$ ).

As the sample size $n$ gets larger, $b$ will converge to something (in probability). Whether it converges to $\theta$, though, depends on what the true model/dgp actually is, i.e. on $f$.

Suppose $f$ really is linear. Then
$y_i = x_i^T\theta +\epsilon_i$ and $E[y_i|x_i]=x_i^T\theta$ and $b$ converges to $\theta$.

What if $f$ isn't linear? $b$ still converges to something, the thing it always converges to: the linear projection coefficient. What is a linear projection? It is the population equivalent of the OLS estimator: the vector $\beta$ that minimises $E[(y_i-x_i^T\beta)^2]$. Regardless of what the true relation between $y$ and $x$ is, this vector exists (given finite second moments and a nonsingular $E[x_i x_i^T]$) and OLS converges to it.
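For reference, the projection coefficient can be written out explicitly. Assuming $E[x_i x_i^T]$ is invertible, the first-order condition of the population least-squares problem gives

$$E\left[x_i\,(y_i - x_i^T\beta)\right] = 0 \quad\Longrightarrow\quad \beta = E[x_i x_i^T]^{-1}\,E[x_i y_i],$$

and the OLS estimator is the same expression with sample averages in place of expectations:

$$b = (X^TX)^{-1}X^Ty = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^T\right)^{-1}\frac{1}{n}\sum_{i=1}^n x_i y_i .$$

By the law of large numbers the sample averages converge to the population moments, which is why $b$ converges to $\beta$ whatever $f$ is.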

In the special case where the conditional expectation is linear, $\theta$ and $\beta$ are the same, and OLS recovers the conditional expectation function for you as the sample grows. If that function is not linear, OLS recovers just the linear projection coefficient for you, which could still be useful, because it is the mean square error minimising linear approximation of the conditional expectation function.
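A small simulation can illustrate the nonlinear case. The sketch below uses a made-up data generating process, $x \sim U(0,1)$ and $y = x^2 + \epsilon$, chosen only because its linear projection on $(1, x)$ is easy to work out by hand: the slope is $\mathrm{Cov}(x, x^2)/\mathrm{Var}(x) = 1$ and the intercept is $E[x^2] - E[x] = -1/6$. The sample size, noise level and seed are arbitrary choices.

```python
import random

def ols_line(xs, ys):
    """OLS intercept and slope from regressing y on (1, x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(0)                          # fixed seed for reproducibility
n = 200_000
xs = [rng.random() for _ in range(n)]           # x ~ Uniform(0, 1)
ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]  # nonlinear f(x) = x^2 plus noise

intercept_hat, slope_hat = ols_line(xs, ys)

# Population linear projection of y on (1, x) for this particular DGP:
# slope = Cov(x, x^2) / Var(x) = (1/12) / (1/12) = 1
# intercept = E[x^2] - slope * E[x] = 1/3 - 1/2 = -1/6
intercept_proj, slope_proj = -1.0 / 6.0, 1.0

print("OLS estimate:     ", intercept_hat, slope_hat)
print("linear projection:", intercept_proj, slope_proj)
```

The fitted line is not the conditional expectation $x^2$, but it lands close to the best linear approximation of it, which is what the argument above predicts.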

## Best Answer

Building hierarchical models is all about comparing groups. The power of the model is that you can treat the information about a particular group as evidence relating how that group compares to the aggregate behavior for a particular level, so if you don't have a lot of information about a single group, that group gets pushed towards the mean for the level. Here's an example:

Let's say we wanted to build a linear model describing student literacy (perhaps as a function of grade-level and socioeconomic status) for a region. What's the best way to go about this? One naive way would be to just treat all the students in the region as one big group and calculate an OLS model for literacy rates at each grade level. There's nothing exactly *wrong* with this, but let's say that for a particular student, we know that they attend an especially good school out in the burbs. Is it really fair to apply the county-wide average literacy for their grade to this student? Of course not, their literacy will probably be higher than average because of our observation about their school. So as an alternative, we could develop a separate model for each school. This is great for big schools, but again: what about those small private schools? If we only have 15 kids in a class, we're probably not going to have a very accurate model.

Hierarchical models allow us to do both simultaneously. At one level, we calculate the literacy rate for the entire region. At another level, we calculate the school-specific literacy rates. The less information we have about a particular school, the more closely it will approximate the across-school mean. This also allows us to step up the model to consider other school districts, and maybe even go a level higher to compare literacy between states or even consider differences between countries. Anything going on all the way up at the country level won't have a *huge* impact all the way down at the county level because there are so many levels in between, but information is information and we should allow it the opportunity to influence our results, especially where we have very little data.

So if we have very little data on a particular school, but we know how schools in that country, state, and county generally behave, we can make some informed inferences about that school and treat new information as evidence against our beliefs informed by the larger groups (the higher levels in the hierarchy).
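The "pushed towards the mean" behaviour can be sketched with the simplest two-level case. This is a deliberately stripped-down illustration, not a full hierarchical fit: it assumes a normal model in which the region mean, the within-school variance `sigma2` and the between-school variance `tau2` are all known, whereas a real hierarchical model would estimate them from the data. All numbers and the function name are hypothetical.

```python
def partial_pooling_mean(school_mean, n_students, region_mean, sigma2, tau2):
    """Posterior mean of a school's true average in a normal-normal model.

    Assumes scores within a school are N(theta_school, sigma2) and that
    theta_school is N(region_mean, tau2), with all three parameters known.
    """
    # Weight on the school's own data grows with the number of students.
    weight = n_students / (n_students + sigma2 / tau2)
    return weight * school_mean + (1.0 - weight) * region_mean

region_mean = 70.0  # hypothetical region-wide literacy score
sigma2 = 100.0      # hypothetical within-school variance
tau2 = 25.0         # hypothetical between-school variance

# Two schools with the same observed average but very different sizes.
small_school = partial_pooling_mean(85.0, 15, region_mean, sigma2, tau2)
large_school = partial_pooling_mean(85.0, 600, region_mean, sigma2, tau2)

print("small school (15 students): ", round(small_school, 2))
print("large school (600 students):", round(large_school, 2))
```

Both schools report the same raw average, but the 15-student school is pulled noticeably further toward the region mean than the 600-student school. Adding further levels (district, state, country) repeats the same idea one level up.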