I don't know about the Garrote, but the LASSO is preferred over ridge regression when the solution is believed to have sparse features, because L1 regularization promotes sparsity while L2 regularization does not. The Elastic Net is preferred over the LASSO because it can deal with situations where the number of features is greater than the number of samples, and with correlated features, where the LASSO behaves erratically. The additional L2 term acts as a preconditioner or stabilizer by introducing strong convexity, though you'll have to read about convex optimization to appreciate that. I think the original paper by Zou and Hastie explains all this clearly, and is worth reading.
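To see the sparsity point concretely, here is a minimal sketch on synthetic data (the data, coefficients, and `alpha` values are all made up for illustration): with the same penalty strength, the LASSO sets most of the irrelevant coefficients exactly to zero, while ridge only shrinks them.

```python
# Sketch: L1 (Lasso) produces exact zeros, L2 (Ridge) only shrinks.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
# Only the first 3 of 20 features actually matter (a sparse truth).
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("exact zeros, lasso:", np.sum(lasso.coef_ == 0))
print("exact zeros, ridge:", np.sum(ridge.coef_ == 0))
```

The ridge coefficients on the noise features are small but (with probability one) not exactly zero; the LASSO's soft-thresholding zeroes most of them out.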
Generally speaking you need at least $p$ points to determine $p$ free parameters.
Consider the following simple situation: a logistic regression model, $P[Y=1|x]=\frac{\exp(\alpha+\beta x)}{1+\exp(\alpha+\beta x)}$, or equivalently $\text{logit}(P[Y=1|x])=\alpha+\beta x$, where we only have data at one $x$-value, say $x=5$, for which we have two binary outcomes, $y_1=0$ and $y_2=1$; the proportion of $1$'s at $x=5$ is $\frac{1}{2}$.
An infinite number of different logistic functions can be fitted through that point:
Indeed, any logistic curve with $\beta = -\alpha/5$ will go through that point. With no data at all, any point in $\mathbf{R}^2$ would be possible for $(\alpha,\beta)$. Adding a point $(x_1,y_1)$ restricts a perfect fit to the one-dimensional subspace $\alpha+x_1\beta=\text{logit}(y_1)$; if we added information at a second value of $x$, the subspace in which $(\alpha, \beta)$ could lie would shrink again, to zero dimensions (that is, data at two $x$-values are usually enough to determine two parameters).
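A quick numerical sketch of this non-identifiability, using the toy setup above ($y_1=0$, $y_2=1$, both at $x=5$): every pair $(\alpha,\beta)$ on the line $\beta=-\alpha/5$ predicts $P[Y=1\mid x=5]=\tfrac12$ and therefore has exactly the same likelihood.

```python
# Sketch: all parameter pairs with beta = -alpha/5 fit the data equally well.
import numpy as np

def p_hat(alpha, beta, x):
    # logistic model P[Y=1 | x]
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

def log_likelihood(alpha, beta):
    # two observations at x = 5: y1 = 0 and y2 = 1
    p = p_hat(alpha, beta, 5.0)
    return np.log(1 - p) + np.log(p)

for alpha in [-10.0, -1.0, 0.0, 2.0, 50.0]:
    beta = -alpha / 5.0
    print(alpha, beta, log_likelihood(alpha, beta))
# Every line prints the same log-likelihood, log(1/4) ≈ -1.386.
```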
Note that although we're fitting a curve, the equation defining the smaller subspace as we add points is linear in the parameters.
As such, the issues are the same as when $p>n$ in multiple regression.
The answers here and here are therefore relevant.
[If some appropriate form of regularization, constraint, or additional criterion is applied, it's generally possible to identify a unique member of the subspace; that is, the infinite number of solutions can be reduced to a single one.]
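Continuing the toy example above, here is a sketch of that mechanism (the penalty strength `lam` is arbitrary): the unpenalized likelihood is flat along the whole line $\beta=-\alpha/5$, but after adding an L2 penalty the objective has a single minimizer, which an optimizer reaches from any starting point on that line.

```python
# Sketch: an L2 penalty turns a line of optima into a unique optimum.
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(w):
    alpha, beta = w
    eta = alpha + beta * 5.0                 # linear predictor at x = 5
    p = 1.0 / (1.0 + np.exp(-eta))
    return -(np.log(1 - p) + np.log(p))      # data: y1 = 0, y2 = 1 at x = 5

lam = 0.1
penalized = lambda w: neg_log_lik(w) + lam * np.sum(w ** 2)

# Two different starting points, both on the flat line beta = -alpha/5:
for start in [np.array([4.0, -0.8]), np.array([-10.0, 2.0])]:
    sol = minimize(penalized, start, method="BFGS")
    print(sol.x)   # both runs end at the same unique point
```

Here the data are balanced, so the penalized solution is simply $(0,0)$; with informative data the unique solution would be a nontrivial shrunken fit.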
Best Answer
Yes, regularization can be used in all linear methods, including both regression and classification. I would like to show you that there is not much difference between regression and classification: the only difference is the loss function.
Specifically, a linear method has three major components: the loss function, the regularization, and the algorithm. The loss function plus the regularization form the objective function of the problem in its optimization form, and the algorithm is the way to solve it (the objective function is convex; we will not discuss algorithms in this post).
For the loss function, we can use different losses in the regression and classification cases. For example, least squares and least absolute deviation loss can be used for regression; their mathematical representations are $L(\hat y,y)=(\hat y -y)^2$ and $L(\hat y,y)=|\hat y -y|$. (The function $L(\cdot)$ is defined on two scalars, where $y$ is the ground truth value and $\hat y$ is the predicted value.)
On the other hand, logistic loss and hinge loss can be used for classification. Their mathematical representations are $L(\hat y, y)=\log (1+ \exp(-\hat y y))$ and $L(\hat y, y)= (1- \hat y y)_+$. (Here, $y$ is the ground truth label in $\{-1,1\}$ and $\hat y$ is the predicted "score". This definition of $\hat y$ is a little unusual; see the notation comments at the end.)
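The four losses above translate directly into code (a minimal sketch; the function names are my own):

```python
# The four losses: two for regression, two for classification.
import numpy as np

def squared_loss(yhat, y):      # regression: (yhat - y)^2
    return (yhat - y) ** 2

def absolute_loss(yhat, y):     # regression: |yhat - y|
    return np.abs(yhat - y)

def logistic_loss(yhat, y):     # classification, y in {-1, 1}
    return np.log(1 + np.exp(-yhat * y))

def hinge_loss(yhat, y):        # classification, y in {-1, 1}: (1 - yhat*y)_+
    return np.maximum(0.0, 1 - yhat * y)

# A confidently correct score (yhat = 2, y = 1) incurs a small logistic
# loss and zero hinge loss:
print(logistic_loss(2.0, 1), hinge_loss(2.0, 1))
```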
As for the regularization, you mentioned L1 and L2 regularization; there are also other forms, which will not be discussed in this post.
Therefore, at a high level, a linear method is
$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} L(w^{\top} x,y)+\lambda h(w)$$
If you replace the loss function of the regression setting with the logistic loss, you get logistic regression with regularization.
For example, in ridge regression, the optimization problem is
$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} (w^{\top} x-y)^2+\lambda w^\top w$$
If you replace the loss function with logistic loss, the problem becomes
$$\underset{w}{\text{minimize}}~~~ \sum_{x,y} \log(1+\exp{(-w^{\top}x \cdot y)})+\lambda w^\top w$$
Here you have the logistic regression with L2 regularization.
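As a sanity check of this correspondence, here is a sketch (on made-up two-Gaussian data) that minimizes the objective above directly and compares the result with a library solver. Note that sklearn's `LogisticRegression` parameterizes the penalty as $C\sum L + \frac12 w^\top w$, so $C=\frac{1}{2\lambda}$ matches our $\sum L + \lambda w^\top w$.

```python
# Sketch: minimize sum log(1+exp(-y w'x)) + lam * w'w directly,
# and compare with sklearn's L2-regularized logistic regression.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(1.0, 1.0, size=(n, 2))])
y = np.concatenate([-np.ones(n), np.ones(n)])

lam = 0.5

def objective(w):
    # exactly the objective from the text; logaddexp(0, -m) = log(1+exp(-m))
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * w @ w

w_direct = minimize(objective, np.zeros(2), method="BFGS").x

# sklearn minimizes C * sum(loss) + 0.5 * w'w, hence C = 1/(2*lam)
clf = LogisticRegression(C=1.0 / (2 * lam), fit_intercept=False).fit(X, y)
print(w_direct, clf.coef_.ravel())   # the two solutions agree
```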
This is what it looks like on a toy synthetic binary data set. The left figure shows the data with the linear model (the decision boundary). The right figure shows the objective function contour (the x and y axes represent the values of the two parameters). The data set was generated from two Gaussians, and we fit the logistic regression model without an intercept, so there are only two parameters, which we can visualize in the right sub-figure.
The blue lines are the logistic regression without regularization, and the black lines are the logistic regression with L2 regularization. The blue and black points in the right figure are the optimal parameters of the objective function.
In this experiment, we set a large $\lambda$, so you can see that the two coefficients are close to $0$. In addition, from the contour, we can observe that the regularization term dominates and the whole function looks like a quadratic bowl.
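This shrinkage effect is easy to reproduce (a sketch on made-up two-Gaussian data; recall that in sklearn's parameterization, a larger $\lambda$ means a smaller `C`):

```python
# Sketch: stronger L2 regularization (smaller C) shrinks the
# logistic regression coefficients toward 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(1.0, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

for C in [100.0, 1.0, 0.01]:   # smaller C = larger lambda = stronger penalty
    clf = LogisticRegression(C=C, fit_intercept=False).fit(X, y)
    print(C, np.linalg.norm(clf.coef_))   # the norm shrinks as C decreases
```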
Here is another example with L1 regularization.
Note that the purpose of this experiment is to show how regularization works in logistic regression, not to argue that the regularized model is better.
Here are some animations of L1 and L2 regularization and how they affect the logistic loss objective. In each frame, the title gives the regularization type and $\lambda$, and the plot is the contour of the objective function (logistic loss plus regularization). We increase the regularization parameter $\lambda$ in each frame, and the optimal solution shrinks toward $0$ frame by frame.
Some notation comments: $w$ and $x$ are column vectors, and $y$ is a scalar, so the linear model is $\hat y = f(x)=w^\top x$. If we want to include the intercept term, we can append a column of $1$'s to the data.
In the regression setting, $y$ is a real number, and in the classification setting $y \in \{-1,1\}$.
Note that the definition $\hat y=w^{\top} x$ is a little strange in the classification setting, since most people use $\hat y$ to denote a predicted value of $y$, whereas here $\hat y = w^{\top} x$ is a real number, not a value in $\{-1,1\}$. We use this definition of $\hat y$ because it simplifies the notation of the logistic loss and hinge loss.
Also note that in some other notation systems, where $y \in \{0,1\}$, the form of the logistic loss function is different.
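The two conventions are equivalent, which a short sketch makes explicit: with $y_{01}\in\{0,1\}$ the logistic loss is usually written as cross-entropy, and it equals $\log(1+\exp(-\hat y\, y))$ after mapping $y = 2y_{01}-1 \in \{-1,1\}$.

```python
# Sketch: the {-1,1} logistic loss and the {0,1} cross-entropy form agree.
import numpy as np

def loss_pm1(yhat, y):            # y in {-1, 1}
    return np.log(1 + np.exp(-yhat * y))

def loss_01(yhat, y01):           # y01 in {0, 1}, cross-entropy form
    p = 1.0 / (1.0 + np.exp(-yhat))
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

for yhat in [-2.0, 0.5, 3.0]:
    for y01 in [0, 1]:
        assert np.isclose(loss_pm1(yhat, 2 * y01 - 1), loss_01(yhat, y01))
print("the two forms agree")
```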
The code can be found in my other answer here.
Is there any intuitive explanation of why logistic regression will not work in the perfect separation case? And why does adding regularization fix it?