Let's consider a very simple model: $y = \beta x + e$, with an L1 penalty on $\hat{\beta}$ and a least-squares loss function on $\hat{e}$. We can expand the expression to be minimized as:
$\min_{\hat{\beta}} \; y^Ty - 2 y^Tx\hat{\beta} + \hat{\beta}^2 x^Tx + 2\lambda|\hat{\beta}|$
Keep in mind this is a univariate example, with a single predictor and hence a scalar $\beta$, meant to show how the LASSO can send a coefficient exactly to zero. The argument can be generalized to the multivariate case.
Let us assume the least-squares solution is some $\hat{\beta} > 0$, which is equivalent to assuming that $y^Tx > 0$, and see what happens when we add the L1 penalty. With $\hat{\beta}>0$, we have $|\hat{\beta}| = \hat{\beta}$, so the penalty term equals $2\lambda\hat{\beta}$. The derivative of the objective function w.r.t. $\hat{\beta}$ is:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda$
which evidently has solution $\hat{\beta} = (y^Tx - \lambda)/(x^Tx)$.
Obviously by increasing $\lambda$ we can drive $\hat{\beta}$ to zero (at $\lambda = y^Tx$). However, once $\hat{\beta} = 0$, increasing $\lambda$ won't drive it negative, because, writing loosely, the instant $\hat{\beta}$ becomes negative, the derivative of the objective function changes to:
$-2y^Tx +2x^Tx\hat{\beta} - 2\lambda$
where the flip in the sign of $\lambda$ is due to the absolute-value nature of the penalty term: when $\hat{\beta}$ becomes negative, the penalty term equals $-2\lambda\hat{\beta}$, and taking the derivative w.r.t. $\hat{\beta}$ gives $-2\lambda$. This leads to the solution $\hat{\beta} = (y^Tx + \lambda)/(x^Tx)$, which is obviously inconsistent with $\hat{\beta} < 0$ (given that the least-squares solution is $> 0$, which implies $y^Tx > 0$, and that $\lambda > 0$). Moving $\hat{\beta}$ from $0$ to a negative value would increase both the L1 penalty AND the squared-error term (since we would be moving farther from the least-squares solution), so we don't do it; we just stick at $\hat{\beta}=0$.
It should be intuitively clear that the same logic applies, with appropriate sign changes, to a least-squares solution with $\hat{\beta} < 0$.
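To make the soft-thresholding behaviour concrete, here is a minimal numerical sketch (my own illustration, not part of the original derivation; the simulated data and names are assumptions): for a positive least-squares solution the LASSO estimate is $\max(y^Tx - \lambda, 0)/(x^Tx)$, so it hits exactly zero once $\lambda \ge y^Tx$.

    # Illustrative sketch: univariate LASSO soft-thresholding.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                       # single predictor, 200 observations
    y = 0.5 * x + rng.normal(scale=0.5, size=200)  # true beta = 0.5

    yx, xx = y @ x, x @ x                          # the y'x and x'x terms from the derivation

    def lasso_beta(lam):
        """Closed-form univariate LASSO estimate when the LS solution is positive."""
        return max(yx - lam, 0.0) / xx             # sticks at exactly 0 once lam >= y'x

    for lam in [0.0, 0.5 * yx, yx, 2.0 * yx]:
        print(f"lambda = {lam:8.2f} -> beta_hat = {lasso_beta(lam):.4f}")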
With the ridge (L2) penalty $\lambda\hat{\beta}^2$, however, the derivative becomes:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda\hat{\beta}$
which evidently has solution $\hat{\beta} = y^Tx/(x^Tx + \lambda)$. Obviously no increase in $\lambda$ will drive this all the way to zero. So the L2 penalty can't act as a variable selection tool without some mild ad-hockery such as "set the parameter estimate to zero if its absolute value is less than $\epsilon$".
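For comparison, the same sketch applied to the ridge formula (again purely illustrative): $\hat{\beta} = y^Tx/(x^Tx + \lambda)$ shrinks toward zero as $\lambda$ grows but never reaches it for any finite $\lambda$.

    # Illustrative sketch: ridge shrinks but never reaches exactly zero.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 0.5 * x + rng.normal(scale=0.5, size=200)

    yx, xx = y @ x, x @ x

    for lam in [0.0, 1e1, 1e3, 1e6]:
        beta_ridge = yx / (xx + lam)               # strictly non-zero for finite lambda
        print(f"lambda = {lam:10.1f} -> beta_hat = {beta_ridge:.6g}")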
Obviously things can change when you move to multivariate models; for example, moving one parameter estimate around might force another one to change sign. But the general principle is the same: the L2 penalty function can't get you all the way to zero because, writing very heuristically, it in effect adds to the "denominator" of the expression for $\hat{\beta}$, while the L1 penalty function can, because it in effect adds to (or subtracts from) the "numerator".
Suppose you have two highly correlated predictor variables $x, z$, and suppose both are centered and scaled (to mean zero, variance one). Then the ridge penalty on the parameter vector is $\beta_1^2 + \beta_2^2$, while the lasso penalty is $|\beta_1| + |\beta_2|$. Since the predictors are assumed highly collinear, $x$ and $z$ can more or less substitute for each other in predicting $Y$, so many linear combinations that partly trade $x$ for $z$ will work very similarly as predictors: for example $0.2 x + 0.8 z$, $0.3 x + 0.7 z$, or $0.5 x + 0.5 z$ will be about equally good.

Now look at these three examples: the lasso penalty is the same in all three cases, namely 1, while the ridge penalties differ, being respectively 0.68, 0.58, and 0.5. So the ridge penalty prefers equal weighting of collinear variables, while the lasso penalty is unable to choose between them. This is one reason ridge (or, more generally, the elastic net, which is a linear combination of the lasso and ridge penalties) works better with collinear predictors: when the data give little reason to choose between different linear combinations of collinear predictors, the lasso will just "wander", while ridge tends to choose equal weighting. That may well be a better guess for use with future data, and if it is also better for the present data, it could show up in cross-validation as better results with ridge.
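A tiny sketch verifying the penalty arithmetic quoted above (purely illustrative):

    # Check the L1 and L2 penalties for the three weightings of two collinear predictors.
    import numpy as np

    for w in [(0.2, 0.8), (0.3, 0.7), (0.5, 0.5)]:
        b = np.array(w)
        print(f"weights {w}: lasso penalty = {np.abs(b).sum():.2f}, "
              f"ridge penalty = {(b ** 2).sum():.2f}")
    # The lasso penalty is 1.00 in all three cases; the ridge penalty is 0.68, 0.58, 0.50,
    # smallest at equal weighting, so ridge prefers spreading the weight across both.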
We can view this in a Bayesian way: ridge and lasso imply different prior information (roughly, a Gaussian prior on the coefficients for ridge and a Laplace prior for lasso), and the prior information implied by ridge tends to be more reasonable in such situations. (I learned this explanation, more or less, from the book "Statistical Learning with Sparsity: The Lasso and Generalizations" by Trevor Hastie, Robert Tibshirani, and Martin Wainwright, but at the moment I am not able to find a direct quote.)
But the OP seems to have a different problem:
"However, my results show that the mean absolute error of Lasso or Elastic is around 0.61 whereas this score is 0.97 for the ridge regression"
Now, the lasso is also effectively doing variable selection: it can set some coefficients exactly to zero. Ridge cannot do that (except with probability zero). So it might be that, with the OP's data, some of the collinear variables are effective while others don't act at all (and the degree of collinearity is sufficiently low that this can be detected). See When should I use lasso vs ridge? where this is discussed. A detailed analysis would need more information than is given in the question.
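As a sketch of the kind of check the OP could run (the simulated data-generating process below is entirely my own assumption, not a reconstruction of the OP's data), one can compare cross-validated MAE of lasso and ridge on correlated predictors where only a few truly matter:

    # Compare cross-validated MAE of lasso and ridge on correlated predictors with a sparse truth.
    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n, p = 300, 20
    latent = rng.normal(size=(n, 1))
    X = 0.7 * latent + 0.3 * rng.normal(size=(n, p))   # mutually correlated columns
    beta = np.zeros(p)
    beta[:3] = [2.0, -1.5, 1.0]                        # only 3 predictors truly act
    y = X @ beta + rng.normal(size=n)

    for name, model in [("lasso", LassoCV(cv=5)), ("ridge", RidgeCV())]:
        mae = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(f"{name}: cross-validated MAE = {mae:.3f}")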
Best Answer
If you order 1 million ridge-shrunk, scaled, but non-zero features, you will have to make some kind of decision: you will look at the $n$ best predictors, but what is $n$? The LASSO solves this problem in a principled, objective way, because for every step on the regularization path (and often you'd settle on one point via, e.g., cross-validation), there are only $m$ coefficients which are non-zero.
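A short sketch of that point (the simulated data and parameter choices are assumptions for illustration): at each stop on the lasso path there is a definite set of non-zero coefficients, so the "how many predictors?" question is answered by the path itself rather than by an arbitrary cut-off on ranked ridge coefficients.

    # Count the non-zero coefficients at each step of the lasso regularization path.
    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    beta = np.zeros(50)
    beta[:5] = rng.normal(size=5)                  # only 5 truly active features
    y = X @ beta + rng.normal(size=200)

    alphas, coefs, _ = lasso_path(X, y, n_alphas=10)
    for alpha, coef in zip(alphas, coefs.T):
        print(f"alpha = {alpha:.4f}: {np.count_nonzero(coef)} non-zero coefficients")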
Very often, you will train a model on some data and then later apply it to data not yet collected. For example, you could fit your model on 50,000,000 emails and then use that model on every new email. True, you will fit it on the full feature set for the first 50,000,000 mails, but for every following email you will deal with a much sparser, faster, and much more memory-efficient model. You also won't even need to collect the information for the dropped features, which may be hugely helpful if the features are expensive to extract, e.g. via genotyping.
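A purely illustrative sketch of that deployment point (the fitted model and feature handling are stand-ins, not a real email pipeline): once the lasso has zeroed out most coefficients, new observations only need the surviving features.

    # Keep only the features with non-zero lasso coefficients for prediction.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 1000))                      # many candidate features
    y_train = X_train[:, :10] @ rng.normal(size=10) + rng.normal(size=500)

    model = Lasso(alpha=0.1).fit(X_train, y_train)
    active = np.flatnonzero(model.coef_)                        # indices of surviving features
    print(f"{active.size} of {model.coef_.size} features kept")

    # At prediction time only the active features need to be extracted and stored.
    sparse_weights = dict(zip(active.tolist(), model.coef_[active]))

    def predict(new_features):
        """new_features: dict mapping feature index -> value (only active ones needed)."""
        return model.intercept_ + sum(w * new_features.get(i, 0.0)
                                      for i, w in sparse_weights.items())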
Another perspective on the L1/L2 problem, put forward by e.g. Andrew Gelman, is that you often have some intuition about what your problem may be like. In some circumstances, it is possible that reality is truly sparse. Maybe you have measured millions of genes, but it is plausible that only 30,000 of them actually determine dopamine metabolism. In such a situation, L1 arguably fits the problem better.
In other cases, reality may be dense. For example, in psychology, "everything correlates (to some degree) with everything" (Paul Meehl). Preferences for apples vs. oranges probably do correlate with political leanings somehow, and even with IQ. Regularization might still make sense here, but true zero effects should be rare, so L2 might be more appropriate.