There is a new paper, A Significance Test for the Lasso, with the inventor of the LASSO among its authors, that reports results on this problem. This is a relatively new area of research, so the references in the paper cover a lot of what is known at this point.
As for your second question, have you tried $\alpha \in (0,1)$? Often there is a value in this middle range that achieves a good compromise. This is called Elastic Net regularization. Since you are using cv.glmnet, you will probably want to cross-validate over a grid of $(\lambda, \alpha)$ values.
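Since cv.glmnet only cross-validates over $\lambda$ for one fixed $\alpha$ at a time, the $\alpha$ grid has to be handled by hand. A minimal sketch, with made-up data standing in for yours:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # stand-in data; substitute your own
y <- rnorm(100)

# cv.glmnet cross-validates lambda for a single fixed alpha, so loop over
# an alpha grid, reusing the same folds so the comparison is fair
foldid <- sample(rep(1:10, length.out = nrow(x)))
alphas <- seq(0, 1, by = 0.1)
fits   <- lapply(alphas, function(a)
  cv.glmnet(x, y, alpha = a, foldid = foldid))

# pick the (alpha, lambda) pair with the smallest cross-validated error
best <- which.min(sapply(fits, function(f) min(f$cvm)))
alphas[best]
fits[[best]]$lambda.min
```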
LASSO differs from best-subset selection in terms of penalization and path dependence.
In best-subset selection, presumably CV was used to identify that 2 predictors gave the best performance. During CV, full-magnitude regression coefficients without penalization would have been used for evaluating how many variables to include. Once the decision was made to use 2 predictors, then all combinations of 2 predictors would be compared on the full data set, in parallel, to find the 2 for the final model. Those 2 final predictors would be given their full-magnitude regression coefficients, without penalization, as if they had been the only choices all along.
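For concreteness, here is a sketch of that second stage using the leaps package (my assumption; the question does not say which tool was used): an exhaustive search over subsets of each size on the full data, with the chosen 2-variable model refit by ordinary least squares and therefore carrying no penalty.

```r
library(leaps)

set.seed(4)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(100)
df <- data.frame(y, x)

# exhaustive search: the best k-variable model on the full data for each k
all.fits <- regsubsets(y ~ ., data = df, nvmax = 10)
summary(all.fits)$which        # which predictors make the best model of each size

# the winning 2-variable model is refit by OLS, so its coefficients
# carry no shrinkage penalty at all
vars <- names(which(summary(all.fits)$which[2, -1]))
coef(lm(reformulate(vars, response = "y"), data = df))
```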
You can think of LASSO as starting with a large penalty on the sum of the magnitudes of the regression coefficients, with the penalty gradually relaxed. The result is that variables enter one at a time, with a decision made at each point during the relaxation whether it's more valuable to increase the coefficients of the variables already in the model, or to add another variable. But when you get, say, to a 2-variable model, the regression coefficients allowed by LASSO will be lower in magnitude than those same variables would have in the standard non-penalized regressions used to compare 2-variable and 3-variable models in best-subset selection.
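You can see this on a toy example (simulated data, not from the question): the coefficient path from glmnet shows variables entering one at a time as the penalty relaxes, and at the point where 2 variables are active their coefficients are smaller in magnitude than an unpenalized fit on those same 2 variables.

```r
library(glmnet)

set.seed(2)
x <- matrix(rnorm(200 * 10), 200, 10)        # simulated predictors
y <- x[, 1] + 0.5 * x[, 2] + rnorm(200)      # only 2 carry signal

fit <- glmnet(x, y)            # lasso path over a decreasing lambda sequence
plot(fit, xvar = "lambda")     # coefficients enter one at a time as lambda shrinks

# at the first lambda where exactly 2 variables are active, the lasso
# coefficients are shrunken relative to plain OLS on those same 2 variables
s2 <- fit$lambda[which(fit$df == 2)[1]]
coef(fit, s = s2)
coef(lm(y ~ x[, 1] + x[, 2]))
```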
This can be thought of as making it easier for new variables to enter in LASSO than in best-subset selection. Heuristically, LASSO trades off potentially lower-than-actual regression coefficients against the uncertainty in how many variables should be included. This tends to include more variables in a LASSO model, and potentially gives worse performance for LASSO if you knew for sure that only 2 variables needed to be included. But if you already knew how many predictor variables should be included in the correct model, you probably wouldn't be using LASSO.
Nothing so far has depended on collinearity, which leads to different types of arbitrariness in variable selection in best-subset versus LASSO. In this example, best-subset examined all possible combinations of 2 predictors and chose the best among those combinations. So the best 2 for that particular data sample win.
LASSO's path dependence, adding one variable at a time, means that an early choice of one variable may influence when other variables correlated with it enter later in the relaxation process. It's also possible for a variable to enter early and then for its LASSO coefficient to drop as other correlated variables enter.
In practice, the choice among correlated predictors in final models with either method is highly sample dependent, as can be checked by repeating these model-building processes on bootstrap samples of the same data. If there aren't too many predictors, and your primary interest is in prediction on new data sets, ridge regression, which tends to keep all predictors, may be a better choice.
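A sketch of that bootstrap check, with simulated correlated predictors standing in for real data: refit cv.glmnet on resampled rows and tabulate how often each variable survives.

```r
library(glmnet)

set.seed(3)
n <- 100
x <- matrix(rnorm(n * 10), n, 10)
x[, 2] <- x[, 1] + 0.3 * rnorm(n)      # make two predictors strongly correlated
y <- x[, 1] + rnorm(n)

# how often does the lasso keep each predictor across bootstrap samples?
selected <- replicate(100, {
  idx   <- sample(n, replace = TRUE)
  cvfit <- cv.glmnet(x[idx, ], y[idx])
  as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1] != 0
})
rowMeans(selected)   # selection frequency for each of the 10 predictors
```

On such data the choice between the two correlated predictors typically flips from one bootstrap sample to the next, which is the sample dependence described above.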
Best Answer
While in some sense this is true, do not take it to mean that gbm is a miracle worker. If the noise-to-signal ratio in your data is high and you give gbm enough chances to mistake noise for signal, it will do so.
Here's an example. I'll generate $50$ predictors independent of a random response, and then let gbm look for a relationship.
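The original code is not reproduced here; a minimal sketch of the same experiment (with my own seed, sample size, and tuning settings, so the exact numbers will differ) looks like this:

```r
library(gbm)

set.seed(1)
n  <- 200
df <- data.frame(matrix(rnorm(n * 50), nrow = n))   # 50 pure-noise predictors
df$y <- rnorm(n)                                    # response unrelated to any of them

fit <- gbm(y ~ ., data = df, distribution = "gaussian",
           n.trees = 2000, interaction.depth = 3,
           shrinkage = 0.01, cv.folds = 5)

# plot the cross-validation error curve and return the tree count minimizing it
best.iter <- gbm.perf(fit, method = "cv")
best.iter
```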
Even though there is no signal to find here, gbm is still able to decrease the cross-validation error below that of the null model.
gbm is telling me that the optimal number of trees is about 600, although the true value is zero. Notice how cross validation (or out-of-sample testing, in your case) is not protecting me here, as sometimes the same patterns appear by chance in both the training and hold-out data. Notice also that this happens even though the predictors are truly unrelated to the response.
Of course, one way to alleviate this is to fit the model on more data, and this does indeed rectify the problem.
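A sketch of that fix, simply rerunning the same experiment with many more rows of the same pure noise (the sample size is my own choice):

```r
# same experiment as above, just with far more rows of pure noise
n <- 5000
df_big <- data.frame(matrix(rnorm(n * 50), nrow = n))
df_big$y <- rnorm(n)

fit_big <- gbm(y ~ ., data = df_big, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3,
               shrinkage = 0.01, cv.folds = 5)

# with more data the CV error no longer drops below the null model,
# and the selected number of trees collapses toward zero
gbm.perf(fit_big, method = "cv")
```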
There is never a free lunch. If you pass the gbm algorithm predictors that bear a weak or no relationship to the response, you are always risking it finding false patterns.
Attempts to find an optimal subset of predictors are likely to make this problem worse rather than better. Pre-screening can cause severe optimism in your model and very quickly hurt its out-of-sample performance. On this point, I always recommend reading the section of The Elements of Statistical Learning titled "The Wrong and Right Way to Do Cross-validation".