Let's consider a very simple model: $y = \beta x + e$, with an L1 penalty on $\hat{\beta}$ and a least-squares loss function on $\hat{e}$. We can expand the expression to be minimized as:
$\min_{\hat{\beta}} \; y^Ty - 2 y^Tx\hat{\beta} + \hat{\beta}^2 x^Tx + 2\lambda|\hat{\beta}|$
Keep in mind this is a univariate example, with $\beta$ and $x$ being scalars, to show how LASSO can send a coefficient to zero. This can be generalized to the multivariate case.
Let us assume the least-squares solution is some $\hat{\beta} > 0$, which is equivalent to assuming that $y^Tx > 0$, and see what happens when we add the L1 penalty. With $\hat{\beta}>0$, $|\hat{\beta}| = \hat{\beta}$, so the penalty term equals $2\lambda\hat{\beta}$. The derivative of the objective function w.r.t. $\hat{\beta}$ is:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda$
which evidently has solution $\hat{\beta} = (y^Tx - \lambda)/(x^Tx)$.
Obviously by increasing $\lambda$ we can drive $\hat{\beta}$ to zero (at $\lambda = y^Tx$). However, once $\hat{\beta} = 0$, increasing $\lambda$ won't drive it negative, because, writing loosely, the instant $\hat{\beta}$ becomes negative, the derivative of the objective function changes to:
$-2y^Tx +2x^Tx\hat{\beta} - 2\lambda$
where the flip in the sign of $\lambda$ is due to the absolute value nature of the penalty term; when $\hat{\beta}$ becomes negative, the penalty term becomes $-2\lambda\hat{\beta}$, and taking the derivative w.r.t. $\hat{\beta}$ results in $-2\lambda$. This leads to the solution $\hat{\beta} = (y^Tx + \lambda)/(x^Tx)$, which is obviously inconsistent with $\hat{\beta} < 0$ (given that the least-squares solution is $> 0$, which implies $y^Tx > 0$, and that $\lambda > 0$). Moving $\hat{\beta}$ from $0$ to some value $< 0$ would increase both the L1 penalty and the squared-error term (since we move farther from the least-squares solution), so we don't; we just stick at $\hat{\beta}=0$.
It should be intuitively clear that the same logic applies, with appropriate sign changes, if the least-squares solution satisfies $\hat{\beta} < 0$.
With the squared (L2, i.e. ridge) penalty $\lambda\hat{\beta}^2$, however, the derivative becomes:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda\hat{\beta}$
which evidently has solution $\hat{\beta} = y^Tx/(x^Tx + \lambda)$. Obviously no increase in $\lambda$ will drive this all the way to zero. So the L2 penalty can't act as a variable selection tool without some mild ad-hockery such as "set the parameter estimate equal to zero if it is less than $\epsilon$".
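As a quick numerical check of both closed forms, here is a minimal sketch (the data are simulated purely for illustration; the variable names are mine, not from any particular source). It confirms that the lasso estimate is exactly zero once $\lambda \ge y^Tx$, while the ridge estimate only shrinks toward zero, and it cross-checks the lasso closed form against a brute-force grid minimization of the expanded objective above.

```python
import numpy as np

# Toy univariate data (made up for illustration): true beta = 0.5 > 0
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

yx, xx = y @ x, x @ x

def lasso_beta(lam):
    # closed form derived above: soft-thresholding of y'x
    return np.sign(yx) * max(abs(yx) - lam, 0.0) / xx

def ridge_beta(lam):
    # closed form derived above: y'x / (x'x + lambda)
    return yx / (xx + lam)

def brute_force_lasso(lam):
    # sanity check: grid-minimize the expanded objective y'y - 2 y'x b + b^2 x'x + 2 lam |b|
    grid = np.linspace(-1.0, 1.0, 200001)
    obj = y @ y - 2 * yx * grid + xx * grid**2 + 2 * lam * np.abs(grid)
    return grid[np.argmin(obj)]

for lam in [0.0, 10.0, yx, 2 * yx]:
    print(f"lambda={lam:7.1f}   lasso={lasso_beta(lam):+.4f} "
          f"(grid check {brute_force_lasso(lam):+.4f})   ridge={ridge_beta(lam):+.6f}")
# The lasso estimate hits exactly 0 once lambda >= y'x; the ridge estimate never does.
```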
Obviously things can change when you move to multivariate models; for example, moving one parameter estimate around might force another one to change sign. But the general principle is the same: the L2 penalty function can't get you all the way to zero because, writing very heuristically, it in effect enters the "denominator" of the expression for $\hat{\beta}$, whereas the L1 penalty function can, because it in effect enters the "numerator".
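To see the same contrast in a multivariate setting, here is a small scikit-learn sketch (again on simulated data, with hypothetical sizes and penalty strengths): the lasso sets most of the noise coefficients to exactly zero, while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]      # only 5 truly non-zero coefficients
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("exact zeros, lasso:", np.sum(lasso.coef_ == 0.0))   # typically most of the 45 noise columns
print("exact zeros, ridge:", np.sum(ridge.coef_ == 0.0))   # typically none
```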
I don't believe there is anything wrong with using LASSO for variable selection and then using OLS. From "Elements of Statistical Learning" (p. 91):
...the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero and in general they are not consistent [Added note: this means that, as the sample size grows, the coefficient estimates do not converge to the true coefficient values]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an un-restricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for $\lambda$ [the penalty parameter], and hence their coefficients will be shrunken less than those in the initial estimate.
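As a concrete illustration of the "lasso for selection, then un-penalized refit" recipe quoted above, here is a minimal scikit-learn sketch (simulated data; `LassoCV` picks the penalty by cross-validation, and the refit is plain OLS on the selected columns):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ beta_true + rng.normal(size=n)

# Step 1: lasso with cross-validated penalty, used only to pick the support
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0.0)

# Step 2: un-penalized OLS refit on the selected columns (removes the shrinkage bias)
ols = LinearRegression().fit(X[:, selected], y)

print("selected columns:", selected)
print("lasso coefficients (selected):", np.round(lasso.coef_[selected], 2))
print("OLS refit coefficients:       ", np.round(ols.coef_, 2))
```

The refit coefficients are typically noticeably larger in magnitude than the lasso's shrunken ones, which is exactly the bias the quote describes.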
Another reasonable approach, similar in spirit to the relaxed lasso, would be to use the lasso once (or several times in tandem) to identify a group of candidate predictor variables, and then use best subsets regression to choose among those candidates (also see "Elements of Statistical Learning" for this). For this to work, you would need to refine the group of candidate predictors down to around 35, which won't always be feasible. You can use cross-validation or AIC as a criterion to prevent over-fitting.
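A sketch of that two-stage idea, under assumptions of my own choosing (simulated data, a cap of 12 candidates so that naive enumeration stays cheap, and the Gaussian AIC up to an additive constant as the score):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)

# Stage 1: lasso narrows the 50 predictors down to a small candidate set.
# (Capped at 12 here so naive enumeration stays cheap; a branch-and-bound
#  best-subsets routine could handle the ~35 mentioned above.)
coef = LassoCV(cv=5).fit(X, y).coef_
candidates = np.argsort(-np.abs(coef))[:12]
candidates = candidates[coef[candidates] != 0.0]

def aic(cols):
    # OLS fit on the chosen columns; Gaussian AIC up to an additive constant
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + 2 * (len(cols) + 1)

# Stage 2: exhaustive search over all subsets of the candidate set, scored by AIC
best = min((s for k in range(1, len(candidates) + 1)
            for s in combinations(candidates, k)), key=aic)
print("lasso candidates:  ", sorted(candidates))
print("best subset by AIC:", sorted(best))
```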
Intuitively speaking, the group lasso can be preferred to the lasso since it provides a means for us to incorporate (a certain type of) additional information into our estimate for the true coefficient $\beta^*$. As an extreme scenario, consider the following:
With $y \sim \mathcal{N} (X \beta^*, \sigma^2 I )$, put $S = \{j : \beta^*_j \neq 0 \}$ as the support of $\beta^*$. Consider the "oracle" estimator $$\hat{\beta} = \arg\min_{\beta} \|y - X \beta\|_2^2 + \lambda \left( |S|^{1/2} \|\beta_S\|_2 + (p-|S|)^{1/2} \|\beta_{S^C}\|_2 \right),$$ which is the group lasso with two groups: one the true support and one its complement. Let $\lambda_{max}$ be the smallest value of $\lambda$ that makes $\hat{\beta} = 0$. Due to the nature of the group lasso penalty, we know that as $\lambda$ moves from $\lambda_{max}$ to $\lambda_{max} - \epsilon$ (for some small $\epsilon > 0$), exactly one group will enter the support of $\hat{\beta}$, which is popularly considered as an estimate for $S$. Due to our grouping, with high probability, the selected group will be $S$, and we'll have done a perfect job.
In practice, we don't select the groups this well. However, the groups, despite being finer than in the extreme scenario above, will still help us: the choice would still be made between groups containing true covariates and groups containing only noise covariates. We're still borrowing strength.
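One way to make the "whole groups enter or leave together" behaviour concrete: when the design is orthonormal ($X^TX = I$), the group lasso has a closed-form solution, block-wise soft thresholding of the least-squares estimate. The sketch below is a minimal numpy illustration under that assumption, with two hypothetical groups and made-up numbers; `z` plays the role of $X^Ty$.

```python
import numpy as np

def block_soft_threshold(z, groups, lam):
    """Group lasso solution for an orthonormal design X (X'X = I), where z = X'y.
    Minimizes 0.5*||y - X b||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    beta = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > 0:
            beta[g] = max(0.0, 1.0 - lam * np.sqrt(len(g)) / norm) * z[g]
    return beta

# Two hypothetical groups: group 0 carries signal, group 1 is pure noise.
rng = np.random.default_rng(4)
groups = [np.arange(0, 3), np.arange(3, 6)]
z = np.concatenate([np.array([2.0, -1.5, 1.0]), 0.1 * rng.normal(size=3)])

for lam in [0.0, 0.2, 1.0, 2.0]:
    beta = block_soft_threshold(z, groups, lam)
    print(f"lam={lam:3.1f}  group0={np.round(beta[groups[0]], 2)}  "
          f"group1={np.round(beta[groups[1]], 2)}")
```

Note how all the coefficients in a group vanish at the same value of $\lambda$, rather than one at a time as in the plain lasso.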
This is formalized here. They show, under some conditions, that an upper bound on the prediction error of the group lasso is lower than a lower bound on the prediction error of the plain lasso. That is, they prove that the grouping makes our estimation do better.
For your second question: the (plain) lasso penalty is piecewise linear, and this gives rise to the piecewise linear solution path. Intuitively, in the group lasso case, the penalty is no longer piecewise linear, so we no longer have this property. A great reference on piecewise linearity of solution paths is here. See their Proposition 1. Let $L(\beta) = \|y - X \beta\|_2^2$ and $J(\beta) = \sum_{g \in G} |g|^{1/2} \|\beta_g\|_2$. They show that the solution path of the group lasso is piecewise linear if and only if $$\left( \nabla^2L(\hat{\beta}) + \lambda \nabla^2 J(\hat{\beta}) \right)^{-1} \nabla J(\hat{\beta})$$ is piecewise constant. Of course, it isn't piecewise constant here, since our penalty $J$ has global curvature.
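To see the piecewise-linear lasso path in code, here is a small sketch using scikit-learn's `lars_path` with `method='lasso'`, which computes the exact path: a finite set of knots between which every coefficient is an affine function of the penalty. (The data are simulated; this only illustrates the plain-lasso side of the contrast, since there is no comparable exact path routine for the group lasso, which is the point above.)

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(5)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0]) + rng.normal(size=n)

# LARS-lasso returns the whole solution path exactly: a finite set of knots
# (values of alpha) between which each coefficient changes linearly.
alphas, active, coefs = lars_path(X, y, method='lasso')

print("knots along the path (alphas):", np.round(alphas, 3))
print("active variables at the end of the path:", active)
print("coefficients at the knots (one column per knot):")
print(np.round(coefs, 2))
```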