Like all the comments above, the Bayesian interpretation of LASSO is not taking the expected value of the posterior distribution, which is what you would want to do if you were a purist. If that would be the case, then you would be right that there is very small chance that the posterior would be zero given the data.
In reality, the Bayesian interpretation of LASSO is taking the MAP (Maximum A Posteriori) estimator of the posterior. It sounds like you are familiar, but for anyone who is not, this is basically Bayesian Maximum Likelihood, where you use the value that corresponds to the maximum probability of occurrence (or the mode) as your estimator for the parameters in LASSO. Since the distribution increases exponentially until zero from the negative direction and falls off exponentially in the positive direction, unless your data strongly suggests the beta is some other significant value, the maximum value of value of your posterior is likely to be 0.
Long story short, your intuition seems to be based on the mean of the posterior, but the Bayesian interpretation of LASSO is based on taking the mode of the posterior.
The classic Ridge Regression (Tikhonov Regularization) is given by:
$$ \arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2} $$
The claim above is that the following problem is equivalent:
$$\begin{align*}
\arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\
\text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t
\end{align*}$$
Let's define $ \hat{x} $ as the optimal solution of the first problem and $ \tilde{x} $ as the optimal solution of the second problem.
The claim of equivalence means that $ \forall t, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x} $.
Namely you can always have a pair of $ t $ and $ \lambda \geq 0 $ such the solution of the problem is the same.
How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are Convex and smooth so it should make things simpler.
The solution for the first problem is given at the point the gradient vanishes which means:
$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
The KKT Conditions of the second problem states:
$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$
and
$$ \mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0 $$
The last equation suggests that either $ \mu = 0 $ or $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.
Pay attention that the 2 base equations are equivalent.
Namely if $ \hat{x} = \tilde{x} $ and $ \mu = \lambda $ both equations hold.
So it means that in case $ {\left\| y \right\|}_{2}^{2} \leq t $ one must set $ \mu = 0 $ which means that for $ t $ large enough in order for both to be equivalent one must set $ \lambda = 0 $.
On the other case one should find $ \mu $ where:
$$ {y}^{t} \left( I + 2 \mu I \right)^{-1} \left( I + 2 \mu I \right)^{-1} y = t $$
This is basically when $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $
Once you find that $ \mu $ the solutions will collide.
Regarding the $ {L}_{1} $ case, well, it works with the same idea.
The only difference is we don't have closed for solution hence deriving the connection is trickier.
Have a look at my answer at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $ \lambda $ in Basis Pursuit.
Best Answer
Yes, that is correct. Whenever we have an optimisation problem involving maximisation of the log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ using a tuning parameter $\lambda$. The objective function in these cases can be written as:
$$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$$
where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyper-parameter. For LASSO and Ridge regression the penalty functions and corresponding prior-equivalents are:
$$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$$
The former method penalises the regression coefficients according to their absolute magnitude, which is the equivalent of imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is the equivalent of imposing a normal prior located at zero.
So long as the frequentist method can be posed as an optimisation problem (rather than say, including a hypothesis test, or something like this) there will be a Bayesian analogy using an equivalent prior. Just as the frequentists may treat the tuning parameter $\lambda$ as unknown and estimate this from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:
$$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$$
This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating it as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)
Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$. If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.
In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbb{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each of sub-vector $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":
$$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$$
The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:
$$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$$
One then estimates the tuning parameter by minimising the overall loss measure:
$$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$$
We can see that this is an optimisation problem, and so we now have two seperate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:
$$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \\[6pt] \end{aligned} \end{equation}$$
where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.
From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:
$$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\int L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) d\theta} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$$
where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$, with a fixed (and very large) hyper-parameter $\delta$.
(Note: For a related question looking at logistic ridge regression framed in Bayesian terms see here.)
$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.