Perhaps this should be rephrased as "attribution", but in many RL models, the signal that constitutes the reinforcement (e.g., the reward-prediction error in TD learning) does not assign any single action "credit" for that reward. Was it the right context but the wrong decision? The wrong context but the correct decision? Which specific action in a temporal sequence was the right one?
Similarly, in neural networks with hidden layers, the output does not specify which node, pixel, element, layer, or operation improved the model, so you don't necessarily know what needs tuning -- for example, the detectors (pooling and reshaping, activation, etc.) or the weight assignment (part of backpropagation). This is distinct from many supervised learning methods, especially tree-based methods, where each split tells you exactly what lift it gave to separating the class distributions (in classification, for example). Part of the credit-assignment problem is explored in "explainable AI", where we break down all of the outputs to determine how the final decision was made. This is done either by logging and reviewing at various stages (TensorBoard, loss-function tracking, weight visualizations, layer unrolling, etc.), or by comparing/reducing to other methods (ODEs, Bayesian models, GLRMs, etc.).
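To make the first point concrete, here is a minimal TD(0) sketch (toy states and rewards are illustrative, not from any specific source): the reinforcement signal is one scalar per transition, and nothing in that scalar says which earlier decision deserved the credit.

```python
# Minimal TD(0) value update: the "reinforcement" is a single scalar
# (the reward-prediction error delta), not a per-action attribution.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: one number summarizing the whole transition
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"A": 0.0, "B": 0.0, "C": 0.0}

# Trajectory A -> B -> C with reward only on the final transition.
# After one pass, the reward has only propagated one step back (to B);
# whether the decision taken *at A* caused the reward is never identified.
for s, r, s_next in [("A", 0.0, "B"), ("B", 1.0, "C")]:
    td0_update(V, s, r, s_next)
```

Note that `V["A"]` is still zero after this episode: credit flows backwards only gradually, through repeated sweeps, which is exactly the temporal credit-assignment problem described above.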
If this is the type of answer you're looking for, comment and I'll wrangle up some references.
I can hazard an answer here, but I think you're right to be confused.
To recap what you've said, the difference is in the criteria to evaluate predictions about the test set.
PCA uses RMSE, which simply evaluates how close the reconstructed data $\hat X$ is to the original data $X$ when encoded using $L$ components.
PPCA uses (negative) log-likelihood of the original data,
given the reconstruction and the estimated noise ($\sigma$),
$-\log P(X \mid \hat X, \sigma)$.
As discussed in Section 5.3.1 of your textbook,
the likelihood penalises the model both for errors in the value of $\hat X$,
and for how widely it spreads the probability mass ---
that is, for high values of $\sigma$, which can account for many values of $X$
but aren't very specific about which to actually expect.
I strongly suspect the decrease in log-likelihood with $L > 100$
is due to changes in the estimate of $\sigma$,
either causing it to be underestimated (model is overconfident in the reconstructed values) or overestimated (under-confident). I can't say whether it's systematically guaranteed to be one or the other, but you could easily check on a case-by-case basis.
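A case-by-case check of this kind is easy to sketch. The toy numbers below are illustrative: the Gaussian negative log-likelihood penalises the reconstruction both for error in $\hat X$ and for a mis-estimated $\sigma$, in either direction.

```python
import math

def neg_log_lik(x, xhat, sigma):
    # Gaussian negative log-likelihood -log P(x | xhat, sigma),
    # summed over elements: a normalisation term that grows with sigma,
    # plus a squared-error term that blows up as sigma shrinks.
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (xi - xh) ** 2 / (2 * sigma ** 2)
               for xi, xh in zip(x, xhat))

x    = [1.0, 2.0, 3.0, 4.0]
xhat = [1.1, 1.9, 3.2, 3.8]           # reconstruction with small errors

matched = neg_log_lik(x, xhat, 0.2)   # sigma roughly matches the errors
over    = neg_log_lik(x, xhat, 5.0)   # overestimated: mass spread too widely
under   = neg_log_lik(x, xhat, 0.01)  # underestimated: overconfident
```

Both the overconfident and the under-confident $\sigma$ give a worse (larger) negative log-likelihood than the matched one, even though $\hat X$ is identical in all three cases.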
As shown in my answer at http://stats.stackexchange.com/questions/210040/, the lasso estimator $$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \tfrac{1}{2} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 \right\}$$ equals zero if and only if $\lambda \geq \|X^T y\|_\infty =: \lambda_\max$. This result does, in some sense, answer your question: we now know that for a large enough tuning parameter $\lambda$, the lasso estimator $\hat\beta_\lambda$ is fully sparse. On the other hand, we also know that the least squares estimator $\hat\beta_0$ is almost surely fully dense.
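A quick numerical sanity check of the threshold, as a sketch: the toy data and the pure-Python coordinate-descent solver below are illustrative, and they assume the normalisation $\frac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, for which $\lambda_\max = \|X^T y\|_\infty$ (the threshold rescales under other normalisations of the least-squares term).

```python
def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0)
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j excluded
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm_sq = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / norm_sq
    return beta

X = [[1.0, 0.5], [0.5, 2.0], [1.5, 1.0], [2.0, 0.5]]
y = [1.0, 2.0, 2.0, 3.0]

# lambda_max = ||X^T y||_inf for this normalisation of the objective
xty = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(2)]
lam_max = max(abs(v) for v in xty)

at_max = lasso_cd(X, y, lam_max)           # fully sparse: all zeros
below_max = lasso_cd(X, y, 0.5 * lam_max)  # some coefficient survives
```

At $\lambda = \lambda_\max$ every soft-threshold update returns exactly zero, so the estimate stays at the origin; just below the threshold, at least one coefficient enters the model.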
Assumption: For convenience, let us assume that every estimator considered is unique. (If the estimators are not unique, all we've shown is that there exists at least one estimator that is sparse, which also answers the question.)
Since we've answered the other cases above, restrict $k \in \{1, \dots, p-1\}$. Write the knots of the lasso solution path as $0 \leq \lambda_1 \leq \dots \leq \lambda_q$. Let $\lambda_{r+1}$ be the largest knot such that $\mathrm{card}(\mathrm{supp}(\hat\beta_\lambda)) \leq k$ for all $\lambda > \lambda_{r+1}$. By definition, we know that a feature will leave the support at the knot $\lambda_{r+1}$. We will now show that exactly one feature leaves the support.
From KKT considerations (to be filled in later), we see that, for $\lambda \in [\lambda_r, \lambda_{r+1}]$, $$\hat\beta_\lambda^j = \hat\beta_{\lambda_r}^j + (\lambda_r - \lambda) e_j^T (X_S^T X_S)^{-1} z_S$$ for $j \in S$, and $\hat\beta_\lambda^j = 0$ for $j \not\in S$. Here $S = \mathrm{supp} (\hat\beta_{\lambda_r})$ and $z$ is the subgradient of the $\ell_1$ norm evaluated at $\hat\beta_{\lambda_r}$. Assuming that $j$ is one of the features that leaves the active set at the knot $\lambda_{r+1}$, we can set the above equation to zero and solve for $\lambda$ to find that $$\lambda_{r+1} = \lambda_r + \frac{\hat\beta_{\lambda_r}^j}{e_j^T (X_S^T X_S)^{-1} z_S}.$$ (Note that if the denominator were zero, then the $j^\textrm{th}$ feature could not be leaving the active set at the knot $\lambda_{r+1}$.) At this point, we see that the proof will conclude when it is shown that $$\lambda_j^\mathrm{update} := \frac{\hat\beta_{\lambda_r}^j}{e_j^T (X_S^T X_S)^{-1} z_S}$$ is distinct for each $j \in S$. However, the subgradient $z_S$ lies in $\{-1,1\}^{|S|}$, a finite set, and so it follows that under a continuous distribution on the response $y$, the update $\lambda_j^\mathrm{update}$ is almost surely unique for each $j \in S$.
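For completeness, the solve step is just linear algebra on the affine path: writing $c_j := e_j^T (X_S^T X_S)^{-1} z_S$ for the path slope of coordinate $j$ (a shorthand I introduce here), setting the coordinate to zero gives
$$0 = \hat\beta_{\lambda_r}^j + (\lambda_r - \lambda)\, c_j \quad\Longrightarrow\quad (\lambda - \lambda_r)\, c_j = \hat\beta_{\lambda_r}^j \quad\Longrightarrow\quad \lambda = \lambda_r + \frac{\hat\beta_{\lambda_r}^j}{c_j},$$
which is exactly the knot formula above, and makes clear why $c_j = 0$ would contradict coordinate $j$ leaving the active set.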