Least squares, lasso, and ridge regression minimize the following objective functions, respectively:
$\min_\beta ||y - X \beta||_2^2,$
$\min_\beta ||y - X \beta||_2^2 + \lambda ||\beta||_1,$
$\min_\beta ||y - X \beta||_2^2 + \lambda ||\beta||_2^2.$
Up to this point, the optimization has nothing to do with probability distributions (no assumption is made about the distribution of $y$ or of the parameters).
However, it is often desirable to give the minimizer a probabilistic interpretation, and this is why we place distributions on $y$ and the parameters.
Now assume that
$y \mid X, \beta \sim N(X \beta, \sigma^2 I)$; then the least squares minimizer is the maximum likelihood estimator.
Further, if we assume $\beta \sim N(0, \tau^2 I)$, then the ridge minimizer is the maximum a posteriori (MAP) estimator,
while if we assume a Laplace prior on $\beta$, the lasso minimizer is likewise the MAP estimator.
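To make the ridge case concrete, here is a short sketch under the Gaussian noise and prior assumptions above (the prior variance $\tau^2$ and Laplace scale $b$ are my notation, not in the original): up to additive constants, the negative log posterior is
$$
-\log \pi(\beta \mid y, X) = \frac{1}{2\sigma^2}||y - X\beta||_2^2 + \frac{1}{2\tau^2}||\beta||_2^2 + \text{const},
$$
so minimizing it is exactly the ridge problem with $\lambda = \sigma^2/\tau^2$. The same calculation with a Laplace prior $\pi(\beta) \propto \exp(-||\beta||_1/b)$ yields the lasso objective with $\lambda = 2\sigma^2/b$.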
In summary, we place distributions on $y$ and $\beta$ in order to give a probabilistic interpretation of the minimizer. However, these assumptions do not necessarily hold in reality; they serve interpretation purposes only.
Let $\Theta$ be the parameter space and $\mathcal{X}$ be the space of possible observations.
A Bayesian model is a probability distribution over $\mathcal{X} \times \Theta$.
This is usually written as a likelihood $f(x|\theta)$ and a prior $\pi(\theta)$, but it should be clear that these components exactly encode the joint distribution of the observations and parameters.
In this sense, both parameters and observations are random, and indeed, Bayes rule relies on the observations being realisations of random variables.
For this reason, I think that the "observations are fixed" statement is overstating the way in which Bayesian inference is somehow dual to frequentist inference.
The statement isn't without merit, though.
The object of Bayesian inquiry is the posterior $\pi(\theta|x)$, where $x$ is the particular data that was observed in the experiment, not an abstract random variable.
Although we had to think of the observations as random to determine this distribution on $\Theta$, once we have it, we never have to think about other values of $x$ again - only the observed ones matter.
For example, as Bayesians we might compute the posterior mean of $\theta$
$$
\int_{\Theta} \theta\cdot \pi(\theta|x) \, \mathrm{d} \theta,
$$
in which we fix $x$ and vary $\theta$ over the whole parameter space.
Essentially all Bayesian inference is similar to this.
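As a minimal numerical sketch of such a computation (the Beta posterior here is purely illustrative, borrowed from the example further down), the posterior mean is a one-dimensional integral over $\Theta$ with $x$ held fixed:

```r
# Posterior mean of theta: integrate theta * posterior density over the
# parameter space, with the observed data x fixed (posterior assumed Beta(a, b)).
a <- 950; b <- 650                       # illustrative posterior parameters
post_mean <- integrate(function(theta) theta * dbeta(theta, a, b),
                       lower = 0, upper = 1)$value
post_mean                                # approximately a / (a + b) = 0.59375
```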
Compare this to frequentism, where a common object of inference is the fitted likelihood $f(y; \hat{\theta}(x))$, where $\hat{\theta}(x)$ is the MLE of $\theta$.
Now, the only parameter value that we need to consider is $\hat{\theta}(x)$, but we still need other values in $\mathcal{X}$.
For example, a frequentist might want to compute the Fisher information evaluated at the MLE,
$$
-\int_{\mathcal{X}} \nabla^2_\theta \log f(y; \hat{\theta}(x))\cdot f(y; \hat{\theta}(x))\, \mathrm{d}y,
$$
in which it is now the parameter value that is fixed at $\hat{\theta}(x)$ and the observations $y$ that are varied over the whole observation space.
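As a small sketch of what this looks like for a discrete model (the binomial likelihood with $n = 1000$ and $x = 620$ is my illustrative choice, anticipating the example below), the integral over $\mathcal{X}$ becomes a sum over all possible counts, with the parameter held fixed at the MLE:

```r
# Fisher information at the MLE for a binomial model, computed by summing
# over the whole observation space (the discrete analogue of the integral above).
n <- 1000; x <- 620
theta_hat <- x / n                                   # MLE
# second derivative of the binomial log-likelihood with respect to theta
d2_loglik <- function(y, theta) -y / theta^2 - (n - y) / (1 - theta)^2
y <- 0:n
info <- -sum(d2_loglik(y, theta_hat) * dbinom(y, n, theta_hat))
info                                                 # equals n / (theta_hat * (1 - theta_hat))
```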
Best Answer
Let's start with a very simple example of Bayesian inference that includes some of the issues you raise. Then you may have a framework for follow-up questions and raising additional issues.
A political consultant is hired to advise one candidate in an upcoming election. From prior experience with other elections and some knowledge of the candidate, the consultant has the prior distribution $\mathsf{Beta}(330, 270)$ for the probability $\theta$ that the candidate will win. That is, the consultant thinks the probability the candidate will win is roughly 0.55 and likely between 0.51 and 0.59. Computation in R:
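A minimal sketch of that computation (using qbeta for the central 95% prior interval):

```r
# Prior mean and central 95% prior probability interval for Beta(330, 270)
330 / (330 + 270)                  # prior mean: 0.55
qbeta(c(.025, .975), 330, 270)     # roughly (0.510, 0.590)
```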
The prior distribution has density $p(\theta) \propto \theta^{330-1}(1-\theta)^{270-1}.$
Choosing the prior distribution is often at least partially a matter of opinion. The consultant might have been just as happy with another similar beta distribution as her prior.
Results of a public opinion poll by a reputable pollster show that $x = 620$ out of $n = 1000$ randomly chosen likely voters favor the candidate. Thus the binomial likelihood is $L(x|\theta) \propto \theta^{620}(1-\theta)^{1000-620}.$
Then by Bayes' Theorem, the posterior density is $$g(\theta|x) \propto \theta^{330-1}(1-\theta)^{270-1}\times\theta^{620}(1-\theta)^{1000-620} \\ \propto \theta^{330 + 620 - 1}(1-\theta)^{270 + 1000 - 620 - 1}\\ \propto \theta^{950-1}(1-\theta)^{650-1},$$ where we recognize the last expression as proportional to the density of $\mathsf{Beta}(950, 650).$ Information in this posterior distribution is a melding of information in the prior distribution and in the data.
In this case, it is easy to find the posterior distribution because the binomial likelihood is 'conjugate to' (mathematically compatible with) the beta density of the prior distribution.
A 95% Bayesian probability interval $(0.570, 0.618)$ for $\theta$ can be found by cutting 2.5% of the probability from each tail of the posterior distribution. Possible point estimates are the mean, median, or mode (in this case, all about 0.594).
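In R, a minimal sketch of these computations (using qbeta for quantiles of the Beta(950, 650) posterior) might look like this:

```r
# Central 95% posterior probability interval and point estimates for Beta(950, 650)
qbeta(c(.025, .975), 950, 650)      # roughly (0.570, 0.618)
950 / (950 + 650)                   # posterior mean: 0.59375
qbeta(.5, 950, 650)                 # posterior median, about 0.594
(950 - 1) / (950 + 650 - 2)         # posterior mode, about 0.594
```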
Here is a plot of the prior and posterior distributions. The 95% posterior probability interval is shown by dashed lines.
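A sketch of R code that would produce such a plot (line styles and the plotting range are my own choices, not necessarily those of the original figure):

```r
# Plot the prior Beta(330, 270) and posterior Beta(950, 650) densities,
# marking the 95% posterior probability interval with dashed vertical lines.
theta <- seq(0.4, 0.75, length.out = 1000)
plot(theta, dbeta(theta, 950, 650), type = "l", lwd = 2,
     xlab = expression(theta), ylab = "Density")
lines(theta, dbeta(theta, 330, 270), lty = 3)
abline(v = qbeta(c(.025, .975), 950, 650), lty = 2)
legend("topright", legend = c("Posterior Beta(950, 650)", "Prior Beta(330, 270)"),
       lwd = c(2, 1), lty = c(1, 3))
```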
So the data from the poll, combined with the prior distribution, show a slightly more favorable standing for the candidate than the prior distribution did on its own.
Notes: (1) If the prior distribution in this example had been the 'noninformative' Jeffreys prior $\mathsf{Beta}(.5,.5),$ then the 95% Bayesian posterior interval would have been nearly the same (numerically) as a frequentist 95% confidence interval, although Bayesians and frequentists interpret interval estimates somewhat differently. (A quick numerical check is sketched after these notes.)
(2) A conjugate prior distribution for a Poisson likelihood function is a gamma distribution. A normal prior distribution (for the mean, with known variance) is conjugate to a normal likelihood function.
(3) Reference. Suess & Trumbo (2010), Springer. The example shown above is similar to one found in Chapter 8 of this book.
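For note (1), here is a minimal sketch of the comparison (the Wald-style confidence interval is my illustrative choice of frequentist interval):

```r
# Posterior under the Jeffreys prior Beta(.5, .5) is Beta(620.5, 380.5);
# its 95% probability interval is numerically close to a Wald 95% confidence interval.
qbeta(c(.025, .975), 620.5, 380.5)                           # roughly (0.590, 0.650)
p_hat <- 620 / 1000
p_hat + c(-1.96, 1.96) * sqrt(p_hat * (1 - p_hat) / 1000)    # roughly (0.590, 0.650)
```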