If I'm not mistaken, both "quasi" and "pseudo" denote the same thing, namely optimization under incorrect distributional assumptions. Moreover, I think the terms are not restricted to the assumption of normality. Can an experienced reader confirm this? Cheers!
Solved – Quasi maximum likelihood estimation versus pseudo MLE
Related Solutions
I think the core misunderstanding stems from the questions you ask in the first half of your post. I approach this answer by contrasting the MLE and Bayesian inferential paradigms. A very approachable discussion of MLE can be found in chapter 1 of Gary King, Unifying Political Methodology; Gelman's Bayesian Data Analysis can provide details on the Bayesian side.
In Bayes' theorem, $$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$ and from the book I'm reading, $p(x|y)$ is called the likelihood, but I assume it's just the conditional probability of $x$ given $y$, right?
The likelihood is a conditional probability. To a Bayesian, this formula describes the distribution of the parameter $y$ given data $x$ and prior $p(y)$. But since this notation doesn't reflect your intention, henceforth I will use ($\theta$,$y$) for parameters and $x$ for your data.
But your update indicates that $x$ are observed from some distribution $p(x|\theta,y)$. If we place our data and parameters in the appropriate places in Bayes' rule, we find that these additional parameters pose no problems for Bayesians: $$p(\theta|x,y)=\frac{p(x,y|\theta)p(\theta)}{p(x,y)}$$
I believe this expression is what you are after in your update.
The maximum likelihood estimation tries to maximize $p(x,y|\theta)$, right?
Yes. MLE posits that $$p(x,y|\theta) \propto p(\theta|x,y)$$ That is, it treats the term $\frac{p(\theta)}{p(x,y)}$ as an unknown (and unknowable) constant. By contrast, Bayesian inference treats $p(x,y)$ as a normalizing constant (so that probabilities sum/integrate to unity) and $p(\theta)$ as a key piece of information: the prior. We can think of $p(\theta)$ as a way of incurring a penalty on the optimization procedure for "wandering too far away" from the region we think is most plausible.
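A minimal numerical sketch of this contrast (my own illustration, not from the original answer), assuming a normal model with known unit variance and a standard normal prior; the MAP objective is just the MLE objective plus the prior "penalty":

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data only: 20 draws from N(2, 1) standing in for the observed x.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)

def neg_log_likelihood(theta):
    # -log p(x | theta) for a N(theta, 1) model, up to an additive constant
    return 0.5 * np.sum((x - theta) ** 2)

def neg_log_posterior(theta):
    # adds -log p(theta) for a N(0, 1) prior: the "penalty" for wandering far from 0
    return neg_log_likelihood(theta) + 0.5 * theta ** 2

mle = minimize_scalar(neg_log_likelihood).x            # ignores the prior entirely
map_estimate = minimize_scalar(neg_log_posterior).x    # shrunk toward the prior mean 0
print(mle, map_estimate)
```

With only 20 observations the MAP estimate is pulled slightly toward 0; as the sample grows, the two essentially coincide.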
If so, I'm badly confused, because $x,y,\theta$ are random variables, right? To maximize $p(x,y|\theta)$ is just to find out the $\hat{\theta}$?
In MLE, $\theta$ is assumed to be a fixed quantity that is unknown but can be inferred, not a random variable; Bayesian inference treats $\theta$ as a random variable. Bayesian inference puts probability density functions in and gets probability density functions out, rather than the point summaries of the model that MLE produces. That is, Bayesian inference looks at the full range of parameter values and the probability of each, whereas MLE posits that the single value $\hat{\theta}$ is an adequate summary of the data given the model.
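A tiny sketch of "densities in, densities out" versus a point summary (again my own illustration, using a conjugate Beta–Bernoulli model rather than anything from the original post):

```python
import numpy as np
from scipy import stats

# Illustrative Bernoulli data: 7 successes, 3 failures.
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])

theta_mle = x.mean()                                  # MLE: a single point summary (0.7)

# Bayesian inference with a Beta(1, 1) prior returns a full distribution for theta.
posterior = stats.beta(1 + x.sum(), 1 + len(x) - x.sum())
print(theta_mle)
print(posterior.mean(), posterior.interval(0.95))     # summaries of the whole posterior
```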
Usually, maximum likelihood is used in a parametric context, but the same principle can be used nonparametrically. For example, if you have data consisting of observations of a continuous random variable $X$, say observations $x_1, x_2, \dots, x_n$, and the model is unrestricted, that is, it only says the data come from a distribution with cumulative distribution function $F$, then the empirical distribution function $$ \hat{F}_n(x) = \frac{\text{number of observations $x_i$ with $x_i \le x$}}{n} $$ is the non-parametric maximum likelihood estimator.
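A short sketch of $\hat{F}_n$ in code (my own illustration; the function name and data are arbitrary):

```python
import numpy as np

def ecdf(sample):
    """Return the empirical cdf F_hat_n of the sample as a callable."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = len(sample)
    def F_hat(t):
        # fraction of observations x_i with x_i <= t
        return np.searchsorted(sample, t, side="right") / n
    return F_hat

x = [2.1, -0.3, 1.4, 0.7, 3.2]
F_hat = ecdf(x)
print(F_hat(1.0))   # 2 of the 5 observations are <= 1.0, so 0.4
```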
This is related to bootstrapping. In bootstrapping, we are repeatedly sampling with replacement from the original sample $X_1,X_2, \dots, X_n$. That is exactly the same as taking an iid sample from $\hat{F}_n$ defined above. In that way, bootstrapping can be seen as nonparametric maximum likelihood.
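In code (illustrative only), the equivalence is just this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)                    # the original sample (illustrative)

# Sampling with replacement from x ...
boot = rng.choice(x, size=len(x), replace=True)
# ... is exactly an iid sample of size n from F_hat_n, because F_hat_n places
# probability mass 1/n on each observed value x_i.
```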
EDIT (answer to question in comments by @Martijn Weterings)
If the model is $X_1, X_2, \dotsc, X_n$ IID from some distribution with cdf $F$, without any restrictions on $F$, then one can show that $\hat{F}_n(x)$ is the mle (maximum likelihood estimator) of $F(x)$. That is done in What inferential method produces the empirical CDF? so I will not repeat it here. Now, if $\theta$ is a real parameter describing some aspect of $F$, it can be written as a function $\theta(F)$. This is called a functional parameter. Some examples are $$ \DeclareMathOperator{\E}{\mathbb{E}} \E_F X=\int x \; dF(x)\quad (\text{a Stieltjes integral}) \\ \text{median}_F X = F^{-1}(0.5) $$ and many others. The parameter space is $$\Theta =\left\{ F \colon \text{$F$ is a distribution function on the real line } \right\}$$
By the invariance property (Invariance property of maximum likelihood estimator?) we then find mle's by $$ \widehat{\E_F X} = \int x \; d\hat{F}_n(x) \\ \widehat{\text{median}_F X}= \hat{F}_n^{-1}(0.5). $$ It should be clearer now. We don't (as you ask) use the empirical distribution function to define the likelihood; the likelihood function is completely nonparametric, and $\hat{F}_n$ is the mle. The bootstrap is then used to describe the variability/uncertainty in the mle's of the $\theta(F)$'s of interest by resampling (which is simply iid sampling from $\hat{F}_n$).
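A plug-in sketch of both steps (my own illustration, with arbitrary simulated data): invariance gives the mle's of $\E_F X$ and $\text{median}_F X$ by evaluating them at $\hat{F}_n$, and resampling from $\hat{F}_n$ describes their uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100)   # illustrative data

# Plug-in mle's of the functional parameters theta(F), via the invariance property:
mean_hat = x.mean()                        # integral of t dF_hat_n(t)
median_hat = np.median(x)                  # essentially F_hat_n^{-1}(0.5)

# Bootstrap: resample from F_hat_n to describe the uncertainty of these mle's.
boot_medians = np.array([
    np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(2000)
])
print(mean_hat, median_hat, boot_medians.std())   # estimates and a bootstrap SE
```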
EDIT In the comment thread many seem to disbelieve this result (which really is a standard one!), so let me try to make it clearer. The likelihood function is nonparametric, and the parameter is $F$, the unknown cumulative distribution function. For a given cutoff point $x$ in $\mathbb{R}$, a function of the parameter is $\DeclareMathOperator{\P}{\mathbb{P}} x(F)=F(x)=\P(X \le x)$. A corresponding transformation of the random variable $X$ is $I_x=\mathbb{I}(X\le x)$, which is a Bernoulli random variable with parameter $x(F)$. The maximum likelihood estimate of $x(F)$ based on the sample $I_x(X_1), \dotsc, I_x(X_n)$ is the usual fraction of $X_i$'s that are less than or equal to $x$, and the empirical cumulative distribution function expresses this simultaneously for all $x$. Hope this is clearer now!
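To spell out the Bernoulli step (a standard calculation, added here only for completeness): writing $p = x(F) = F(x)$ and $I_i = \mathbb{I}(X_i \le x)$, the likelihood of the indicators is $$ L(p) = \prod_{i=1}^{n} p^{I_i}(1-p)^{1-I_i}, $$ and setting the derivative of $\log L(p)$ to zero gives $$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} I_i = \hat{F}_n(x). $$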
Best Answer
Quasi-likelihood and pseudo-likelihood mean different things. If the probability model is possibly misspecified, then the likelihood function is called a quasi-likelihood function (see White 1982, Econometrica, for example). In the special case where the probability model is correctly specified, quasi-maximum likelihood estimation coincides with maximum likelihood estimation. The terminology "pseudo-likelihood" is not as established, but it typically means that the independence assumptions which would permit the likelihood function to be constructed as a product of component likelihood functions are violated, yet the likelihood function is constructed as such a product anyway. Thus, every pseudo-likelihood function is a quasi-likelihood function, but not every quasi-likelihood function is a pseudo-likelihood function. See Besag 1986, "On the Statistical Analysis of Dirty Pictures" (Journal of the Royal Statistical Society, Series B, Vol. 48), for a discussion of the pseudo-likelihood function. These terms are not restricted to the assumption of normality.
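As a rough numerical illustration of the pseudo-likelihood idea (my own sketch, not taken from White 1982 or Besag 1986): the observations below are serially dependent, but the mean is estimated by maximizing a product of marginal densities as if they were independent.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Serially dependent data: normal observations with AR(1) errors (illustrative).
rng = np.random.default_rng(3)
n, rho, mu_true = 500, 0.6, 1.0
eps = rng.normal(size=n)
y = np.empty(n)
y[0] = mu_true + eps[0]
for t in range(1, n):
    y[t] = mu_true + rho * (y[t - 1] - mu_true) + eps[t]

def neg_pseudo_loglik(mu):
    # negative log of a product of marginal normal densities, up to terms not
    # involving mu -- i.e. a likelihood built as if the observations were independent
    return 0.5 * np.sum((y - mu) ** 2)

mu_hat = minimize_scalar(neg_pseudo_loglik).x
print(mu_hat)   # the point estimate is usually reasonable, but standard errors
                # computed from this "independence" likelihood would be too small
```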