I think the core misunderstanding stems from the questions you ask in the first half of your question. I approach this answer by contrasting the MLE and Bayesian inferential paradigms. A very approachable discussion of MLE can be found in chapter 1 of Gary King, Unifying Political Methodology; Gelman's Bayesian Data Analysis provides the details on the Bayesian side.
In Bayes' theorem, $$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$
and from the book I'm reading, $p(x|y)$ is called the likelihood, but I assume it's just the conditional probability of $x$ given $y$, right?
The likelihood is a conditional probability: the probability (density) of the data given the parameters, viewed as a function of the parameters. To a Bayesian, this formula describes the distribution of the parameter $y$ given data $x$ and prior $p(y)$. But since this notation doesn't reflect your intention, henceforth I will use $\theta$ for the parameters and $(x,y)$ for your data.
But your update indicates that $x$ are observed from some distribution $p(x|\theta,y)$. If we place our data and parameters in the appropriate places in Bayes' rule, we find that these additional quantities pose no problems for Bayesians:
$$p(\theta|x,y)=\frac{p(x,y|\theta)p(\theta)}{p(x,y)}$$
I believe this expression is what you are after in your update.
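To make the formula concrete, here is one made-up instance (the model is purely illustrative, not taken from your update): suppose each $x_i$ is normal with mean $\theta y_i$ and variance $1$, the prior on $\theta$ is standard normal, and the distribution of $y$ does not involve $\theta$. Then
$$p(\theta|x,y)\propto p(x,y|\theta)\,p(\theta)\propto\left[\prod_{i=1}^{n}e^{-\frac{1}{2}\left(x_i-\theta y_i\right)^2}\right]e^{-\frac{1}{2}\theta^2},$$
where every factor that does not involve $\theta$ (including $p(y)$ and the denominator $p(x,y)$) has been absorbed into the proportionality sign.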
The maximum likelihood estimation tries to maximize $p(x,y|\theta)$, right?
Yes. MLE posits that $$p(x,y|\theta) \propto p(\theta|x,y)$$
That is, it treats the term $\frac{p(\theta)}{p(x,y)}$ as an unknown (and unknowable) constant. By contrast, Bayesian inference treats $p(x,y)$ as a normalizing constant (so that probabilities sum/integrate to unity) and $p(\theta)$ as a key piece of information: the prior. We can think of $p(\theta)$ as a way of imposing a penalty on the optimization procedure for "wandering too far away" from the region we think is most plausible.
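To illustrate the "penalty" interpretation numerically, here is a minimal sketch (the data, model, and prior below are invented for illustration, using the same made-up regression setup as above): maximizing the log-likelihood alone gives the MLE, while adding the log-prior and maximizing gives the posterior mode (the MAP estimate), which is pulled toward the region the prior favours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up data for illustration: x_i ~ Normal(theta * y_i, 1), true theta = 2.0
rng = np.random.default_rng(0)
y = rng.normal(size=20)
x = 2.0 * y + rng.normal(size=20)

def log_likelihood(theta):
    # log p(x, y | theta), dropping additive terms that do not involve theta
    return -0.5 * np.sum((x - theta * y) ** 2)

def log_prior(theta):
    # Illustrative Normal(0, 0.5^2) prior on theta: the "penalty" term
    return -0.5 * (theta / 0.5) ** 2

# MLE: maximize the likelihood alone
theta_mle = minimize_scalar(lambda t: -log_likelihood(t)).x

# Posterior mode (MAP): maximize likelihood times prior, i.e. add the log-prior penalty
theta_map = minimize_scalar(lambda t: -(log_likelihood(t) + log_prior(t))).x

print(f"MLE estimate: {theta_mle:.3f}")  # typically near the true value 2.0
print(f"MAP estimate: {theta_map:.3f}")  # pulled toward 0 by the prior
```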
If so, I'm badly confused, because $x,y,\theta$ are random variables, right? To maximize $p(x,y|\theta)$ is just to find out the $\hat{\theta}$?
In MLE, $\theta$ is assumed to be a fixed quantity that is unknown but can be estimated, not a random variable; $\hat{\theta}$ is the resulting point estimate. Bayesian inference, by contrast, treats $\theta$ as a random variable. Bayesian inference puts probability density functions in and gets probability density functions out, rather than the point summaries of the model that MLE produces. That is, Bayesian inference looks at the full range of parameter values and the probability (density) of each, whereas MLE posits that $\hat{\theta}$ is an adequate summary of the data given the model.
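Here is a minimal sketch of "densities in, densities out" versus a point summary, using a conjugate Beta-Binomial model that I am making up purely for illustration: the MLE is the single number $k/n$, while the Bayesian output is an entire posterior density over $\theta$.

```python
from scipy import stats

# Made-up data for illustration: k successes in n Bernoulli(theta) trials
n, k = 20, 14

# MLE: a single point summary of theta
theta_mle = k / n                      # 0.7

# Bayesian inference: a density in (the prior), a density out (the posterior).
# With a Beta(a, b) prior, the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0                        # illustrative prior
posterior = stats.beta(a + k, b + n - k)

print("MLE point estimate:   ", theta_mle)
print("Posterior mean:       ", posterior.mean())
print("95% credible interval:", posterior.ppf([0.025, 0.975]))
```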
From a technical point of view, here is the argument:
For densities (but the argument is analogous in the discrete case), we write
$$ \pi \left( \theta |y\right) =\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{f(y)}
$$
The normalizing constant can be obtained by writing the marginal density as the joint density with the parameter integrated out, and then factoring the joint into a conditional times a marginal:
\begin{align*}
f(y)&=\int f\left( y,\theta \right) d\theta\\
&=\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta
\end{align*}
This normalizing constant ensures that the posterior integrates to 1, because
\begin{align*}
\int \pi \left( \theta |y\right) d\theta&=\int\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}d\theta\\
&=\frac{\int f\left( y|\theta \right) \pi \left(\theta \right) d\theta}{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}\\
&=1,
\end{align*}
where we can "take out" the integral in the denominator (i.e., pull it outside the outer integral) because $\theta$ has already been integrated out there, so the denominator is a constant with respect to $\theta$.
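The same bookkeeping can be checked numerically on a grid. The Beta-Binomial model below is invented just for the check: $f(y)$ is computed as the integral of likelihood times prior, and dividing by it makes the posterior integrate to 1.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Invented example: y = 14 successes out of n = 20 trials, theta ~ Beta(2, 2) prior
n, y_obs = 20, 14
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)       # grid over the parameter

likelihood = stats.binom.pmf(y_obs, n, theta)     # f(y | theta)
prior = stats.beta.pdf(theta, 2.0, 2.0)           # pi(theta)

# f(y) = integral of f(y | theta) * pi(theta) d theta  (the normalizing constant)
f_y = trapezoid(likelihood * prior, theta)

posterior = likelihood * prior / f_y              # pi(theta | y)
print(trapezoid(posterior, theta))                # ~ 1.0, as the argument shows
```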
Best Answer
Empirical Bayes is a means of using the observed data to compute point estimates of the hyperparameters parametrising your priors. This only makes sense in the context of a hierarchical Bayesian model, where you have hyperparameters that parametrise the priors on your model parameters.
Maximum likelihood is a frequentist approach: you compute point estimates of the parameters, and no uncertainty in those parameters is modelled through the use of priors (parametrised by hyperparameters or otherwise).
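Here is a rough sketch of that distinction with a made-up normal-normal hierarchical model (the model and numbers are illustrative only): empirical Bayes estimates the hyperparameters $(\mu,\tau)$ of the prior by maximizing the marginal likelihood of the observed data, then plugs those point estimates back in to form posterior estimates of the group-level parameters, whereas plain maximum likelihood never introduces a prior at all.

```python
import numpy as np
from scipy.optimize import minimize

# Invented data: one noisy observation x_j per group, x_j ~ N(theta_j, sigma^2),
# with group means theta_j ~ N(mu, tau^2); sigma is taken as known.
rng = np.random.default_rng(1)
sigma = 1.0
true_theta = rng.normal(loc=5.0, scale=2.0, size=8)
x = true_theta + rng.normal(scale=sigma, size=8)

# Plain maximum likelihood for each theta_j (no prior): just x_j itself.
theta_mle = x

# Empirical Bayes: choose the hyperparameters (mu, tau) by maximizing the
# marginal likelihood, under which x_j ~ N(mu, tau^2 + sigma^2).
def neg_marginal_loglik(params):
    mu, log_tau = params
    var = np.exp(log_tau) ** 2 + sigma ** 2
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

mu_hat, log_tau_hat = minimize(neg_marginal_loglik, x0=[0.0, 0.0]).x
tau_hat = np.exp(log_tau_hat)

# Plug the estimated hyperparameters back in: posterior means shrink x_j toward mu_hat.
shrink = sigma ** 2 / (sigma ** 2 + tau_hat ** 2)
theta_eb = shrink * mu_hat + (1 - shrink) * x

print("ML estimates:             ", np.round(theta_mle, 2))
print("Empirical Bayes estimates:", np.round(theta_eb, 2))
```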