I think the core misunderstanding stems from the questions in the first half of your post, so I approach this answer by contrasting the MLE and Bayesian inferential paradigms. A very approachable discussion of MLE can be found in chapter 1 of Gary King's Unifying Political Methodology; Gelman et al.'s Bayesian Data Analysis covers the Bayesian side in detail.
> In Bayes' theorem, $$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$
> and from the book I'm reading, $p(x|y)$ is called the likelihood, but I assume it's just the conditional probability of $x$ given $y$, right?
The likelihood is a conditional probability. To a Bayesian, this formula describes the distribution of the parameter $y$ given the data $x$ and the prior $p(y)$. But since this notation doesn't reflect your intention, henceforth I will use $\theta$ for the parameter and $(x, y)$ for your data.
But your update indicates that the data $(x, y)$ are observed from some distribution $p(x,y|\theta)$. If we place our data and parameter in the appropriate places in Bayes' rule, we find that the additional variable $y$ poses no problem for Bayesians:
$$p(\theta|x,y)=\frac{p(x,y|\theta)p(\theta)}{p(x,y)}$$
I believe this expression is what you are after in your update.
> The maximum likelihood estimation tries to maximize $p(x,y|\theta)$, right?
Yes. MLE posits that $$p(x,y|\theta) \propto p(\theta|x,y)$$
That is, it treats the term $\frac{p(\theta)}{p(x,y)}$ as an unknown (and unknowable) constant. By contrast, Bayesian inference treats $p(x,y)$ as a normalizing constant (so that probabilities sum/integrate to unity) and $p(\theta)$ as a key piece of information: the prior. We can think of $p(\theta)$ as a way of incurring a penalty on the optimization procedure for "wandering too far away" from the region we think is most plausible.
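To make that penalty view concrete, here is a minimal sketch contrasting plain MLE with a prior-penalized (MAP) estimate. The Gaussian model, the $N(0,1)$ prior, and the data are assumptions made up for illustration, not anything from your question:

```python
# Minimal sketch: MLE vs. prior-penalized (MAP) estimation.
# Assumptions: scalar Gaussian model with known scale 1, a N(0, 1)
# prior on theta, and made-up data; none of this is from the question.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.array([1.8, 2.1, 2.6, 1.4, 2.2])  # hypothetical observations

def neg_log_likelihood(theta):
    # -log p(x | theta): the quantity MLE minimizes
    return -norm.logpdf(x, loc=theta, scale=1.0).sum()

def neg_log_posterior(theta):
    # -[log p(x | theta) + log p(theta)]: the log-prior acts as a penalty
    # pulling theta toward the region the prior deems plausible (here 0)
    return neg_log_likelihood(theta) - norm.logpdf(theta, loc=0.0, scale=1.0)

theta_mle = minimize_scalar(neg_log_likelihood).x
theta_map = minimize_scalar(neg_log_posterior).x
print(theta_mle)  # the sample mean, approx. 2.02
print(theta_map)  # shrunk toward the prior mean 0, approx. 1.68
```

The penalized estimate is pulled from the sample mean toward the prior mean, which is exactly the "wandering" penalty described above.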
> If so, I'm badly confused, because $x$, $y$, and $\theta$ are random variables, right? To maximize $p(x,y|\theta)$ is just to find the $\hat{\theta}$?
In MLE, the parameter $\theta$ is assumed to be a fixed quantity that is unknown but able to be inferred, not a random variable; $\hat{\theta}$ is a point estimate of it. Bayesian inference instead treats $\theta$ as a random variable: it puts probability density functions in and gets probability density functions out, rather than the point summaries of the model that MLE produces. That is, Bayesian inference looks at the full range of parameter values and the probability of each, whereas MLE posits that $\hat{\theta}$ is an adequate summary of the data given the model.
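As a toy illustration of that densities-in, densities-out contrast, here is a sketch assuming a Bernoulli coin-flip model, a Beta(2, 2) prior, and made-up data of 7 heads in 10 flips (all assumptions for illustration):

```python
# Minimal sketch: MLE point estimate vs. a full Bayesian posterior.
# Assumptions: Bernoulli model, Beta(2, 2) prior, 7 heads in 10 flips.
import numpy as np
from scipy.stats import beta

heads, flips = 7, 10

# MLE: a single point summary of the data given the model
theta_hat = heads / flips  # 0.7

# Bayesian inference: by Beta-Bernoulli conjugacy, the posterior
# is Beta(2 + heads, 2 + tails), a density over all of [0, 1]
posterior = beta(2 + heads, 2 + (flips - heads))
grid = np.linspace(0, 1, 101)
density = posterior.pdf(grid)  # the probability density of each theta

print(theta_hat)                 # 0.7
print(posterior.mean())          # approx. 0.643
print(posterior.interval(0.95))  # a 95% credible interval for theta
```

MLE returns the single number 0.7; the Bayesian analysis returns an entire density over $[0,1]$, from which means, intervals, and other summaries can be read off.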
The objective is to estimate the parameters or, more precisely, to get a method for their estimation (since the same form of likelihood can be applied to different data sets).
There are different ways to choose parameter estimators; maximum likelihood is just one of them, and its criterion for choosing the estimate is that the probability of obtaining the observed result be maximal. Maximum likelihood estimators also have many convenient mathematical properties.
$P(Y|X, \theta)$ is a function relating the predictor variables $X$ to the output variables $Y$, parametrized by the parameters $\theta$. Its functional form is chosen a priori, and that choice limits how close the model can get to the "true" distribution (if such a "true" distribution exists at all): e.g., a normal/Gaussian density can approximate many distributions well (gamma, lognormal, etc.), but it will never reveal that the underlying distribution is not normal.
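To illustrate that last point, here is a hedged sketch: the data are drawn from a lognormal distribution, but we fit a normal model by maximum likelihood. The fitted normal is the best normal by the likelihood criterion, yet nothing in the estimates signals that the true distribution is skewed. The data and distributions are made up for illustration:

```python
# Minimal sketch: an a-priori-chosen normal model fit by MLE to data
# whose "true" distribution is lognormal (all values made up here).
import numpy as np
from scipy.stats import norm, lognorm

rng = np.random.default_rng(0)
y = lognorm(s=0.5).rvs(size=1000, random_state=rng)  # "true" lognormal data

# MLE under a normal model has a closed form: the sample mean and the
# (ddof=0, i.e., maximum likelihood) sample standard deviation
mu_hat, sigma_hat = y.mean(), y.std()

# The fitted normal is the closest normal by likelihood, but the point
# estimates (mu_hat, sigma_hat) carry no hint of the true density's skew
print(mu_hat, sigma_hat)
print(norm(mu_hat, sigma_hat).pdf(2.0), lognorm(s=0.5).pdf(2.0))
```

The normal fit dutifully returns a mean and a standard deviation; only comparing the two densities directly reveals what the chosen functional form can never express.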