One useful way of thinking about this is to note that there are cases where least squares and the MLE coincide, e.g., estimating the parameters when the random element has a normal distribution. So rather than the MLE not assuming a noise model (as you speculate), what is going on is that it does assume there is random noise, but takes a more general view of how that noise is shaped instead of simply assuming it is normal.
Any textbook on statistical inference will cover the nice properties of MLEs with regard to efficiency and consistency (but not necessarily unbiasedness). MLEs also have the nice property of being asymptotically normal themselves under a reasonable set of conditions.
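To see this equivalence concretely, here is a minimal simulation sketch (assuming NumPy/SciPy; the model $y = 2 + 3x + \epsilon$, the noise scale, and the sample size are made up for illustration): the closed-form least squares solution and a direct numerical maximization of the normal likelihood return the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate a linear model with normal noise: y = 2 + 3x + eps
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=200)
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

# Least squares estimate (closed form via lstsq)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE under the normal-noise model: minimize the negative Gaussian
# log-likelihood over (beta, log_sigma)
def neg_log_lik(params):
    beta, log_sigma = params[:2], params[2]
    sigma = np.exp(log_sigma)  # parameterize sigma > 0
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(3)).x[:2]

print(beta_ls)   # least squares coefficients
print(beta_mle)  # MLE coefficients -- agree up to optimizer tolerance
```

The two estimates of the coefficients agree up to optimizer tolerance; the only practical difference is that the MLE also estimates the noise scale $\sigma$.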
I'd like to provide a straightforward answer.
What is the main difference between maximum likelihood estimation
(MLE) vs. least squares estimation (LSE) ?
As @TrynnaDoStat commented, minimizing squared error is equivalent to maximizing the likelihood in this case. As stated on Wikipedia,
In a linear model, if the errors belong to a normal distribution the
least squares estimators are also the maximum likelihood estimators.
They can be viewed as almost the same in your case, since the assumptions of the least squares method are these four: 1) linearity; 2) normal residuals; 3) constant variance/homoscedasticity; 4) independence.
Let me detail this a bit. We know that the response variable $y$,
$$y=w^T X +\epsilon \quad\text{ where }\epsilon\thicksim N(0,\sigma^2)$$
follows a normal distribution (normal residuals),
$$P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I)$$
so the likelihood function (by independence) is
\begin{align}
L(y^{(1)},\dots,y^{(N)};w, X^{(1)},\dots,X^{(N)}) &= \prod_{i=1}^N \mathcal{N}(y^{(i)}|w^TX^{(i)}, \sigma^2I) \\ &=
\frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2\right).
\end{align}
Maximizing $L$ is equivalent to minimizing (since everything else is constant in $w$, using homoscedasticity)
$$\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2.$$
That's the least-squares method: minimizing the sum of squared differences between the predicted $\hat{y}^{(i)}$ and the actual $y^{(i)}$.
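Taking the logarithm of the likelihood above makes the equivalence explicit:
$$\log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2,$$
and since the first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $w$ (a single $\sigma^2$ for all observations, i.e. homoscedasticity), maximizing $\log L$ over $w$ is the same as minimizing the sum of squared residuals.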
Why can't we use MLE for predicting $y$ values in linear regression
and vice versa?
As explained above, we are actually (more precisely, equivalently) using the MLE when predicting $y$ values. If the response variable follows some other distribution rather than the normal, such as the Bernoulli distribution or any other member of the exponential family, we map the linear predictor to the mean of the response distribution through a link function (chosen according to the response distribution), and the likelihood function becomes the product of the corresponding outcome probabilities (values between 0 and 1) after the transformation. In linear regression the link function can be treated as the identity, since the linear predictor models the mean of the normal response directly.
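To make the non-normal case concrete, here is a minimal sketch of a Bernoulli response with a logit link (i.e. logistic regression), fitted by maximizing the likelihood numerically. The simulated data, the true coefficients, and the use of NumPy/SciPy are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic (inverse logit) link

rng = np.random.default_rng(1)

# Simulate Bernoulli responses whose success probability depends on x
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
p_true = expit(X @ np.array([-0.5, 2.0]))  # made-up true coefficients
y = rng.binomial(1, p_true)

# Negative Bernoulli log-likelihood with a logit link:
# L = prod p_i^y_i (1 - p_i)^(1 - y_i), where p_i = expit(w^T x_i)
def neg_log_lik(w):
    p = expit(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
print(w_mle)  # should be close to [-0.5, 2.0]
```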
Best Answer
By definition, the least squares estimator minimises the sum of the squared distances between the actual and predicted responses. With a set of simple steps, you can show that this estimator is equivalent to the solution of a certain maximisation problem. If we let $f$ denote the nonlinear regression function and let $\boldsymbol{\beta}$ denote the parameter of this function (and let $\sigma>0$ be an arbitrary scaling parameter), we then have:
$$\begin{align} \hat{\boldsymbol{\beta}}_\text{OLS}(\mathbf{y}, \mathbf{x}) &\equiv \underset{\boldsymbol{\beta}}{\text{arg min}} \sum_{i=1}^n (y_i - f(\mathbf{x}_i, \boldsymbol{\beta}))^2 \\[6pt] &= \underset{\boldsymbol{\beta}}{\text{arg max}} \bigg( - \sum_{i=1}^n (y_i - f(\mathbf{x}_i, \boldsymbol{\beta}))^2 \bigg) \\[6pt] &= \underset{\boldsymbol{\beta}}{\text{arg max}} \bigg( - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - f(\mathbf{x}_i, \boldsymbol{\beta}))^2 \bigg) \\[6pt] &= \underset{\boldsymbol{\beta}}{\text{arg max}} \ \exp \bigg( - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - f(\mathbf{x}_i, \boldsymbol{\beta}))^2 \bigg) \\[6pt] &= \underset{\boldsymbol{\beta}}{\text{arg max}} \ \prod_{i=1}^n \exp \bigg( - \frac{1}{2 \sigma^2} (y_i - f(\mathbf{x}_i, \boldsymbol{\beta}))^2 \bigg) \\[6pt] &= \underset{\boldsymbol{\beta}}{\text{arg max}} \ \prod_{i=1}^n \text{N} (y_i | f(\mathbf{x}_i, \boldsymbol{\beta}), \sigma^2). \\[6pt] \end{align}$$
(These steps use the fact that the $\text{arg min}$ and $\text{arg max}$ are invariant/anti-variant to strictly monotonic transformations. Look through the steps to ensure you understand why the minimising/maximising point is preserved under the steps.) The latter estimator is an MLE for a certain nonlinear regression model form --- can you see what model form this is?
Update: Per the suggestion from Dave in comments below, now that this question is a year old we can give a full solution. From the above equation we see that the MLE matches the least squares estimator when the regression model uses IID normal (Gaussian) error terms.
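As a quick numerical check of this equivalence in the nonlinear case (a sketch only: the exponential-decay regression function $f(\mathbf{x}_i, \boldsymbol{\beta}) = \beta_0 e^{-\beta_1 x_i}$, the simulated IID normal errors, and the SciPy optimizers are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import least_squares, minimize

rng = np.random.default_rng(2)

# Assumed nonlinear regression function and simulated data with IID normal errors
def f(x, beta):
    return beta[0] * np.exp(-beta[1] * x)

x = np.linspace(0, 5, 100)
y = f(x, [4.0, 0.8]) + rng.normal(scale=0.2, size=x.size)

# Least squares: minimize the sum of squared residuals
beta_ls = least_squares(lambda b: y - f(x, b), x0=[1.0, 1.0]).x

# MLE: maximize prod_i N(y_i | f(x_i, beta), sigma^2) over (beta, log_sigma)
def neg_log_lik(params):
    beta, sigma = params[:2], np.exp(params[2])
    resid = y - f(x, beta)
    return 0.5 * x.size * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_lik, x0=[1.0, 1.0, 0.0]).x[:2]

print(beta_ls, beta_mle)  # the two estimates agree up to optimizer tolerance
```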