> There seems to be one single intercept 49.80842, whereas it would make sense to have two different intercepts
No, it usually wouldn't make sense to have two intercepts; that only makes sense when you have a factor with two levels (and even then only if you regard the relationships at the two factor levels as separate equations).
The population intercept, strictly speaking, is $E(Y)$ for the population model when all the predictors are 0, and the estimate of it is whatever our fitted value is when all the predictors are zero.
In that sense - whether we have factor variables or numerical variables - there's only one intercept for the whole equation.
Unless, that is, you're considering different parts of the model as separate equations.
Imagine that we had one factor with three levels, and one continuous variable - for now without the interaction:
For the equation as a whole, there's one intercept, but if you think of it as a different relationship within each subgroup (level of the factor), there's three, one for each level of the factor ($a$) -- by considering a specific value of $a$, we get a specific straight line that is shifted by the effect of $a$, giving a different intercept for each group.
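This "one intercept for the whole equation, or one per factor level" view can be made concrete with a small numpy-only sketch (the data here are made up purely for illustration; the group effects and slope are arbitrary choices, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a factor with three levels and one continuous x.
n = 300
a = rng.integers(0, 3, n)                # factor levels 0, 1, 2
x = rng.uniform(0, 10, n)
group_shift = np.array([0.0, 4.0, 9.0])  # unequally spaced group effects
y = 2.0 + group_shift[a] + 1.5 * x + rng.normal(0, 0.5, n)

# Design matrix: intercept, dummies for levels 1 and 2, and x.
X = np.column_stack([np.ones(n), a == 1, a == 2, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, d1, d2, slope = beta

# One intercept for the equation as a whole ...
print("overall intercept:", b0)
# ... or three, one per level of the factor (baseline shifted by each dummy):
print("per-group intercepts:", b0, b0 + d1, b0 + d2)
```

The fitted lines for the three groups are parallel (one common slope) but shifted vertically, which is exactly the "different intercept for each group" reading.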
But now let's consider the relationship with $a$. Now for each level of $a$, if $x$ had no impact, there'd be a very simple relationship $E(Y|a=j)=\mu_j$. There's one intercept, the baseline mean (or if you conceive it that way, three, one for each subgroup -- where the intercept would be the average value in that subgroup).
(N.B. It may be hard to see here, but the means are not equally spaced; don't be tempted by this plot to think of $y$ as linear in $a$ considered as a numeric variable.)
But now if we consider $x$ does have an impact and look at the relationship at a specific value of $x$ ($x=x_0$), as a function of $a$, $E(Y|a=j)=\mu_j(x_0)$ -- each group has a different mean, but those means are shifted by the effect of $x$ at $x_0$.
So that would be one intercept (the black dot if it's the baseline group) ... at each value of $x$.
For each of an infinite number of different values that $x$ might take, there's a new intercept.
So depending on how we look at it, there's one intercept, or three, or an infinite number... but not two.
Now if we introduce an $x:a$ interaction, nothing changes but the slopes! We still can conceive of this as having one intercept, or perhaps three, or perhaps an infinite number.
So how does this all relate to two numeric variables?
Even though we didn't have it in this case, imagine that the levels of $a$ were numeric and that the fitted model was linear in $a$ (perhaps $a$ is discrete, like the number of phones owned collectively by a household). [i.e. we're now doing what I said earlier not to do, taking $a$ to be numeric and (conditionally) linearly related to $y$]
Then we'd still have one intercept in the strict sense, the value taken by the model when $x=a=0$ (even though neither variable is 0 in our sample), or one for each possible value taken by $a$ (in our sample, three different values occurred, but maybe 0, 4, 5 ... are also possible), or one for each value taken by $x$ (an infinity of possible values since $x$ is continuous). It doesn't matter whether our model has an interaction; that doesn't change how we count intercepts.
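The bookkeeping for "one intercept, or one per value of $a$, or one per value of $x$" is just arithmetic on the coefficients. A tiny sketch with hypothetical coefficients (the numbers are invented for illustration) for the model $E(Y) = b_0 + b_1 x + b_2 a + b_3 x a$:

```python
# Hypothetical coefficients for E(Y) = b0 + b1*x + b2*a + b3*x*a.
b0, b1, b2, b3 = 5.0, 2.0, -1.0, 0.5

def line_in_x(a):
    """Intercept and slope of the y-vs-x line holding a fixed."""
    return b0 + b2 * a, b1 + b3 * a

def line_in_a(x):
    """Intercept and slope of the y-vs-a line holding x fixed."""
    return b0 + b1 * x, b2 + b3 * x

# One intercept in the strict sense (x = a = 0):
print(line_in_x(0)[0])
# ... or a different intercept (and slope) for each value of a:
for a in (3, 4, 5):
    print(a, line_in_x(a))
```

Note that with the interaction present, changing $a$ shifts both the intercept and the slope of the $y$-vs-$x$ line, which previews the discussion of slopes below.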
So how do we interpret the interaction term when both variables are numeric?
You can consider it as providing for a different slope in the relationship between $y$ and $x$, at each $a$ (three different slopes in all, one for the baseline and two more via interaction), or you can consider it as providing for a different slope between $y$ and (the now-numeric) $a$ at each value of $x$.
Now if we replace this now numeric but discrete $a$ with a continuous variate, you'd have an infinite number of slopes for both one-on-one relationships, one at each value of the third variable.
You effectively say as much in your question of course.
> are we constrained to expressing this with absurd scenarios, such as if we had cars with 1hp we would have a modified slope for the weight equal to (−8.21662+0.02785)∗1∗weight? Or is there a more sensible way to look at this term?
Sure there is: consider values more like the mean. So for a typical relationship between mpg and wt, hold horsepower at some value near its mean. To see how much the slope changes, consider two values of horsepower, one below the mean and one above it.
Where the variable-values aren't especially meaningful in themselves (like score on some Likert-scale-based instrument say) you might go up or down by a standard deviation on the third variable, or pick the lower and upper quartile.
Where they are meaningful (like hp), you can pick two more or less typical values (100 and 200 seem like sensible choices of hp for the mtcars data, and if you also want to look at something near the mean, 150 will serve quite well; you might instead choose a typical value for a particular kind of car for each choice).
So you could draw a fitted mpg-vs-wt line for a 100 hp car, a 150 hp car, and a 200 hp car. You could also draw a mpg-vs-hp line for a car that weighs 2.0 (that's 2.0 thousand pounds) and one that weighs 4.0 (or 2.5 and 3.5 if you want something nearer to the quartiles).
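That arithmetic can be sketched directly from the coefficients quoted in the thread for `lm(mpg ~ hp * wt, data = mtcars)`. (A hedge: only the intercept, the wt coefficient, and the interaction coefficient appear in the thread; `b_hp` below is assumed to be the remaining hp main-effect coefficient from that same fit.)

```python
# Coefficients quoted in the thread from lm(mpg ~ hp * wt, data = mtcars);
# b_hp is assumed (the hp main effect from the same fit).
b0, b_hp, b_wt, b_int = 49.80842, -0.12010, -8.21662, 0.02785

def wt_slope(hp):
    """Slope of the mpg-vs-wt line at a fixed horsepower."""
    return b_wt + b_int * hp

def hp_slope(wt):
    """Slope of the mpg-vs-hp line at a fixed weight (in 1000 lbs)."""
    return b_hp + b_int * wt

# The mpg-vs-wt slope at typical horsepower values:
for hp in (100, 150, 200):
    print(hp, round(wt_slope(hp), 3))
```

So rather than evaluating the interaction at an absurd 1 hp, you read off how much the weight slope moderates as horsepower moves through realistic values.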
I think there is a lot of confusion here. First, I want to remind you that OLS and MLE are statistical algorithms for estimating parameters from data. OLS says: to get the parameter estimates for a linear model, find those that minimize the sum of the squared residuals. MLE says: to get the parameter estimates for a model, find those that maximize the likelihood, which is a function that depends on characteristics proposed by the analyst.
It turns out that for a linear model, the model coefficients estimated by OLS are identical to those estimated using MLE, because maximizing the likelihood is equivalent to minimizing the sum of the squared residuals when the user programs MLE in a specific way, that is, assuming the conditional density of the outcome (i.e., the density of the error) is normal. I'll call this specific application of MLE $\text{MLE}_{lge}$, i.e., MLE for a linear model with Gaussian errors. $\text{MLE}_{lge}$ corresponds to finding the parameter estimates $\beta$ and $\sigma$ such that $L_{lge}(\beta,\sigma)$ is maximized, where
$$L_{lge}(\beta,\sigma)= \prod\limits_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$
and $\mu_i=g(X_i)=X_i\beta$. It turns out that the $\beta$ estimates that maximize $L_{lge}(\beta,\sigma)$ are exactly the same ones that minimize the sum of the squared residuals (though the $\sigma$ estimates are different).
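This equivalence can be checked numerically. A minimal numpy-only sketch, with made-up data: OLS via the closed-form least-squares solution, and $\text{MLE}_{lge}$ via gradient ascent on the Gaussian log-likelihood in $\beta$ (the maximizer in $\beta$ does not depend on $\sigma$, so $\sigma$ is held fixed during the iterations):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data (made up for illustration).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(0, 2.0, n)

# OLS: minimize the sum of squared residuals (closed form).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE_lge: maximize the Gaussian log-likelihood in beta by gradient ascent
# (the gradient is proportional to X'(y - X beta); 1/n is a step rescaling).
beta = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (y - X @ beta) / n
    beta += 0.1 * grad
print(np.allclose(beta, beta_ols, atol=1e-6))  # same beta estimates

# The sigma estimates differ: MLE divides the SSR by n, OLS by n - p.
ssr = np.sum((y - X @ beta_ols) ** 2)
print("MLE variance:", ssr / n, "OLS variance:", ssr / (n - 2))
```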
MLE is consistent when the likelihood is correctly specified. For linear regression, the likelihood is usually specified assuming a normal distribution for the errors (i.e., as $L_{lge}(\beta,\sigma)$ above). $\text{MLE}_{lge}$ is not even necessarily consistent when the errors are not normally distributed. OLS is at least consistent (and unbiased) even when the errors are not normally distributed. Because the $\beta$ estimates resulting from OLS and $\text{MLE}_{lge}$ are identical, it doesn't matter which one you use in the face of non-normality (though, again, the $\sigma$ estimates will differ).
The interpretation of parameter estimates has nothing to do with the method used to estimate them. I could pull a number out of a hat and call it the slope and it would have the same interpretation as an estimate resulting from a more legitimate method (like OLS). I go into detail about this here. The interpretation of parameter estimates comes from the model, not the method used to estimate them.
The consistency of MLE depends on correct specification of the likelihood function, which is related to the density of the outcome given the covariates. For $\text{MLE}_{lge}$, we assume the density of each outcome is a normal distribution with mean $X_i\beta$ and variance $\sigma^2$. For binary outcomes, it often makes the most sense to think that each outcome has a Bernoulli distribution with probability parameter $p_i = g(X_i)$, where $g(X_i)$ is the logistic (inverse logit) function $\frac{1}{1+\exp(-X_i \beta)}$ for logistic regression or the normal CDF for probit regression, but one can also think that the outcome has a Poisson distribution with mean parameter $\lambda_i = g(X_i)$, as done in Chen et al. (2018).
What you described is not how logistic regression works. First, you specify a likelihood function assuming a specific density, which in this case is a Bernoulli distribution with probability parameter $p_i = g(X_i) = \frac{1}{1+\exp(-X_i \beta)}$. The likelihood is then $L(\beta) = \prod\limits_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i}$. Then you find the values of $\beta$ that maximize the likelihood (which you can do using various algorithms). Statistically, it is a one-step procedure (though the actual method of estimation is an iterative process).
Here are the general steps for maximum likelihood estimation:
- Propose a distribution for each individual's outcome $y_i$. For a continuous outcome, we might think it is drawn from a normal distribution with mean $\mu_i$ and variance $\sigma^2$, and for a binary outcome, we might think it is drawn from a Bernoulli distribution with probability $p_i$.
- Propose a relationship between the distribution parameters and the collected variables. For a continuous outcome, we might think the mean is a linear function of the predictors, i.e., $\mu_i = g(X_i) = X_i\beta$, and the variance is constant. For a binary outcome, we might think the probability parameter is a logistic function of a linear combination of the predictors, i.e., $p_i = g(X_i) = \frac{1}{1+\exp(-X_i \beta)}$. This is called logistic regression. If we instead modeled $p_i$ with the probit (normal CDF) function, it would be probit regression.
- Specify the likelihood function as a product of the individual contributions to the likelihood, which essentially is a re-write of the proposed density functions.
- Find the parameter values that maximize the likelihood; these values are the parameter estimates. If the proposed distributions for the outcomes were correct, the estimates will be consistent for their true values. (Note that maximizing the likelihood is equivalent to maximizing the log of the likelihood, so that is often done instead because computation is easier.)
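The steps above can be sketched for logistic regression in a numpy-only example (hypothetical data with made-up "true" coefficients; the maximization uses Newton-Raphson, one of the "various algorithms" mentioned above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Steps 1-2: propose y_i ~ Bernoulli(p_i) with p_i = 1/(1 + exp(-X_i beta)).
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.2])        # made-up "true" values
p = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(1, p)

# Steps 3-4: maximize the (log-)likelihood; here via Newton-Raphson.
beta = np.zeros(2)
for _ in range(25):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p_hat)              # score (gradient of the log-lik)
    W = p_hat * (1 - p_hat)
    hess = X.T @ (X * W[:, None])         # observed information (neg. Hessian)
    beta += np.linalg.solve(hess, grad)

print(beta)   # should land near true_beta with this much data
```

At convergence the score is (numerically) zero, which is exactly the first-order condition for the likelihood maximum.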
To recap: OLS and MLE are both ways of estimating model parameters from data. MLE requires certain specifications by the user about the distribution of the outcome; if those specifications are correct, the estimates are consistent. $\text{MLE}_{lge}$ is one form of MLE with a specific distributional form specified. $\text{MLE}_{lge}$ and OLS yield the same slope estimates regardless of the true nature of the data (i.e., whether assumptions about normality are met). The estimates from each method are interpreted the same because the interpretation doesn't come from the estimation method. MLE for logistic regression is performed by specifying a different distribution for the outcomes (which are binary).
Correlated residuals in time series analysis may imply problems far worse than low efficiency: if the structure of autocorrelation implies integrated or near-integrated data, then any inferences about levels, means, variances, etc. may be spurious (with unknown direction of bias), because the population mean is undefined and the population variance is infinite (so, for example, the finite values $\bar{x}$ and $s_{x}$, and quantities derived from them, are always false estimates of the corresponding population quantities).
That's not a problem that can be resolved by increasing sample size to offset inefficiency.
If autocorrelated errors obtain in OLS, I would say that the same issues may be present (it depends on the data generating process). Again: not an issue of efficiency.
The critical caveat is whether ordering of your data is meaningful: if the order has meaning in that it relates to the data generating process then you're in trouble.
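A quick simulated illustration of why this is not an efficiency problem (a hypothetical numpy sketch): for a random walk, an integrated series, the sample variance keeps growing as you collect more data instead of settling down, while for its stationary increments it converges.

```python
import numpy as np

rng = np.random.default_rng(3)

# An integrated series: a random walk x_t = x_{t-1} + e_t.
steps = rng.normal(size=20000)   # stationary increments
walk = np.cumsum(steps)          # integrated series

# The walk's sample variance grows with the sample size (no finite
# population variance to converge to) ...
for n in (1000, 5000, 20000):
    print("walk   n =", n, "var =", walk[:n].var())

# ... while the stationary increments' sample variance settles near 1.
for n in (1000, 5000, 20000):
    print("steps  n =", n, "var =", steps[:n].var())
```

So increasing the sample size makes the estimate of the walk's "variance" larger, not better; no amount of extra data rescues the inference.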