Solved – Relationship between distribution fitting and simple regression

Tags: distributions, fitting, maximum-likelihood, predictive-models, regression

This is a bit of a conceptual question that has been nagging me for a long time.

Based on a set of data, $(X_1, X_2, X_3, \ldots, X_k)$, with observations indexed $i = 1, \ldots, n$,
is there an explicit relationship between

  1. Fitting a multivariate distribution on all of the data, and
  2. Estimating a regression model on the same data?

Both concepts seem very similar, for two reasons:

  1. Estimating simple linear regression models and fitting distributions can both be accomplished using the same method, Maximum Likelihood Estimation (MLE), and
  2. After fitting a distribution (let's say the Normal) and obtaining the parameters of its pdf, one can calculate the conditional distribution, $P(X_1 | X_2, X_3, \ldots , X_k)$, which would allow one to predict values of $X_1$ based on new values ($i = n+1, n+2, \ldots$) of $X_2, \ldots, X_k$, very similarly to the way one could obtain predictions for $X_1$ by running the following regression, with a Normally distributed error term, $\epsilon$,
    $$X_1 = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k + \epsilon\, ;$$ both methods allow one to make predictions with new data, after first performing some sort of fitting.

Any insights into this connection (if it is indeed real), such as the pros/cons of fitting a distribution vs estimating a simple regression model when it comes to forecasting, would be extremely appreciated.

Best Answer

  1. Estimating linear regression models, via OLS, and fitting distributions can both be accomplished using the same method, Maximum Likelihood Estimation (MLE), and

Yes, you are correct on this. When using maximum likelihood, we are always fitting some kind of distribution to the data. The difference, however, lies in the particular kinds of distributions that we are fitting.

In a regression model, we are predicting the conditional mean (or sometimes other quantities, such as the median, quantiles, or mode) of one variable ($X_1$ in your notation) given the other variables ($X_2,X_3,\dots,X_k$), where the relationship has a functional form $f$:

$$ E(X_1|X_2,X_3,\dots,X_k) = f(X_2,X_3,\dots,X_k) $$

So, for example, with linear regression, where the assumed distribution is normal, we have

$$ X_1 \sim \mathsf{Normal}(\,f(X_2,X_3,\dots,X_k),\; \sigma^2\,) $$

where, for linear regression, $f$ is a linear function

$$ f(X_2,X_3,\dots,X_k) = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k $$

but it doesn't have to be linear in other kinds of regression models.
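To make this concrete, here is a minimal sketch in Python (synthetic data, with coefficient values chosen purely for illustration) showing that maximizing the normal log-likelihood of the model above gives the same coefficients as ordinary least squares:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data for X1 = 1 + 2*X2 - 0.5*X3 + eps, eps ~ Normal(0, 0.7^2)
rng = np.random.default_rng(0)
n = 500
X2, X3 = rng.normal(size=n), rng.normal(size=n)
X1 = 1.0 + 2.0 * X2 - 0.5 * X3 + rng.normal(scale=0.7, size=n)

# OLS via the design matrix [1, X2, X3]
D = np.column_stack([np.ones(n), X2, X3])
beta_ols, *_ = np.linalg.lstsq(D, X1, rcond=None)

# MLE: maximize the Normal(f(X2, X3), sigma^2) log-likelihood over (beta, log sigma)
def neg_log_lik(params):
    beta, log_sigma = params[:3], params[3]
    mu = D @ beta  # f(X2, X3) = beta_0 + beta_1*X2 + beta_2*X3
    return -norm.logpdf(X1, loc=mu, scale=np.exp(log_sigma)).sum()

beta_mle = minimize(neg_log_lik, x0=np.zeros(4)).x[:3]
print(beta_ols)  # close to [1.0, 2.0, -0.5]
print(beta_mle)  # matches beta_ols up to numerical tolerance
```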

On the other hand, when people talk about "just" fitting a distribution, they usually mean searching for the unknown parameters of a joint distribution of some variables. For example, if we again used a (multivariate) normal distribution, this would be something like

$$ (X_1,X_2,X_3,\dots,X_k) \sim \mathsf{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) $$

Notice the difference: here we do not assume any specific functional form for the relationship between $X_1$ and $X_2,X_3,\dots,X_k$. In regression, we choose the functional relationship that we assume for the variables, while when fitting the distribution, the relationship is governed by the choice of the distribution (e.g., in the multivariate normal distribution, it is governed by the covariance matrix).
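As a minimal sketch of what "fitting the distribution" means in this case (synthetic data, parameter values purely illustrative): for the multivariate normal, the maximum-likelihood estimates are simply the sample mean vector and the sample covariance matrix with the $1/n$ divisor:

```python
import numpy as np

# Synthetic draws from a 3-dimensional normal (true parameters are illustrative)
rng = np.random.default_rng(1)
true_mu = np.array([0.0, 1.0, -1.0])
true_Sigma = np.array([[1.0, 0.6, 0.2],
                       [0.6, 2.0, 0.5],
                       [0.2, 0.5, 1.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)  # rows are observations

mu_hat = X.mean(axis=0)                          # MLE of the mean vector
Sigma_hat = np.cov(X, rowvar=False, bias=True)   # MLE of the covariance (divides by n)
print(mu_hat)
print(Sigma_hat)
```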

  1. After fitting a distribution (let's say the Normal) and obtaining the parameters of its pdf, one can calculate the conditional distribution, $P(X_1 | X_2, X_3, \ldots , X_k)$, which would allow one to predict values of $X_1$ based on new values of $X_2, \ldots, X_k$,

What do you mean by "new values" here? A regression model could be something like

$$ \mathsf{salary}_i = \beta_0 + \beta_1 \mathsf{age}_i + \beta_2 \mathsf{gender}_i + \varepsilon_i $$

So if your data consisted of $i=1,2,\dots,n$ individuals, then you could make predictions about salary for the $(n+1)$-th individual, who was not observed in your data. However, if you picked another feature for the model, say $\mathsf{height}_i$, then the estimated regression model would tell you nothing about the relationship between height and salary. I wouldn't call the features "new values", because this would be very misleading.
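A small sketch of the distinction (toy numbers, purely illustrative): a "new value" is a new row, i.e. an unobserved individual measured on the same features, not a new column:

```python
import numpy as np

# Observed individuals: age, gender (coded 0/1), and salary (toy numbers)
age    = np.array([25.0, 32.0, 47.0, 51.0, 38.0])
gender = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
salary = np.array([40_000.0, 52_000.0, 68_000.0, 71_000.0, 60_000.0])

# Fit salary_i = b0 + b1*age_i + b2*gender_i + eps_i by least squares
D = np.column_stack([np.ones_like(age), age, gender])
beta, *_ = np.linalg.lstsq(D, salary, rcond=None)

# Prediction for the (n+1)-th individual, who is not in the data: a new row
new_person = np.array([1.0, 29.0, 0.0])  # intercept, age = 29, gender = 0
print(new_person @ beta)
# A new *feature* (e.g. height) would be a new column; the fitted model says
# nothing about it, and you would have to refit with that column included.
```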

very similarly to the way one could obtain predictions for $X_1$ by running the following regression $$X_1 = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k + \epsilon\, ;$$ both methods allow one to make predictions with new data, after first performing some sort of fitting.

You are correct that if we know the joint distribution $p(X_1,X_2,X_3,\dots,X_k)$ and the marginal distribution $p(X_2,X_3,\dots,X_k)$, then we can obtain the conditional distribution,

$$ p(X_1|X_2,X_3,\dots,X_k) = \frac{p(X_1,X_2,X_3,\dots,X_k)}{p(X_2,X_3,\dots,X_k)} $$

or conditional expectations, etc. The difference, however, is that with regression this is available right away, while in the case of the "raw" distribution, you would need to calculate those quantities from the distribution (e.g., take integrals or run a Monte Carlo simulation).
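For the multivariate normal, this conditioning step has a closed form: the conditional mean $\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x - \boldsymbol{\mu}_2)$ is itself a linear function of the conditioning values $x = (x_2,\dots,x_k)$, i.e. exactly the form of a linear regression. A minimal sketch (Python, fitting the same illustrative multivariate normal as above, with the first coordinate playing the role of $X_1$):

```python
import numpy as np

# Fit a 3-dimensional normal to synthetic data (parameters are illustrative)
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 1.0, -1.0],
                            [[1.0, 0.6, 0.2],
                             [0.6, 2.0, 0.5],
                             [0.2, 0.5, 1.5]], size=2000)
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)

def conditional_normal(mu, Sigma, x_rest):
    """Conditional distribution of the first coordinate given the remaining ones."""
    mu1, mu2 = mu[0], mu[1:]
    S11, S12, S22 = Sigma[0, 0], Sigma[0, 1:], Sigma[1:, 1:]
    w = np.linalg.solve(S22, S12)         # Sigma_22^{-1} Sigma_21
    cond_mean = mu1 + w @ (x_rest - mu2)  # linear in the conditioning variables
    cond_var = S11 - w @ S12
    return cond_mean, cond_var

# "Prediction" for X1 given observed values of the other variables
mean, var = conditional_normal(mu_hat, Sigma_hat, np.array([0.5, -0.2]))
print(mean, var)
```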

Notice also that with regression you cannot "go back" to the joint distribution, or estimate other kinds of conditional distributions (or expectations). So regression is a simplified case. "Simplified" is not bad here; for example, being simplified means that you would need much less data to get reliable estimates compared to a more complicated model.