Regression – Relationship Between Least-Squares Regression and Information Theory

entropy, information theory, least squares, regression, regression coefficients

Is there a well-known relationship between least-squares regression and information theory? I've just started reading about information theory. It seems almost trivial to say that the regression coefficients from least squares tell us how much information the predictor variables hold about the dependent variable, but I don't know whether this has been studied in any depth. For a simple regression with a single predictor variable, I arrive at the following. Does it make any sense? Does it extend to multiple linear regression, general linear models, etc.?

Information theory

The information entropy of a random variable $X$ is given by

$$H(X) = \int \mathrm{P}(x)\,\mathrm{I}(x) dx = -\int {\mathrm{P}(x) \ln \mathrm{P}(x)} ~dx,$$
where $\mathrm{P}(x)$ is the probability density function, and $I(x) = -\ln \mathrm{P}(x)$ is the self-information of $x$.
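
As a quick sanity check on this definition, here is a minimal Python sketch (using a standard normal purely as an example distribution) that integrates $-\mathrm{P}(x)\ln \mathrm{P}(x)$ numerically and compares it with the Gaussian closed form $\tfrac{1}{2}\ln(2\pi e \sigma^2)$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 1.0  # standard deviation of the example Gaussian

# Differential entropy H(X) = -\int P(x) ln P(x) dx, computed numerically.
integrand = lambda x: -norm.pdf(x, scale=sigma) * norm.logpdf(x, scale=sigma)
numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for a Gaussian: 0.5 * ln(2 * pi * e * sigma^2).
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(numeric, closed_form)  # both ~1.4189 nats
```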

Least-squares

Now suppose we have a dependent variable, $y$, and a predictor variable, $p$. Using least-squares regression, we can write the simple linear model

$$y = \beta_{yp} p + e.$$
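For concreteness, here is a minimal sketch of fitting this no-intercept model to simulated data (the parameter values and variable names below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the no-intercept model y = beta_yp * p + e with illustrative values.
beta_true = 0.8
p = rng.normal(size=1_000)
y = beta_true * p + rng.normal(scale=0.5, size=1_000)

# Least-squares estimate for a single predictor with no intercept:
# beta_hat = sum(p * y) / sum(p ** 2)
beta_hat = np.dot(p, y) / np.dot(p, p)
print(beta_hat)  # close to 0.8
```
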
It seems to me that the regression coefficient $\beta_{yp}$ is a measure of the information held by $p$ about $y$. If we re-write the model as

$$y = -\ln\!\left(e^{-\beta_{yp}}\right) p + e,$$
where $\beta_{yp} = -\ln(e^{-\beta_{yp}})$, then we have the regression coefficient in a form similar to self-information. If we take the absolute value of the beta coefficient, $\mathopen|\beta_{yp}\mathclose|$, then the quantity $e^{-\mathopen|\beta_{yp}\mathclose|}$ looks like the probability density function $P(\mathopen|\beta_{yp}\mathclose|)$ over all possible $\beta$ coefficients. It relates to how likely it is that we obtained a beta coefficient with a magnitude this large. In turn, we find the information entropy to be

$$H(B) = -\int_{0}^{\infty} e^{-\mathopen|\beta_{yp}\mathclose|} \ln\!\left(e^{-\mathopen|\beta_{yp}\mathclose|}\right) d\mathopen|\beta_{yp}\mathclose| = \int_{0}^{\infty} \mathopen|\beta_{yp}\mathclose|\, e^{-\mathopen|\beta_{yp}\mathclose|}\, d\mathopen|\beta_{yp}\mathclose| = -\left(\mathopen|\beta_{yp}\mathclose| + 1\right) e^{-\mathopen|\beta_{yp}\mathclose|}\,\bigg|_{0}^{\infty} = 1.$$
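
A quick numerical check of this value, assuming as above that $\mathopen|\beta_{yp}\mathclose|$ follows the unit-rate exponential density $e^{-\mathopen|\beta_{yp}\mathclose|}$ on $[0,\infty)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

# P(b) = exp(-b) on [0, inf) is the unit-rate exponential density, with b standing
# in for |beta_yp|.  Its entropy is -\int P(b) ln P(b) db = \int b exp(-b) db.
integrand = lambda b: b * np.exp(-b)
numeric, _ = quad(integrand, 0, np.inf)

# SciPy's built-in entropy of the unit-rate exponential gives the same value.
print(numeric, expon.entropy())  # both equal 1.0 (nats)
```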

Hope that makes some sense. Let me know what you think, if you have any references or suggestions. Thanks.

Best Answer

The entropy of the random variable $Y$, $H(Y)$, is sometimes called a measure of your "uncertainty" about $Y$. What if you also know another variable, $X$? Your uncertainty about $Y$ given $X$ will go down. This reduction is called the mutual information and is written $I(Y;X) = H(Y) - H(Y|X)$. From an information point of view, I think this is what you want to know: how is the information that $X$ gives you about $Y$ reflected in the regression coefficient? I'll show how to estimate the mutual information in terms of the regression coefficient, under some assumptions.

One way to approach this problem is to assume that the variables are normally distributed. Suppose $X$ is a random variable drawn from a Gaussian with mean $\mu_x$ and standard deviation $\sigma_x$. If $y = \beta_{y,x} x + \epsilon$, where $\epsilon$ is Gaussian noise (uncorrelated with $x$) with standard deviation $\sigma_\epsilon$, then the squared Pearson correlation coefficient is $\rho^2 = \beta_{y,x}^2 / (\beta_{y,x}^2 + \sigma_\epsilon^2/\sigma_x^2)$. Under these assumptions, the correlation coefficient is related to the mutual information: $$I(Y;X) = -\tfrac{1}{2} \log(1-\rho^2) = \tfrac{1}{2} \log\left(1 + \beta_{y,x}^2 \, \sigma_x^2/\sigma_\epsilon^2\right)$$ (using the expression for the entropy of a normal distribution). Actually, this holds even under slightly weaker assumptions. If you really just want the entropy of $Y$, you can recover it from this expression via $H(Y) = I(X;Y) + H(Y|X)$. This type of analysis comes up in analyzing the capacity of the additive white Gaussian noise channel.
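
To make this concrete, here is a small simulation sketch (the parameter values are illustrative, not prescriptive) that checks the closed form above against $-\tfrac{1}{2}\log(1-\rho^2)$ computed from the sample correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameters for the Gaussian model y = beta * x + eps.
beta, sigma_x, sigma_eps, n = 1.5, 2.0, 1.0, 200_000

x = rng.normal(0.0, sigma_x, size=n)
eps = rng.normal(0.0, sigma_eps, size=n)
y = beta * x + eps

# Mutual information from the closed form: 0.5 * log(1 + beta^2 * sigma_x^2 / sigma_eps^2).
mi_formula = 0.5 * np.log(1 + beta**2 * sigma_x**2 / sigma_eps**2)

# Mutual information from the sample correlation: -0.5 * log(1 - rho^2).
rho = np.corrcoef(x, y)[0, 1]
mi_from_rho = -0.5 * np.log(1 - rho**2)

print(mi_formula, mi_from_rho)  # the two values agree closely (in nats)
```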

Just a few qualitative comments about this solution. The mutual information is always non-negative, and it is zero exactly when $\beta = 0$. I guess you can interpret $\sigma_x/\sigma_\epsilon$ as a signal-to-noise ratio (SNR). If the SNR is small, the noise washes out whatever information $x$ carries about $y$.

In general, though, the regression coefficient has no fixed relationship to mutual information: without distributional assumptions like the ones above, you can find distributions with the same $\beta$ for which the mutual information is anything at all. Heuristically, it still makes sense to think of the equation above as a lower bound.

For another connection between regression and information theory, I can also suggest this paper showing the relationship between linear regression and transfer entropy.
