How can mutual information between random variables decrease with increasing correlation

Tags: correlation, entropy, gaussian, mutual information

I have a problem where the mutual information between dependent random variables seems to decrease as their correlation increases, which goes against my intuition of mutual information. I'd appreciate it if anyone could explain this phenomenon or point out where my reasoning is flawed.

Setup:

I have a system of $n$ continuous and dependent Gaussian variables calculated using a Gaussian Process (these variables come from a dataset of $n$ elements). The covariance matrix of the joint distribution, $\mathbf{\Sigma} \in \mathbb{R}^{n \times n}$, is calculated using a covariance function, specifically the squared exponential kernel:

$$ k_{SE}(x,x')=\sigma^2\exp{(-\frac{(x-x')^2}{2l^2})}$$

where $x, x'$ are elements of my dataset and $\sigma^2$ and $l$ are hyperparameters representing the kernel variance and lengthscale, respectively. My understanding is that a greater $l$ leads to higher covariance values in $\mathbf{\Sigma}$ and thus to a higher correlation between the $n$ random variables of the system.
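
A minimal numerical sketch of this effect (the inputs $x=0$, $x'=1.5$ and the grid of lengthscales are arbitrary choices, not values from the question): as $l$ grows, the off-diagonal kernel value approaches $\sigma^2$, i.e. the pair of variables becomes almost perfectly correlated.

import numpy as np

# Arbitrary pair of inputs; sigma = 1 as in the script further below
sigma, x, xp = 1.0, 0.0, 1.5
for l in [0.1, 0.5, 1.0, 5.0, 50.0]:
    k_val = sigma**2 * np.exp(-(x - xp)**2 / (2 * l**2))
    print(f"l = {l:5.1f}: k(x, x') = {k_val:.4f}")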

I calculate the mutual information between two subsets of variables, $X_1$ and $X_2$, which are two non-overlapping partitions of the original set of variables. Let $\mathbf{\Sigma}_{12}$ be the covariance matrix of the joint distribution of these two subsets, and let $\mathbf{\Sigma}_1$ and $\mathbf{\Sigma}_2$ be the covariance matrices of the marginal distributions. I compute mutual information the following way:

$$ I(X_1;X_2) = H(X_1) + H(X_2) - H(X_1,X_2) $$

which, in words, translates to summing the entropy of each variable subset and subtracting the entropy of the joint. Each $H(X_i)$ is calculated using the Gaussian entropy formula:

$$ H(X_i) = 0.5\ln(\det(2\pi e \mathbf{\Sigma}_i)) $$
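
Combining these two expressions (the $(2\pi e)$ factors cancel because the dimensions of the two marginals add up to the dimension of the joint), the Gaussian mutual information can equivalently be written as a log-determinant ratio:

$$ I(X_1;X_2) = \frac{1}{2}\ln\frac{\det(\mathbf{\Sigma}_1)\det(\mathbf{\Sigma}_2)}{\det(\mathbf{\Sigma}_{12})} $$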

The issue / question:

I have observed that with increasing lengthscale parameter (and thus with increasing correlation between the random variables), the mutual information first increases and then decreases beyond a certain threshold. What causes this to happen? Am I missing something? Thanks.

An example:

The following plot illustrates the behavior I am talking about. Mutual information starts decreasing once the lengthscale exceeds roughly 1.

[Figure: mutual information vs. lengthscale]

This figure was generated with the following Python script:

import numpy as np
import matplotlib.pyplot as plt

def k(x,xp,l,s):
    """Squared exponential kernel"""
    return s**2 * np.exp(-(x-xp)**2/(2*l**2))

def MI(C_1,C_2,C_12):
    """Mutual information between Gaussian variables"""
    H_1 = 0.5*np.linalg.slogdet(2*np.pi*np.e*C_1)[1]
    H_2 = 0.5*np.linalg.slogdet(2*np.pi*np.e*C_2)[1]
    H_12 = 0.5*np.linalg.slogdet(2*np.pi*np.e*C_12)[1]
    return H_1 + H_2 - H_12


n = 10  # size of the dataset
s = 1   # stdv of the kernel function

# Dataset
X = np.random.normal(size=n)

# >> X
# array([-0.55037423,  0.03189518, -1.64622258,  0.56081342,  0.80264256,
#         0.41539512,  0.66173243,  0.12141524,  0.70379963,  0.54954399])

# The lengthscales to simulate
ls = np.linspace(0.1,10,100)

MIs = []
for l in ls:

    # The covariance matrix of the system
    C = np.zeros((n,n))
    for i, x in enumerate(X):
        for j, xp in enumerate(X):
            C[i,j] = k(x,xp,l,s)
    
    # Add small constant along diagonal for numerical stability
    C = C + 0.0001*np.eye(C.shape[0])

    # The covariance matrix of variable subsets
    # (here, the first five and last three variables of the system)
    n_1 = [0,1,2,3,4]
    n_2 = [7,8,9]

    C_1 = C[n_1][:,n_1]
    C_2 = C[n_2][:,n_2]
    C_12 = C[n_1+n_2][:,n_1+n_2]
    
    # The mutual information between both subsets of variables
    MIs.append(MI(C_1,C_2,C_12))

# Plot
fig = plt.figure(dpi=100)
plt.plot(ls, MIs)
plt.xlabel("lengthscale")
plt.ylabel("Mutual Information")
plt.title("Variation of MI with increasing lengthscale")
plt.show()
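
As a quick sanity check of the MI helper above (not part of the figure): for two unit-variance Gaussian variables with correlation $\rho$, the well-known closed form is $I = -\frac{1}{2}\ln(1-\rho^2)$, which the function should reproduce.

# Sanity check: two unit-variance Gaussians with correlation rho
rho = 0.8
C_toy = np.array([[1.0, rho], [rho, 1.0]])
print(MI(C_toy[:1, :1], C_toy[1:, 1:], C_toy))  # entropy-based computation
print(-0.5 * np.log(1 - rho**2))                # closed form, same value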

Best Answer

Intuitively, as the correlation increases, the entropies must decrease, because correlation reduces the effective degrees of freedom of the space described by the distributions. As far as the mutual information is concerned, since $I(X_1;X_2) = H(X_2) - H(X_2|X_1) \geq 0$, we have $H(X_2) \geq H(X_2|X_1)$: conditioning reduces entropy. From this we conclude that $H(X_1,X_2)=H(X_1)+H(X_2|X_1) \leq H(X_1)+H(X_2)$, i.e. the joint entropy is upper bounded by the sum of the marginal entropies.

For the mutual information to increase, the joint entropy must therefore decrease more rapidly than the sum of $H(X_1)$ and $H(X_2)$. However, because the only constraint we have is $H(X_1,X_2) \leq H(X_1)+H(X_2)$, nothing forces this to happen at every lengthscale, and the mutual information can exhibit a maximum.

Finally, note that the entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is the differential entropy, and differential entropy can be negative. For instance, for a random variable distributed uniformly on $S=[0,a]$ with density $1/a$ on the support $S$, the differential entropy is $\ln(a)$, which is negative for $a<1$.
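
To make these inequalities concrete, here is a minimal numerical sketch for a toy bivariate Gaussian with unit variances and correlation $\rho$ (a stand-in, not the GP covariance from the question). It illustrates that $H(X_2|X_1) \leq H(X_2)$, that the joint entropy is bounded by the sum of the marginals, and that the joint differential entropy can indeed become negative as $\rho \to 1$.

import numpy as np

for rho in [0.0, 0.5, 0.9, 0.999]:
    C = np.array([[1.0, rho], [rho, 1.0]])
    H1 = 0.5 * np.log(2 * np.pi * np.e * C[0, 0])            # H(X1)
    H2 = 0.5 * np.log(2 * np.pi * np.e * C[1, 1])            # H(X2)
    H12 = 0.5 * np.linalg.slogdet(2 * np.pi * np.e * C)[1]   # H(X1, X2)
    H2_given_1 = H12 - H1                                     # H(X2 | X1), by the chain rule
    I = H1 + H2 - H12                                         # mutual information
    print(f"rho={rho:5.3f}  H(X2|X1)={H2_given_1:6.3f} <= H(X2)={H2:6.3f}   "
          f"H(X1,X2)={H12:6.3f} <= H(X1)+H(X2)={H1 + H2:6.3f}   I={I:6.3f}")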