[Math] Correlation Coefficient Distribution Function: An Apparent Discrepancy

correlationhypothesis testingprobability distributionsstatistics

I'd like to explain an apparent discrepancy between:

(1) The sample correlation distribution function between sample vectors for a bivariate, correlated random variable (correlation coefficient = $\rho$) and

(2) The sample correlation distribution function for two normally distributed random vectors that each contain a given signal with additive Gaussian noise. In this latter case, the correlation coefficient between these vectors $\rho$.

While (1) gives the standard Pearson Correlation Coefficient, (2) is more consistent with time series analysis and match filtering/correlation detection. I'd like to use a distribution function for (2) to identify the detection performance of such a correlation detector. Please read on below.

First sample correlation computation:

The "correct" way to view the correlation distribution function is as follows:

Suppose pairs $(x_{1}, s_{1})$, $(x_{2}, s_{2})$, $(x_{3}, s_{3})$, $\cdots$, $(x_{n}, s_{n})$ are a sample of size $n$ drawn from the distribution of the random vector $(S, \, X)^{\prime}$, which has a bivariate normal distribution with mean vector $(\mu_{S}, \, \mu_{X})^{\prime}$ and covariance:

\begin{equation}
\Sigma =
\begin{bmatrix}
\sigma_{S}^{2} & \sigma_{S\,X} \\[0.3em]
\sigma_{S\,X} & \sigma_{X}^{2}
\end{bmatrix},
\end{equation}

where the population correlation coefficient is:

\begin{equation}
\rho = \cfrac{\sigma_{S\,X}}{\sigma_{X}\, \sigma_{S}}
\end{equation}

The estimator for $\rho$, call it $\hat{\rho}$, is:

\begin{equation}
\hat{\rho}_{1}
= \frac{ \boldsymbol{x}^{\text{T}} \boldsymbol{s} } { \left|\left|\, \boldsymbol{x} \,\right|\right| \,\left|\left| \,\boldsymbol{s} \,\right|\right| }
\end{equation}

where $\boldsymbol{x}$ and $\boldsymbol{s}$ are concatenated $n$-dim vectors formed from from the samples $(x_{k}, s_{k})$ above ($k$ $=$ $1,2,\cdots,n$). Coefficient $\hat{\rho}$ has a Pearson's correlation coefficient distribution function, that I will call $p_{R}\left( r \,; \rho, n, \,\mathcal{H}_{1}\right)$. The expression $\mathcal{H}_{1}$ is borrowed from hypothesis decision theory. It indicates that the true correlation is nonzero, e.g., $\rho$ $\ne$ $0$.

Second sample correlation computation:

OK: Now take two vectors, $\boldsymbol{x}$ and $\boldsymbol{s}$. This time, assume:

$\boldsymbol{x}$ $\sim$ $\mathcal{N}(\boldsymbol{\mu}_{X}, \sigma_{X}^{2}\mathbf{I})$

and:

$\boldsymbol{s}$ $\sim$ $\mathcal{N}(\boldsymbol{\mu}_{S}, \sigma_{S}^{2}\mathbf{I})$

Again, compute the sample correlation distribution function:

\begin{equation}
\hat{\rho}_{2}
= \frac{ \boldsymbol{x}^{\text{T}} \boldsymbol{s} } { \left|\left|\, \boldsymbol{x} \,\right|\right| \,\left|\left| \,\boldsymbol{s} \,\right|\right| }
\end{equation}

Suppose now that $\rho$ is the same in both cases.

Inconsistency:

These coefficients DO NOT have the same distribution functions. That is, the distribution function for $\rho_{1}$ and $\rho_{2}$ do not overlap. In fact, the latter is lower variance, and shifted to the right of the first. The following figure illustrates that while $\rho_{1}$ has the "expected" Pearson correlation coefficient distribution (left panel), the second coefficient $\rho_{2}$ behaves as though it has more samples, and a higher correlation coefficient (right panel).

Correlation Distribution for Two Cases

A parameter fit for $\rho$ and $n$ using $p_{R}\left( r \,; \rho, n, \,\mathcal{H}_{1}\right)$ illustrates that the distribution for $\rho_{2}$ is parametrized differently (blue curve, right panel). Both $\rho$ and $n$ are larger than expected.

My Question:

Can we develop the correct distribution function for $\rho_{2}$? Or, can we "correct" $\rho$ and $n$ and simply use $p_{R}\left( r \,; \rho, n, \,\mathcal{H}_{1}\right)$?

To generate these figures, Matlab Code follows below:

%NOTE! Function corrdist.m is commented out at bottom of code. Save
%the function as it's own .m file and comment out to run this code.

%STEP 0: 
%PROVIDE PRELIMINARY DEFINITIONS.

clear h;
%length of time series/dimension of vector (make N even for convenience)
N       = 1e2;
%number of samples drawn from population (to make histogram). Choose a very
%large sample to make the histogram smooth.
Nsim    = 2e4;

%the true correlation coefficient (||n||^2 = sigma^2*(N-1) = 1*(N-1) ):
r0  = 1/( 1 + (N-1)./(s'*s));

%now add N(0,1) noise to signal to make Nsim sample-vectors from same
%distribution. Call result x.

%signal definition
s   = sqrt(2)*randn(N,1);
x   = repmat(s,1,Nsim) + randn(N, Nsim);

%STEP 1: 
%COMPUTE THE HISTOGRAM FOR THE CORRELATION COEFFICIENT BETWEEN TWO
%NORMALLY DISTRIBUTED VECTORS WITH SPECIFIED CORRELATION r0.

%Define a bivariate, normally distributed random variable z.
R   = chol([1,r0;r0,1]);
mu  = [0,0];

%Compute a correlation coefficient between each random vector
%This is not efficient, but it shows the reader (you) what I am doing:
c   = [];
for k = 1:Nsim,

    z       = (repmat(mu,N,1) +  randn(N,2)*R);
    c       = cat(1, c, z(:,1)'*z(:,2)/(norm(z(:,1))*norm(z(:,2))));    
end;

subplot(1,2,1);
[Nb, b]     = hist(c, floor(sqrt(length(cc))));
Nb          = Nb./trapz(b, Nb);

h(1) = bar(b, Nb,'facecolor','k','edgecolor','none');
hold on;
r       = linspace(0,1,1e4);
h(2)    = plot(r, corrdist(r, r0, N),'-r','linewidth',4);
f1      = gcf;
a1      = gca;

legend(h,{'Sample Correl. Hist.','Theor. Distr.'},'Location','Best');
legend boxoff;
title('Using Correlated Normal Random Variables');
xlabel('Coefficient');
ylabel('Normalized Density');

%STEP 2: 
%COMPUTE THE HISTOGRAM FOR THE CORRELATION COEFFICIENT BETWEEN TWO
%NORMALLY DISTRIBUTED VECTORS THAT CONSIST OF A GIVEN SIGNAL, PLUS ADDITIVE
%GAUSSIAN NOISE.

%normalize data vectors (columns) for correlation coefficient computation. 
%Take care not to include norm-computations that are below eps to 
%avoid numerical blow-up (divide by zero).
n           = sqrt( sum( x.*conj(x),1) );
m           = n(n>eps);
x(:,n<eps)  = [];
Nsim        = length(m);
temp        = repmat(m.^(-1),size(x,1),1);

%normalize x = signal + noise for correlation computation. 
x           = x.*temp;

%now compute the sample correlation between the first sample vector and the
%following independent and identically distributed column vectors in x.
cc      = x(:,1)'*x(:,2:end);

%compute histogram, and normalized histogram.
[Nb, b]     = hist(cc, floor(sqrt(length(cc))));
Nb          = Nb./trapz(b, Nb);

%NOTE: The correlation distribution function cc and the histogram do not
%agree in this case.
subplot(1,2,2);
h(3) = bar(b, Nb,'facecolor','k','edgecolor','none');
hold on;
r   = linspace(0,1,1e4);
h(4) = plot(r, corrdist(r, r0, N),'-r','linewidth',4);

%BUT: The EFFECTIVE distribution function for cc behaves as though it has
%many more degrees of freedom, and a different true correlation:

%Find the best fit correlation value and sample number to fit histogram.
dof     = fminsearch( @(p) norm( corrdist(b,p(1),p(2)) - Nb), [r0, N]);
h(5)    = plot(r, corrdist(r, dof(1), dof(2)),'-b','linewidth',4);
f2      = gcf;
a2      = gca;
ylimRef = ylim;
ylim(a1,ylimRef);
set(a2,'ytick',[]);
set(a1,'xlim',[0.27, 0.92]);
set(a2,'xlim',[0.27, 0.92]);

legend(h(3:end),{'Correl. Histogram','Theor. Distr.','Best-Fit Distr.'},'Location','Best');
legend boxoff;
title('Using Signal + Noise Vectors');
%Note that the latter,  best-fit correlation distribution fits the 
%histogram MUCH better than the correlation distribution function using the
%assumed parameters, r0 and N.

mydir   = pwd;

set(gcf,'units','normalized','outerposition',[0,1,1,1]);
set(gcf,'Units','points')
set(gcf,'PaperUnits','points')

sizeX= get(gcf,'Position');
sizeX= sizeX(3:4);
set(gcf,'PaperSize',sizeX)

set(gcf,'PaperPosition',[0,0,sizeX(1),sizeX(2)])

print -dpdf  -r300 CorrelationParadox;

> %CORRDIST.M IS DOWN HERE! 
%CORRDIST.M IS DOWN HERE!
%
% function y = corrdist(r, ro, n) 
% %This function computes the probability density function for the
% %correlation coefficient of a bivariate random variable.
% %
% % USAGES
% % y = corrdist(r, ro, n)
% %
% % INPUT
% % r:    Vector of possible correlation random variables, i.e. the values at
% %       which the pdf is evaluated at.
% % ro:   The given (true) correlation coefficient, i.e. the population
% %       correlation coefficient. length(ro) > 1 supported.
% % n:    The number of samples in the correlated data. Only length(n) = 1
% %       supported.
% % 
% % OUTPUT
% % y:    The probability density function for r, given ro, for n data
% %       samples of a bivariate normal distribution.
% %
% %-----------------------------------------------------------------------
% % Latest Edit: 11.June.2012
% % Joshua D Carmichael
% % josh.carmichael@gmail.com
% %
% % Original Author: Xu Cui, Stanford University (retrieved 11.June.2012)
% %-----------------------------------------------------------------------
% 
% %accept vectorized inputs.
% if(length(ro)> 1.5),
%     r   = repmat(r(:),1,length(ro));
%     ro  = repmat(ro(:)', length(r),1);
% end;
% 
% if( n < 120 ),
%     
%     y = (n-2) * gamma(n-1) * ((1-ro.^2).^((n-1)/2)).* (1-r.^2).^((n-4)/2);
%     y = y./ (sqrt(2*pi) * gamma(n-1/2) * (1-ro.*r).^(n-3/2));
%     y = y.* (1+ 1/4*(ro.*r+1)/(2*n-1) + 9/16*(ro.*r+1).^2 / (2*n-1)/(2*n+1));
%     
% else
%     
%     y = (n-2) * (1-ro.^2)^((n-1)/2) * (1-r.^2).^((n-4)/2);
%     y = y./ (sqrt(2*pi) * (1-ro.*r).^(n-3/2)) * n.^(-1/2);
%     y = y.* (1+ 1/4*(ro.*r+1)/(2*n-1) + 9/16*(ro.*r+1).^2 / (2*n-1)/(2*n+1));
%     
% end;
% 
% y(r>1)              = 0;
% y(r<-1)             = 0;
% y(~isfinite(y))     = 0;
% 
% return;

Best Answer

I think I have found the discrepancy.The short answer is that the distribution for case (2) is a doubly noncentral beta distribution. The bulk correlation is the same under each case.

In the solution below, I've used slightly different notation. In particular, $\boldsymbol{s }$ $\rightarrow$ $\boldsymbol{w}$. I've also used $\hat{A}$ to describe the maximum liklehood estimate for the amplitude of the sample vector $\boldsymbol{w}$ that maximally correlates with $\boldsymbol{x}$. This notation and formulation is targeted at detection theory applications.

The relative square error in approximating approximating a data stream $ \boldsymbol{x} $ using a waveform template $\boldsymbol{w }$ is the ratio of the least-squares error $\vert\vert \boldsymbol{e} \vert\vert^{2}$ to measure signal energy $ \vert \vert \boldsymbol{x} \vert \vert^{2}$. This error may be re-written as a ratio of quadratic forms that has a doubly noncentral $\beta$ distribution:

\begin{equation} \begin{split} \cfrac{\vert \vert \boldsymbol{e} \vert \vert^{2}}{\vert \vert \boldsymbol{x} \vert \vert^{2}} &= \cfrac{\vert \vert \boldsymbol{x} - \hat{A} \boldsymbol{w } \vert \vert^{2} }{ \vert \vert \boldsymbol{x} \vert \vert^{2}} \\ &= \cfrac{\vert \vert \boldsymbol{x} - \cfrac{\langle \boldsymbol{x},\, \boldsymbol{w} \rangle}{ \vert \vert \boldsymbol{w } \vert \vert^{2}} \boldsymbol{w } \vert \vert^{2}} {\vert \vert \boldsymbol{x} \vert \vert^{2}} \\ &= \cfrac{ \vert \vert P_{\boldsymbol{w}}^{\perp} \left( \boldsymbol{x} \right) \vert \vert ^{2}} { \vert \vert P_{\boldsymbol{w}}^{\perp}\left( \boldsymbol{x} \right) \vert \vert ^{2} + \vert \vert P_{\boldsymbol{w}} \left( \boldsymbol{x} \right) \vert \vert ^{2}} \\ &\overset{d}{=} \cfrac{ \chi_{1}^{2}( \lambda^{\perp} )} { \chi_{1}^{2}( \lambda ) + \chi_{N_{E} - 1}^{2}( \lambda^{\perp} ) }, \end{split} \end{equation}

where the noncentrality parameters are defined by $\lambda$ $=$ $\cfrac{\vert \vert P_{\boldsymbol{w}} \left( \boldsymbol{x} \right) \vert \vert ^{2}}{\sigma^{2}}$ and $\lambda^{\perp}$ $=$ $\cfrac{\vert \vert P_{\boldsymbol{w}}^{\perp} \left( \boldsymbol{x} \right) \vert \vert ^{2}}{\sigma^{2}}$, and where $\overset{d}{=}$ indicates distributional equality. This ratio is also related to the sample correlation coefficient $r$:

\begin{equation} \begin{split} \cfrac{\vert \vert \boldsymbol{x} - \cfrac{\langle \boldsymbol{x},\, \boldsymbol{w} \rangle}{ \vert \vert \boldsymbol{w } \vert \vert^{2}} \boldsymbol{w } \vert \vert^{2}} {\vert \vert \boldsymbol{x} \vert \vert^{2}} &= 1- \cfrac{\langle \boldsymbol{x},\, \boldsymbol{w} \rangle^{2} }{ \vert \vert \boldsymbol{w } \vert \vert^{2} \vert \vert \boldsymbol{x } \vert \vert^{2}} \\ &= 1 - r^{2} \end{split} \end{equation}

Therefore:

\begin{equation} \begin{split} r^{2} &\overset{d}{=} \cfrac{ \chi_{1}^{2}( \lambda )} { \chi_{1}^{2}( \lambda ) + \chi_{N_{E} - 1}^{2}( \lambda^{\perp} ) } \\ &\sim \text{Beta} \left( \frac{1}{2}, \frac{1}{2}N_{E} ; \lambda, \lambda^{\perp} \right) \end{split} \end{equation}

Distinct hypotheses regarding the distribution for $\boldsymbol{x}$ simplify the form for this distribution. When the data stream contains only noise, the hypothesis $\mathcal{H}_{0}$ is satisfied and $r^{2}$ has a central Beta distribution, where $\lambda^{\perp}$ $=$ $\lambda$ $=$ $0$. In the presence of signal, a data stream $\boldsymbol{x}$ will generally have a non-zero projection $P_{\boldsymbol{w}}^{\perp}\left( \boldsymbol{x} \right)$ orthogonal to the noise-contaminated template data vector $\boldsymbol{w}$. In this case $\lambda$, $\lambda^{\perp}$ $\ne$ $0$, and $r^{2}$ has doubly noncentral Beta distribution. If the template signal has a very large SNR, then $ \boldsymbol{x}$ $\cong$ $A \boldsymbol{w}$ $+$ $\boldsymbol{n}$, $\lambda^{\perp}$ $=$ $0$, and $r^{2}$ is reasonably approximated by a noncentral Beta distribution. The noncentral Beta distribution therefore provides an absolute upper bound on the detection performance of a correlation detector.

Related Question