Are the real and imaginary components of a frequency element of the FFT correlated?

Tags: fourier transform, independence, model-based-clustering, r, time series

I want to use model-based clustering to classify 1,225 time series (24 periods each). I have decomposed these time series using the fast Fourier transform and selected the harmonics that explain at least a threshold percentage of time series variance for all time series in the sample. I want to do model-based clustering on the real and imaginary parts of each transform element of a given time series, because that would potentially save me from having to account for temporal autocorrelation across the periods of a time series. I know that each complex element of the fast Fourier transform is independent of the other elements, but I do not know if the imaginary and real parts of a given output element are independent. I would like to know because, if they were, I could keep the default assumption of the Mclust package in R for model-based clustering, namely that the variables analyzed have a multivariate Gaussian distribution.

NOTE: The input is real-valued, and I have converted from a two-sided to a one-sided spectrum by removing redundant frequency elements and multiplying the positive frequencies (other than the mean component) by two per the advice I got from another StackOverflow answer here: https://stackoverflow.com/questions/8264530/how-do-i-calculate-amplitude-and-phase-angle-of-fft-output-from-real-valued-in

Best Answer

The ultimate line in the question,

> ...the default assumption ... that the variables analyzed have a multivariate Gaussian distribution

gives us the information needed to interpret and answer it. To avoid abstractions that might obscure the simplicity of the situation, let us consider a specific example of a time series of four elements, $\mathbf{a}=(a_0, a_1, a_2, a_3) = (a,b,c,d)$, having a multivariate Gaussian (i.e., Normal) distribution. Among other things, this means the $a_j$ are normal random variables. Their discrete Fourier Transform (DFT), according to one definition, is the sequence

$$\widehat{\mathbf{a}} = \frac{1}{2}(a+b+c+d, a+b i -c-d i, a-b + c -d, a - b i - c + d i).$$

The sequence of real parts of the DFT is therefore

$$\Re(\widehat{\mathbf{a}})=\frac{1}{2}(a+b+c+d, a-c, a-b+c-d, a-c)$$

and the sequence of imaginary parts is

$$\Im(\widehat{\mathbf{a}})=\frac{1}{2}(0, b-d, 0, -b+d).$$
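These two sequences can be checked numerically. The following is a minimal Python sketch, assuming the unitary convention with kernel $\exp(+2\pi i\,tk/n)$ and scale $1/\sqrt{n}$ (which yields the factor $1/2$ when $n=4$). The helper `dft` and the sample values of $a,b,c,d$ are illustrative choices, not anything taken from the question.

```python
import cmath

def dft(x):
    """Unitary DFT: kernel exp(+2*pi*i*t*k/n), scale 1/sqrt(n) (assumed convention)."""
    n = len(x)
    scale = 1 / n ** 0.5
    return [
        scale * sum(x[t] * cmath.exp(2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]

# Arbitrary illustrative values for (a, b, c, d)
a, b, c, d = 1.0, 2.0, 4.0, 8.0
a_hat = dft([a, b, c, d])

real_parts = [z.real for z in a_hat]
imag_parts = [z.imag for z in a_hat]

# Closed forms stated in the text
expected_real = [(a + b + c + d) / 2, (a - c) / 2, (a - b + c - d) / 2, (a - c) / 2]
expected_imag = [0.0, (b - d) / 2, 0.0, (-b + d) / 2]

print(real_parts)
print(imag_parts)
```

Up to floating-point rounding, the printed lists agree with `expected_real` and `expected_imag`. Other software may use the opposite sign in the exponent or a different scale, which flips the sign of the imaginary parts or rescales everything but does not change the dependency structure.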

These are eight real-valued random variables, of which two are degenerate (they are always zero). (A) By inspection we can detect two more linear dependencies (as we must: any six linear combinations of four variables must have at least two dependencies), leaving just four combinations

$$\eqalign{ \Re{\widehat{\mathbf{a}}_0} &= \frac{1}{2}(a+b+c+d) \\ \Re{\widehat{\mathbf{a}}}_1 = \Re{\widehat{\mathbf{a}}}_3 &= \frac{1}{2}(a-c) \\ \Re{\widehat{\mathbf{a}}}_2 &= \frac{1}{2}(a-b+c-d) \\ \Im{\widehat{\mathbf{a}}}_1 = -\Im{\widehat{\mathbf{a}}}_3 &= \frac{1}{2}(b-d).\\ }$$

It is routine to check that these are linearly independent (the determinant of the matrix of coefficients equals $1/2$, for instance). From the fact that linear combinations of marginals in a multivariate normal distribution are also multivariate normal, we see immediately that
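For readers who want to verify the determinant, here is a small Python sketch using exact rational arithmetic. The row ordering (real parts of terms 0, 1, 2, then the imaginary part of term 1) is an assumption; a different ordering can only change the sign.

```python
from fractions import Fraction

def det(m):
    """Determinant by Laplace expansion along the first row (fine for a 4x4 matrix)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, entry in enumerate(m[0]):
        # Minor: drop the first row and the current column
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * entry * det(minor)
    return total

half = Fraction(1, 2)
coeffs = [
    [half * v for v in row]
    for row in (
        [1, 1, 1, 1],    # Re a_hat_0 = (a + b + c + d) / 2
        [1, 0, -1, 0],   # Re a_hat_1 = (a - c) / 2
        [1, -1, 1, -1],  # Re a_hat_2 = (a - b + c - d) / 2
        [0, 1, 0, -1],   # Im a_hat_1 = (b - d) / 2
    )
]

determinant = det(coeffs)
print(determinant)
```

A nonzero determinant means the four combinations are linearly independent, so the map from $(a,b,c,d)$ to these four quantities is invertible.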

  1. (B) The real parts of the DFT form a three-dimensional multivariate normal distribution. The real parts of both $\widehat{\mathbf{a}}_0$ and $\widehat{\mathbf{a}}_2$, along with the real part of any linear combination of $\widehat{\mathbf{a}}_1$ and $\widehat{\mathbf{a}}_3$ not parallel to $(1,-1)$, are needed to span this space.

  2. (C) The imaginary parts of the DFT form a one-dimensional multivariate normal distribution. The imaginary part of any linear combination of $\widehat{\mathbf{a}}_1$ and $\widehat{\mathbf{a}}_3$ not parallel to $(1,1)$ is needed to span this space.

  3. (D) The real and imaginary parts of the DFT together can be assembled to form a four-dimensional multivariate normal distribution. That this has the same dimension as the length of the original series is obvious when you consider that the DFT is invertible: from the real and imaginary parts of the transform we can reconstruct the original series. (E) By diagonalizing the covariance matrix we can find linear combinations of these DFT coefficients that are uncorrelated and therefore (being jointly normal) independent.
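Statement (E) can be illustrated with a short Python sketch. Two things here are assumptions made purely for illustration: the covariance of the original series is taken to be $\text{Cov}[a_s, a_t] = \rho^{|s-t|}$ with $\rho = 0.5$, and the decorrelating combinations are obtained from a Cholesky factor rather than from an eigendecomposition. Both approaches produce uncorrelated linear combinations; the Cholesky route is used only because it is easy to write without external libraries.

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def cholesky(s):
    """Lower-triangular factor with s = low * low^T (s symmetric positive definite)."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = acc ** 0.5 if i == j else acc / low[j][j]
    return low

def invert_lower(low):
    """Inverse of a lower-triangular matrix by forward substitution."""
    n = len(low)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for i in range(col, n):
            rhs = 1.0 if i == col else 0.0
            acc = rhs - sum(low[i][k] * inv[k][col] for k in range(col, i))
            inv[i][col] = acc / low[i][i]
    return inv

# Hypothetical covariance of (a, b, c, d): rho ** |s - t|
rho = 0.5
sigma = [[rho ** abs(s - t) for t in range(4)] for s in range(4)]

# Rows: Re a_hat_0, Re a_hat_1, Re a_hat_2, Im a_hat_1 (each includes the factor 1/2)
m = [[0.5 * v for v in row] for row in
     ([1, 1, 1, 1], [1, 0, -1, 0], [1, -1, 1, -1], [0, 1, 0, -1])]

cov_dft = matmul(matmul(m, sigma), transpose(m))   # covariance of the four DFT parts
w = invert_lower(cholesky(cov_dft))                # decorrelating combinations
whitened = matmul(matmul(w, cov_dft), transpose(w))
```

Under these assumptions `cov_dft` has nonzero off-diagonal entries, so the four DFT parts are correlated, while `whitened` is the identity matrix up to rounding: each row of `w` is a linear combination of the DFT parts, and those combinations are mutually uncorrelated.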

The answer to the question is now immediate, but let's be explicit:

> I do not know if the imaginary and real parts of the output for a given output element are independent.

They are not independent (statement (A) in the foregoing example). Their real parts are not independent (B). Their imaginary parts are not independent (C). If we select appropriate linear combinations, which (if we wish) can be chosen among only the first half of the terms of $\widehat{\mathbf{a}}$ (having indexes $0$ through $2 = (4/2)$), they can be made independent (E) and form a four-dimensional multivariate Gaussian (D).

Let's make an observation about orthogonality, because that is a concept related to independence. The real and imaginary parts of the DFT are orthogonal in the sense that their inner product is always zero:

$$\eqalign{ &\langle\Re{\widehat{\mathbf{a}}}, \Im{\widehat{\mathbf{a}}}\rangle \\ &= \frac{1}{4}\left((a+b+c+d)\cdot 0 + (a-c)(b-d) + (a-b+c-d)\cdot 0 + (a-c)(-b+d)\right) \\ &= 0. }$$

Because each of these vectors is a random variable whose components are multivariate normal, we may think of them as spanning a two-dimensional subspace of $\mathbb{R}^4$ and, within that subspace, they determine a two dimensional multivariate normal distribution. The orthogonality implies independence of the real and imaginary parts considered as marginals of this distribution.
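The vanishing inner product is easy to confirm numerically. The sketch below draws random four-element series and computes the inner product of the real-part and imaginary-part sequences. It assumes the same unitary DFT convention as the example (kernel $\exp(+2\pi i\,tk/n)$, scale $1/\sqrt{n}$); the helper `dft`, the seed, and the number of draws are arbitrary illustrative choices.

```python
import cmath
import random

def dft(x):
    """Unitary DFT: kernel exp(+2*pi*i*t*k/n), scale 1/sqrt(n) (assumed convention)."""
    n = len(x)
    scale = 1 / n ** 0.5
    return [
        scale * sum(x[t] * cmath.exp(2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]

random.seed(1)
max_abs_inner = 0.0
for _ in range(100):
    series = [random.gauss(0, 1) for _ in range(4)]
    a_hat = dft(series)
    # Inner product of the real-part sequence with the imaginary-part sequence
    inner = sum(z.real * z.imag for z in a_hat)
    max_abs_inner = max(max_abs_inner, abs(inner))

print(max_abs_inner)
```

The largest absolute inner product over all draws stays at the level of floating-point rounding, as the algebra predicts. Note that this is a statement about every realization of the series, not about any particular distribution.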

Every major conclusion drawn about this particular example holds generally for the DFT of a time series of any length. The demonstrations are identical, but the coefficients will be different (instead of involving $1, i, -1, -i$ and their powers, which are the fourth roots of unity $\exp(2 j\pi i/4), j=0,1,2,3$, they will involve $n^\text{th}$ roots). For even values of $n$, you will find that the imaginary parts of the zeroth and middle ($n/2$) term are always zero and that, neglecting these, the other real and imaginary parts of terms $0$ through $n/2$ can be assembled into an $n$-dimensional multivariate Gaussian.
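The general claim can be checked for a longer series. This Python sketch uses $n = 24$ (the series length mentioned in the question) with simulated Gaussian values, and again assumes the unitary convention used above. The names `dft` and `kept` are illustrative.

```python
import cmath
import random

def dft(x):
    """Unitary DFT: kernel exp(+2*pi*i*t*k/n), scale 1/sqrt(n) (assumed convention)."""
    n = len(x)
    scale = 1 / n ** 0.5
    return [
        scale * sum(x[t] * cmath.exp(2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]

random.seed(0)
n = 24  # even length, as in the question's 24-period series
series = [random.gauss(0, 1) for _ in range(n)]
a_hat = dft(series)

# Imaginary parts of the zeroth and middle terms vanish for real input
zero_imag = [abs(a_hat[k].imag) < 1e-9 for k in (0, n // 2)]

# Terms k and n - k are complex conjugates, so the second half is redundant
mirror_ok = all(abs(a_hat[n - k] - a_hat[k].conjugate()) < 1e-9 for k in range(1, n))

# Non-degenerate, non-redundant components among terms 0 through n/2
kept = [("Re", k) for k in range(n // 2 + 1)] + [("Im", k) for k in range(1, n // 2)]
print(zero_imag, mirror_ok, len(kept))
```

The count of retained components equals $n$: there are $n/2 + 1$ real parts and $n/2 - 1$ imaginary parts, matching the dimension of the original series.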

One last consideration is whether the $n$ terms thus selected among the real and imaginary parts of the DFT are independent. The answer depends on the original distribution of $\mathbf{a}$. The calculations go like this. Consider the real parts of terms $0$ and $1$ in the DFT, which are proportional to $a+b+c+d$ and $a-c$, respectively (the common factor $1/2$ does not affect whether the covariance is zero). Then

$$\eqalign{ \text{Cov}[a+b+c+d, a-c] &= \text{Var}[a] + \text{Cov}[b,a] + \text{Cov}[c,a] \ldots - \text{Cov}[d,c]\\ &=\text{Var}[a] - \text{Var}[c] + \text{Cov}[b+d, a-c]. }$$

If $a$ and $c$ have the same variance (which can be the case for many time series models) and if $b+d$ and $a-c$ are uncorrelated (which likely is not the case for most time series models), then these two DFT coefficients are uncorrelated, whence (because they form part of a multivariate normal distribution) they are independent. In general, though, the result of this calculation is a nonzero value, implying the first two coefficients are not independent.
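To make this concrete, the sketch below evaluates the covariance for two hypothetical covariance structures of $(a,b,c,d)$: a stationary one with $\text{Cov}[a_s, a_t] = \rho^{|s-t|}$, $\rho = 0.5$, and an uncorrelated one with equal variances. Both structures and the helper `cov_linear` are assumptions chosen for illustration.

```python
def cov_linear(u, v, sigma):
    """Cov[u . x, v . x] = u^T Sigma v for a random vector x with covariance Sigma."""
    n = len(u)
    return sum(u[i] * sigma[i][j] * v[j] for i in range(n) for j in range(n))

u = [1, 1, 1, 1]   # a + b + c + d (real part of term 0, factor 1/2 dropped)
v = [1, 0, -1, 0]  # a - c         (real part of term 1, factor 1/2 dropped)

# Hypothetical stationary covariance: Cov[a_s, a_t] = rho ** |s - t|
rho = 0.5
sigma_ar = [[rho ** abs(s - t) for t in range(4)] for s in range(4)]

direct = cov_linear(u, v, sigma_ar)

# Same quantity via the expansion Var[a] - Var[c] + Cov[b + d, a - c]
expanded = (sigma_ar[0][0] - sigma_ar[2][2]
            + cov_linear([0, 1, 0, 1], v, sigma_ar))

# Uncorrelated terms with equal variances: the covariance vanishes
identity = [[1.0 if s == t else 0.0 for t in range(4)] for s in range(4)]
white = cov_linear(u, v, identity)

print(direct, expanded, white)
```

In the stationary case the variances of $a$ and $c$ cancel, yet the covariance is still $\rho^3 - \rho \ne 0$, so the first two coefficients are correlated. With uncorrelated, equal-variance terms it is exactly zero.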
