Why is the FFT faster than the naive implementation of the DFT?

computational-complexity, fast-fourier-transform

I read several explanations of why the FFT algorithm has complexity $\mathcal{O}(n \log n)$ rather than the $\mathcal{O}(n^2)$ we would get by directly multiplying the input by the Vandermonde matrix, but I think I am missing something. I understand that we recursively split the problem in two, with the odd-indexed elements on one side and the even-indexed elements on the other, as shown here:

\begin{align}
X_k &= \sum_{n=0}^{N-1} x_n \cdot e^{-i~2\pi~k~n~/~N} \\
&= \sum_{m=0}^{N/2-1} x_{2m} \cdot e^{-i~2\pi~k~(2m)~/~N} + \sum_{m=0}^{N/2-1} x_{2m+1} \cdot e^{-i~2\pi~k~(2m+1)~/~N} \\
&= \sum_{m=0}^{N/2-1} x_{2m} \cdot e^{-i~2\pi~k~m~/~(N/2)} + e^{-i~2\pi~k~/~N} \sum_{m=0}^{N/2-1} x_{2m+1} \cdot e^{-i~2\pi~k~m~/~(N/2)}
\end{align}

But I don't see why this split means fewer operations. As I understand it, for each $X_k$ we still have to do $N/2$ additions and multiplications for the even part and $N/2$ additions and multiplications for the odd part, which is exactly the same as directly doing $N$ additions and multiplications for each $X_k$. So why does the FFT need fewer operations?
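For concreteness, this is the $\mathcal{O}(n^2)$ baseline I have in mind (a minimal NumPy sketch, not production code; the function name is my own):

```python
import numpy as np

def naive_dft(x):
    """Direct DFT: multiply the input by the N x N Vandermonde matrix.

    Builds W[k, n] = exp(-2j * pi * k * n / N), so the product W @ x
    performs N multiplications and N - 1 additions per output X_k,
    i.e. O(N^2) operations in total.
    """
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)  # Vandermonde matrix of roots of unity
    return W @ x
```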

Best Answer

Consider, for instance, that you have a degree-7 polynomial $f$, and you want to evaluate it at every 8th root of unity (primitive and non-primitive).

Naively (computing each power $x^k$ from scratch), this takes 28 multiplications and 7 additions per evaluation, for a total of $8 \times 35 = 280$ operations.

However, we can do much better. Split $f$ into a sum $$ f(x)=g_e(x^2)+xg_o(x^2) $$ for two degree-3 polynomials $g_e$ and $g_o$. Now, if we evaluate $g_e(x^2)$ and $xg_o(x^2)$ for only the first four $8$th roots ($1$, together with the three that have positive imaginary part), then because of the even-ness / odd-ness of the two terms, evaluating $f$ at the remaining four roots comes basically for free.
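To spell out the "for free" part: the remaining four $8$th roots are exactly the negatives of the first four, and under $x \mapsto -x$ the even term $g_e(x^2)$ is unchanged while the odd term $xg_o(x^2)$ flips sign, so $$ f(-x) = g_e(x^2) - x\,g_o(x^2). $$ Each remaining value is therefore a single subtraction of two quantities already computed.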

In total (including the squaring $x^2$), with naive polynomial evaluation, this gives 15 multiplications and 7 additions for each of the first four roots, and just a single subtraction for each of the other four: a total of $4 \times (15 + 7) + 4 = 92$ operations rather than 280. However, we aren't done with the optimisations.

Now for the really cool part: the squares of the first four $8$th roots are exactly the four $4$th roots of unity, so evaluating $g_e(x^2)$ and $g_o(x^2)$ for the first four $8$th roots amounts to evaluating $g_e(x)$ and $g_o(x)$ at all the $4$th roots of unity. So exactly the same trick can be applied again! It turns out that the paragraph above is very pessimistic; it takes far fewer than 15 multiplications and 7 additions for each of the first four $8$th roots.
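Applied all the way down, this recursion is the radix-2 decimation-in-time FFT. Here is a minimal sketch (assuming the input length is a power of two; the function name is my own):

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; the length of x must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_recursive(x[0::2])   # half-size DFT of the even-indexed samples
    odd = fft_recursive(x[1::2])    # half-size DFT of the odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Butterfly: the second half of the spectrum reuses the same products
    # with a sign flip (the "comes basically for free" step above).
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```

On a random length-8 input this agrees with `np.fft.fft` to floating-point precision.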

Even if you are clever from the start about how you evaluate $f$, say by reusing the already-computed $x^4$ when you find $x^5$ (the reduction from $28 + 7$ to $15 + 7$ operations is of this nature), the savings are considerable, because each step of the recursion essentially halves the number of points at which you need to do anything more than a single subtraction.
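In operation-count terms, the combine step costs $\mathcal{O}(N)$ butterfly operations per level, giving the standard divide-and-conquer recurrence $$ T(N) = 2\,T(N/2) + \mathcal{O}(N) \quad\Longrightarrow\quad T(N) = \mathcal{O}(N \log N), $$ since the problem can be halved $\log_2 N$ times with $\mathcal{O}(N)$ work at each level.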

Consider, for instance, how this method would calculate its last point, $f(e^{7\pi i/4})$. It would be found as $g_e(e^{14\pi i/4})+e^{7\pi i/4}g_o(e^{14\pi i/4})$. However, we have $e^{14\pi i/4}=e^{6\pi i/4}$, which means that because we previously calculated $$f(e^{3\pi i/4}) = g_e(e^{6\pi i/4}) + e^{3\pi i/4}g_o(e^{6 \pi i/4}),$$ we already have almost everything we need, and since $e^{7\pi i/4} = -e^{3\pi i/4}$ we get $$f(e^{7\pi i/4}) = g_e(e^{6\pi i/4}) - e^{3\pi i/4}g_o(e^{6 \pi i/4}).$$ This is what I mean by $f(e^{7\pi i/4})$ costing only a single subtraction.
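As a sanity check of that last identity, here is a small NumPy verification (the coefficients and names are arbitrary, chosen just for illustration):

```python
import numpy as np

# Example degree-7 polynomial with arbitrary coefficients f_0 ... f_7.
coeffs = np.arange(1.0, 9.0)           # f(x) = 1 + 2x + ... + 8x^7
g_e = coeffs[0::2]                     # even-index coefficients -> g_e
g_o = coeffs[1::2]                     # odd-index coefficients  -> g_o

def poly(c, x):
    """Evaluate a polynomial with coefficients c (lowest degree first)."""
    return sum(ck * x**k for k, ck in enumerate(c))

x = np.exp(3j * np.pi / 4)             # e^{3 pi i / 4}
y = x**2                               # e^{6 pi i / 4}

lhs = poly(coeffs, -x)                 # f(e^{7 pi i / 4}), since -x = e^{7 pi i / 4}
rhs = poly(g_e, y) - x * poly(g_o, y)  # g_e(e^{6 pi i / 4}) - e^{3 pi i / 4} g_o(e^{6 pi i / 4})
print(np.isclose(lhs, rhs))            # True
```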