[Math] How to find sequence of digits in pi

pi

I saw this project on github https://github.com/philipl/pifs, where they are trying to compress files in the pi number after the decimal. I guess this makes sense because apparently every finite sequence of digits exist in the never ending decimal numbers of pi. But I am trying to understand 1 step of their process.

So firstly from what I understand, if you want to compress a file, that is really represented in a sequence of numbers like say 471947…2846 (somehow gotten from base 16). Then, they assume that sequence is somewhere in pi.

They somehow then look up where the sequence starts in pi. This is the step I don't understand how they do. But they use a formula called Bailey–Borwein–Plouffe formula to do it.

So the compression is really two numbers $<A,B>$, where A is the index of the start number of pi, and B is the length needed.

To uncompress, its just a for loop, loop through every index from A, and repeat B times, and use that formula, to get the pi digit value, then convert it back to binary, and you have the original file again.

But it's that first step to find that initial start index using the formula I don't understand how that's done. Surely they don't brute force and try every combination in a linear fashion.

Does anyone know?

Thanks

Best Answer

The project you've found is a (deliberate!) joke.

It is true that $\pi$ is suspected to be normal in all bases, which would imply that every finite sequence of hex digits appears somewhere (indeed, many times) in the hexadecimal expansion of $\pi$.

But this cannot be used for compression -- the trouble is that the number $A$ that tells you where to find your file in $\pi$ will -- in the vast majority of cases -- be so large that storing $A$ takes up even more space than it would take to store the original file.

The BBP formula is not particularly suited for finding a particular sequence in $\pi$, except by trial and error, or by starting somewhere in $\pi$ and keep producing digits until you randomly come across the sequence you're looking for. So at first, the goal of the project would be completely impossible -- just locating a ten-byte file would take lifetimes. (That is, some multiple of the lifetime of the universe).

The kicker is in this part of the description:

Now, we all know that it can take a while to find a long sequence of digits in $\pi$ so for practical reasons, we should break the files up into smaller chunks that can be more readily found.

In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in $\pi$.

So it doesn't actually "compress" anything -- it just stores, for each byte in the file, a position in $\pi$ where that particular byte can be found. (And finding such short a segment is certainly doable by brute force). But storing such a position takes more than a byte. So all in all it's just a simple substitution code, with a particularly inefficient implementation.

(And then there's some further joking around, claiming that the indices don't count as space used because they're "metadata".)