[Math] The behavior of a certain greedy algorithm for Erdős Discrepancy Problem

co.combinatorics nt.number-theory polymath5

Let $N$ be a positive integer.
We want to find a completely multiplicative function $f(n)$ with values $\pm 1$ for $n \le N$ such that the discrepancy
$$D=\max_{n \le N} \left|\sum_{i=1}^n f(i)\right|$$
is as small as possible. This is the Erdős discrepancy problem for multiplicative functions.

Consider the following greedy algorithm:

After you have assigned the values $f(2),f(3),\dots,f(p_i)$ for the first $i$ primes, assign the value $f(p_{i+1})$ so as to minimize the maximum discrepancy $\left|\sum_{j=1}^n f(j)\right|$ over all partial sums with $n \le N$, where unassigned entries of $f$ get the value zero.
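The rule above can be sketched in a few lines of Python. This is only an illustrative reading of the description, not the code used for the numerical data further down: the function names are hypothetical, and ties are broken towards $-1$ (the `tie` parameter), which is the convention mentioned in the answer below.

```python
def primes_up_to(n):
    """Return all primes <= n (sieve of Eratosthenes)."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False] * min(2, n + 1)
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for m in range(p * p, n + 1, p):
                sieve[m] = False
    return [p for p in range(2, n + 1) if sieve[p]]


def discrepancy(f, N):
    """max over n <= N of |f(1) + ... + f(n)|; f[m] == 0 means 'unassigned'."""
    total = worst = 0
    for n in range(1, N + 1):
        total += f[n]
        worst = max(worst, abs(total))
    return worst


def extend(f, p, s, N):
    """Copy f, set f(p) = s, and fill in every m * p^k <= N multiplicatively."""
    g = f[:]
    for m in range(1, N // p + 1):
        if f[m] != 0:  # m is a product of primes smaller than p
            value, q = f[m] * s, m * p
            while q <= N:
                g[q] = value
                value, q = value * s, q * p
    return g


def greedy_edp(N, tie=-1):
    """Return (signs, D): the greedy choices f(p) for p <= N and the discrepancy."""
    f = [0] * (N + 1)
    f[1] = 1
    signs = {}
    for p in primes_up_to(N):
        first, second = extend(f, p, tie, N), extend(f, p, -tie, N)
        if discrepancy(first, N) <= discrepancy(second, N):
            signs[p], f = tie, first
        else:
            signs[p], f = -tie, second
    return signs, discrepancy(f, N)
```

Tracing `greedy_edp(10)` by hand gives $f(2)=f(3)=f(5)=f(7)=-1$ and discrepancy $2$. Each prime costs $O(N)$ work, so the sketch is only meant for small $N$.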

Question: How does this greedy algorithm perform?

Experimental or heuristic answers as well as rigorous proofs are welcomed.

For more background and related questions see this post.

Variation

Consider the same greedy algorithm when you impose the condition that $f(m)=0$ unless $m$ is square free. (On square-free integers $m$, $f$ is multiplicative and has values $\pm1$.)

Question: How does our greedy algorithm perform on the square-free version?

Namely, we would like to understand the behavior of the discrepancy of the function obtained by our greedy algorithm. While for EDP there are known examples with $\log N$ discrepancy, this is not known for the square-free version.
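For the square-free version, one possible sketch is the following. The only change from the original rule is that a prime $p$ extends the assigned values to $mp$ only, never to higher powers of $p$. The function name and the tie-break towards $-1$ are assumptions made for illustration.

```python
def greedy_squarefree(N, tie=-1):
    """Greedy signs on primes <= N with f(m) = 0 for every non-squarefree m."""
    is_prime = [True] * (N + 1)
    f = [0] * (N + 1)
    f[1] = 1
    signs = {}

    def discrepancy(g):
        total = worst = 0
        for n in range(1, N + 1):
            total += g[n]
            worst = max(worst, abs(total))
        return worst

    for p in range(2, N + 1):
        if not is_prime[p]:
            continue
        for m in range(2 * p, N + 1, p):  # sieve out the multiples of p
            is_prime[m] = False
        trial = {}
        for s in (tie, -tie):
            g = f[:]
            for m in range(1, N // p + 1):
                if f[m] != 0:  # m is squarefree with all prime factors < p
                    g[m * p] = f[m] * s  # first power of p only
            trial[s] = g
        s = tie if discrepancy(trial[tie]) <= discrepancy(trial[-tie]) else -tie
        signs[p], f = s, trial[s]
    return signs, discrepancy(f)
```

For instance, tracing $N=6$ by hand gives $f(2)=f(3)=-1$, $f(5)=+1$ and discrepancy $1$.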

Update:

The very nice answer by rlo suggests that the greedy algorithm gives discrepancy close to $N^{1/3}$ or so, and rlo expects the same for the square-free variation. Can an upper bound of $N^{1/2-\epsilon}$ be proved? What about a lower bound of $N^{\epsilon}$? Another interesting question is whether the greedy algorithm can be improved to get lower discrepancy. Our greedy algorithm ignores 0's in intervals. A greedy algorithm that ignores intervals with 0's was considered in polymath5 and, to the best of my memory, achieves discrepancy $N^{1/2}$. Maybe a clever interpolation between these two variants will do a better job than both?

Further meditation and a new variant

It seems that in our greedy algorithm the decisions we make for small primes are fairly irrelevant. A way to check it:

Run the algorithm for $N$ and test what the discrepancy is for an interval $[1,T]$ where $T$ is, say, $\sqrt N$. I would expect the answer to be roughly $\sqrt T$.

So now we can think about the following variation:

Let $a>1$ be a real number. We run the greedy algorithm above, but our decision for $f(p)$ is based only on intervals $[1,n]$ where $n \le p^a$. (Of course we consider only $n \le N$.)

Questions: Can this variant lead to lower discrepancy?

What is the optimal value of $a$?
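A minimal sketch of this windowed variant, for experimenting with different values of $a$, could look as follows. The names are hypothetical, ties are broken towards $-1$ by assumption, and the window is taken to be $\min(N, \lfloor p^a \rfloor)$, computed in floating point, so very large $p^a$ may be rounded.

```python
def greedy_windowed(N, a, tie=-1):
    """Greedy choice of f(p) judged only on partial sums up to min(N, p**a)."""
    is_prime = [True] * (N + 1)
    f = [0] * (N + 1)
    f[1] = 1
    signs = {}

    def discrepancy(g, limit):
        total = worst = 0
        for n in range(1, limit + 1):
            total += g[n]
            worst = max(worst, abs(total))
        return worst

    for p in range(2, N + 1):
        if not is_prime[p]:
            continue
        for m in range(2 * p, N + 1, p):
            is_prime[m] = False
        limit = min(N, int(p ** a))  # only intervals [1, n] with n <= p^a count
        trial = {}
        for s in (tie, -tie):
            g = f[:]
            for m in range(1, N // p + 1):
                if f[m] != 0:  # m has only prime factors < p
                    value, q = f[m] * s, m * p
                    while q <= N:  # values are still filled in up to N
                        g[q] = value
                        value, q = value * s, q * p
            trial[s] = g
        better = discrepancy(trial[tie], limit) <= discrepancy(trial[-tie], limit)
        s = tie if better else -tie
        signs[p], f = s, trial[s]
    return signs, discrepancy(f, N)
```

When $2^a \ge N$ the window always covers $[1,N]$, so the sketch reduces to the original greedy rule; smaller $a$ makes each decision more local.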

Best Answer

Update 2: Original answer below. I've put together graphs showing more than just champions, using every $N\leq 10^4$ and also every $N\equiv 0\pmod{100}$ up to $10^5$. This is for the original version, not the variant, but I'd expect that to be essentially the same. I am starting to be somewhat skeptical of my $1/3$ estimate. I'll collect some data further out, but it'll be less complete since it's more costly to collect.

Here are the graphs of $D(N)$ versus $N$. The added curves are $N^{1/3}$ and $\log N$.

Here's the graph of $\log D/\log N$ versus $N$. The horizontal line is at $1/3$. (Note the different scales.)

Original answer: I have some basic numerical observations. I hacked together some code in C++ to work on this, and would be happy to collect more focused data or to share the code. Also, whenever there was a choice of whether to assign the value $+1$ or $-1$ to $f(p)$, I chose $-1$ for consistency of output. This yields a well-defined function $f_N$ for each $N$. Let $D(N)$ denote the discrepancy of $f_N$ up to $N$.

(I've also looked at choosing $f(p)=+1$ if it's undetermined, and at $f(p)=\pm 1$ according to whether $p\equiv 1,3\pmod{4}$. In these cases, the data is essentially the same as below.)

$D(N)$ is roughly increasing, but is not monotonic. Its champion values for $N\leq10000$ are, in the form $(N,D(N))$, $(1,1)$,$(10,2)$,$(24,3)$,$(70,5)$,$(91,6)$,$(391,7)$,$(553,8)$,$(668,9)$,$(961,10)$,$(1235,11)$,$(1265,13)$,$(2561,14)$,$(2604,17)$,$(6275,18)$,$(6276,19),\dots$. This growth is more than logarithmic, and is probably polynomial -- $\log D/\log N$ hovers pretty close to $1/3$ for each of these points, so that may be the answer for the $\Omega$ result.
It appears that the functions $f_N$ may converge as $N\to\infty$, but I'm not sure of this and need more data. Let $l(N)$ denote the least prime $p$ for which the value of $f_N(p)$ is undetermined. Certainly $l(N)\geq 5$ once $N\geq 4$, and while there is a great deal of fluctuation, it appears that maybe $l(N)\geq 7$ once $N\geq40500$; I will be seeing if this is (numerically) true.
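To make the "$\log D/\log N$ hovers close to $1/3$" observation concrete, the ratio can be recomputed from the champion values quoted above. This is just arithmetic on the listed pairs; the variable and function names are illustrative.

```python
import math

# Champion values (N, D(N)) quoted in the answer for N <= 10000.
champions = [(1, 1), (10, 2), (24, 3), (70, 5), (91, 6), (391, 7), (553, 8),
             (668, 9), (961, 10), (1235, 11), (1265, 13), (2561, 14),
             (2604, 17), (6275, 18), (6276, 19)]


def exponent(N, D):
    """log D / log N, which is undefined (NaN) at N = 1."""
    return math.log(D) / math.log(N) if N > 1 else float("nan")


exponents = {N: exponent(N, D) for N, D in champions}
for N, D in champions:
    print(N, D, round(exponents[N], 4))
```

For every listed champion with $N \ge 10$ the ratio lies between $0.30$ and $0.40$.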

I hope to update this answer once I have more data.

Update: For the variation where we only look at the values of $f_N$ on squarefree integers, the behavior appears to be the same.

$(N,D(N),\log D/\log N)$:

$(1,1,\text{NaN})$
$(30,2,0.2037950471)$
$(42,3,0.2939297479)$
$(77,4,0.3191428313)$
$(190,5,0.3067334722)$
$(238,6,0.3274252273)$
$(319,8,0.3606890916)$
$(939,9,0.3210056698)$
$(1358,10,0.3191931033)$
$(1461,11,0.3290703914)$
$(2185,13,0.3335707591)$
$(2769,14,0.3329519195)$
$(3354,15,0.3335896252)$
$(3689,17,0.3449622741)$

Related Solutions

[Math] The conjecture of Montgomery and Soundararajan on primes in short intervals: Empirical inconsistencies

There are lower order terms in the work of Montgomery and Soundararajan that may account for the discrepancies you're observing. If you look at Theorem 3 of the paper that you linked, you'll find that the standard deviation should really be $$ \frac{\sqrt{y (\log \frac xy +B)}}{\log x}, $$ where $B=1-\gamma-\log (2\pi)=-1.415\ldots$. This is asymptotically the same as what you have, but numerically the second order term can make a difference. Note also that the lower order term becomes more significant as $y$ gets larger, which is a feature that you see in your data.
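A quick numerical sketch of the size of this correction is given below. It assumes that the leading-order standard deviation being compared against is $\sqrt{y \log(x/y)}/\log x$, which is an assumption about the asker's formula; the function names are hypothetical.

```python
import math

# Euler-Mascheroni constant, entered by hand (Python's math module has no such constant).
EULER_GAMMA = 0.5772156649015329
B = 1 - EULER_GAMMA - math.log(2 * math.pi)


def sigma_leading(x, y):
    """Assumed leading-order standard deviation sqrt(y log(x/y)) / log x."""
    return math.sqrt(y * math.log(x / y)) / math.log(x)


def sigma_corrected(x, y):
    """Standard deviation with the lower-order term B included."""
    return math.sqrt(y * (math.log(x / y) + B)) / math.log(x)


def correction_ratio(x, y):
    """sigma_corrected / sigma_leading = sqrt(1 + B / log(x/y))."""
    return sigma_corrected(x, y) / sigma_leading(x, y)
```

Since $B<0$, the ratio is below $1$ and moves further from $1$ as $y$ grows with $x$ fixed, in line with the remark that the lower-order term matters more for larger $y$.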

So this is really a question for you: whether taking the new standard deviation with lower order terms gives you values for $\sigma$ closer to $1$. I'd be curious to know the revised numerics.

[Math] Small quotients of smooth numbers

It seems unlikely that one can prove anything nontrivial, but it's still interesting to consider what ought to be true, and to experimentally compute for small $k$.

Let

$$ \delta_k = \min_{\ell_1<\ell_2} \left(\frac{n_{\ell_2}}{n_{\ell_1}} - 1\right) = \min_{\ell} \left(\frac{n_{\ell+1}}{n_\ell} - 1\right), $$ so we're asking how small $\delta_k$ can get. An easy upper bound is $\delta_k \ll k \log k \, / \, 2^k$, and one can probably save some power of $k$. The right answer is probably $\delta_k \sim C^{-k + o(k)}$ for some constant $C>2$, and it seems reasonable to guess that $C=3$. I'll explain this next, followed by computational techniques that make it feasible to determine $\delta_k$ at least for $k \leq 36$; for example $$ \delta_{28} = \delta_{29} = \delta_{30} = \frac1{1079415718589} \doteq 3^{-25.22} $$ (the denominator is $13 \cdot 53 \cdot 59 \cdot 61 \cdot 67 \cdot 73 \cdot 89 = 2 \cdot 3 \cdot 5 \cdot 11 \cdot 17 \cdot 19 \cdot 31 \cdot 43 \cdot 71 \cdot 107 - 1$), and $$ \delta_{36} = \frac{145948}{123657879146878688901} \doteq 3^{-31.29} $$ with $$ 1 + \delta_{36} = \frac {7 \cdot 13 \cdot 19 \cdot 37 \cdot 41 \cdot 47 \cdot 73 \cdot 83 \cdot 89 \cdot 97 \cdot 127 \cdot 151} {3 \cdot 17 \cdot 23 \cdot 43 \cdot 59 \cdot 61 \cdot 67 \cdot 71 \cdot 79 \cdot 101 \cdot 131 \cdot 137}. $$

For the upper bounds: Note that $\delta_k$ is essentially $\min_{\ell_1<\ell_2} (\log n_{\ell_2} - \log n_{\ell_1}).$ There are $2^k$ numbers $\log n_\ell$ between $\log n_1 = 0$ and $\log n_{2^k} = \sum_{i=1}^k \log p_i \sim k \log k$; so when we list them in order the average difference is $\log n_{2^k} \, / \, (2^k - 1) \sim k \log k \, / \, 2^k$, and so there must be some difference(s) no larger than that. To save a power of $k$, note that the variance of the $2^k$ numbers $\log n_\ell$ is $\frac14 \sum_{i=1}^k \log^2 p_i \sim (k/4) \log^2 k$, and a positive fraction of them must be within say two standard deviations of the mean, so we get an upper bound $\sim k^{1/2} \log k \, /\, 2^k$.
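Both bounds in the paragraph above are easy to evaluate numerically. The sketch below makes the "positive fraction within two standard deviations" step explicit via Chebyshev's inequality (at least $3/4$ of the $2^k$ values lie within $2\sigma$ of the mean); that specific constant is my choice for illustration, and the function names are hypothetical.

```python
import math


def first_primes(k):
    """Return the first k primes by trial division."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def average_log_gap(k):
    """log(p_1 ... p_k) / (2^k - 1): the mean gap between consecutive log n_l."""
    return sum(math.log(p) for p in first_primes(k)) / (2 ** k - 1)


def chebyshev_log_gap(k):
    """At least 3/4 of the 2^k values lie in a window of length 4*sigma."""
    sigma = 0.5 * math.sqrt(sum(math.log(p) ** 2 for p in first_primes(k)))
    return 4 * sigma / (0.75 * 2 ** k - 1)
```

Both functions bound $\log(1+\delta_k)$ from above. For small $k$ the plain average is the smaller of the two; the variance-based bound only wins once $k$ is moderately large.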

For the heuristics: If we had $2^k$ random numbers in an interval, we'd expect the closest pair to be about $4^{-k}$ apart. But the separations aren't independent; there are only $(3^k-1)/2$ different ratios $n_{\ell_2} / n_{\ell_1}$ (namely the values of $\prod_{i=1}^k p_i^{\alpha_i}$ with each $\alpha_i \in \{-1, 0, 1\}$ that make the product $>1$), so we expect the smallest one to have logarithm about $3^{-k}$, again up to subexponential factors.
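The count $(3^k-1)/2$ can be checked directly for small $k$ by enumerating all exponent vectors. The following is only a sanity check with hypothetical function names; it uses exact rational arithmetic and takes time $3^k$.

```python
from fractions import Fraction
from itertools import product


def first_primes(k):
    """Return the first k primes by trial division."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def ratios_above_one(k):
    """Distinct values > 1 of prod p_i^alpha_i with alpha_i in {-1, 0, 1}."""
    primes = first_primes(k)
    ratios = set()
    for alphas in product((-1, 0, 1), repeat=k):
        r = Fraction(1)
        for p, alpha in zip(primes, alphas):
            r *= Fraction(p) ** alpha
        if r > 1:
            ratios.add(r)
    return ratios
```

The smallest element of the returned set is $1+\delta_k$; for $k=3$ the set has $13$ elements and its minimum is $6/5$.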

For small $k$ we can compute $\delta_k$ exactly by listing the $2^k$ factors of $\prod_{i=1}^k p_i$, sorting them, setting $\delta=2$, comparing each $n_{\ell+1} / n_\ell$ with the current value of $\delta$, and if $n_{\ell+1} / n_\ell$ is smaller then making it the new $\delta$. This takes about $2^k$ space and $k 2^k$ time.
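This brute-force procedure can be sketched as follows. It is an illustration in Python rather than the gp computation mentioned later, and the function names are hypothetical. Ratios are compared by cross-multiplication so that everything stays in exact integer arithmetic.

```python
from fractions import Fraction


def first_primes(k):
    """Return the first k primes by trial division."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def delta_bruteforce(k):
    """Return (delta_k, n_hi, n_lo) by scanning the sorted 2^k divisors (k >= 1)."""
    divisors = [1]
    for p in first_primes(k):
        divisors += [d * p for d in divisors]
    divisors.sort()
    best_hi, best_lo = 2, 1  # start from the ratio 2, as in the text
    for lo, hi in zip(divisors, divisors[1:]):
        if hi * best_lo < best_hi * lo:  # hi/lo < best_hi/best_lo
            best_hi, best_lo = hi, lo
    return Fraction(best_hi, best_lo) - 1, best_hi, best_lo
```

For example, `delta_bruteforce(4)` finds the pair $15/14$, so $\delta_4 = 1/14$, in agreement with the table below.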

We can reduce each factor $2^k$ to $3^{k/2}$ by splitting $\{p_1,\ldots,p_k\}$ into two equal or nearly equal subsets $P_1,P_2$, listing for $j=1,2$ all the $3^{|P_j|}$ rationals of the form $\prod_{p \in P_j} p^{\alpha_p}$ with each $\alpha_p \in \{-1, 0, 1\}$, merging and sorting the two lists, and minimizing over ratios between consecutive elements of different lists. This increases the feasible range by a factor of $\log_3 4 = 1.26\!+$, and is how I computed $\delta_k$ for $k \leq 36$ (in a few hours running gp on a computer on which I could allocatemem(2^37)). We next tabulate, for each $k \leq 36$, the values of $\log_3 (1 / \delta_k)$ (which does seem reasonably close to $k$), followed by the difference between the two $n_\ell$ with the ratio closest to $1$ and the values of those two $n_\ell$. When $\delta_k = \delta_{k-1}$ we use " marks instead of repeating a row.
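A small-scale sketch of the splitting idea follows. Note one substitution: instead of merging the two sorted lists and scanning consecutive elements, it looks up, for each element $a$ of the first list, the largest element $b<a$ of the second list by binary search, which finds the same minimal cross ratio $a/b>1$. Since the second list is closed under inversion, $a/b = a\cdot(1/b)$ is again a product of the required form. Function names are hypothetical, and exact fractions are used, so this is far slower than a tuned implementation.

```python
from bisect import bisect_left
from fractions import Fraction


def first_primes(k):
    """Return the first k primes by trial division."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def signed_products(primes):
    """All 3^len(primes) rationals prod p^alpha, alpha in {-1, 0, 1}, sorted."""
    out = [Fraction(1)]
    for p in primes:
        out = [r * f for r in out
               for f in (Fraction(1, p), Fraction(1), Fraction(p))]
    return sorted(out)


def delta_split(k):
    """Return (delta_k, n_hi, n_lo) using two lists of size about 3^(k/2) (k >= 1)."""
    primes = first_primes(k)
    A = signed_products(primes[: k // 2])
    B = signed_products(primes[k // 2:])
    best = None
    for a in A:
        i = bisect_left(B, a)  # B[i - 1] is the largest element of B below a
        if i > 0:
            ratio = a / B[i - 1]
            if best is None or ratio < best:
                best = ratio
    return best - 1, best.numerator, best.denominator
```

For small $k$ this reproduces rows of the table below, for example $715/714$ at $k=7$.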

 k | log_3(1/delta_k)  difference  larger n  smaller n
 1 |  0      1  2  1
 2 |  0.631  1  3  2
 3 |  1.465  1  6  5
 4 |  2.402  1  15  14
 5 |  2.771  1  22  21
 6 |  3.954  1  78  77
 7 |  5.981  1  715  714
 8 |    "    "   "   "
 9 |    "    "   "   "
10 |  7.030  1  2262  2261
11 |  8.559  1  12122  12121
12 |    "    "    "      "
13 | 10.491  1  101270  101269
14 | 10.765  7  958341  958334
15 | 13.277  1  2162095  2162094
16 | 13.385  9  21894574  21894565
17 |   "     "     "         "
18 | 14.237  269  1669770410  1669770141
19 | 15.039  296  4432525097  4432524801
20 | 16.459  95   6768250181  6768250086
21 | 17.492  1    221669903  221669902
22 | 17.989  479  183357752669  183357752190
23 | 20.727  1    7746395147  7746395146
24 | 20.899  241  2256564888159  2256564887918
25 | 22.260  31   1293752274846  1293752274815
26 | 22.260  31   1293752274846  1293752274815
27 | 23.709  8    1641739926263  1641739926255
28 | 25.220  1    1079415718590  1079415718589
29 |   "     "          "              "
30 |   "     "          "              "
31 | 28.015  3749    87225268563485259  87225268563481510
32 | 29.352  699715  70660131241710008586  70660131241709308871
33 | 30.221  208586  54759581443774708307  54759581443774499721
34 |   "       "              "                     "
35 | 31.240  4       3216928369004441  3216928369004437
36 | 31.288  145948  123657879146878834849  123657879146878688901

This seems to agree with the computations of Gerhard Paseman up to $k=20$ (except for the error already noted in the final line). I couldn't find a sequence in OEIS that matches any of it.

One could push this computation further using the data structure described in this paper by D. J. Bernstein, which reduces the space requirement from $3^{k/2}$ (or $2^k$) to the square root $3^{k/4}$ (or $2^{k/2}$) without appreciably increasing the running time. I haven't tried to implement this.

Finally, for yet larger $k$ one could probably still exhibit some values of $n_{\ell+1} / n_{\ell}$ that are reasonably close to $1$ using algorithms such as LLL to find approximate integer relations on $\{ \log p_i \mid 1 \leq i \leq k \}$ (though it would be harder to prove that one has found the minimal one). I have not tried to do this either.