Is the sum of binomial coefficients over square free integers normally distributed?

binomial-coefficients, normal distribution, number theory, statistics, summation

I observed experimentally that the sum of binomial coefficients over square free integers approximately fits a normal distribution. Can this be proved or disproved theoretically?

Let $\mu(r)$ be the Möbius function. Define

$$
A_n =
\mu(1){n\choose 1} + \mu(2){n\choose 2} + \mu(3){n\choose 3} + \cdots + \mu(n){n\choose n}
$$

$$
B_n =
\mu(1)^2{n\choose 1} + \mu(2)^2{n\choose 2} + \mu(3)^2{n\choose 3} + \cdots + \mu(n)^2{n\choose n}
$$

Note that $B_n$ is nothing but the sum of the Binomial coefficients over square free integers.

Claim 1: The sequence of numbers $\dfrac{A_n}{2^n}$ is normally distributed with a mean $0$.

Claim 2: The sequence of numbers $\dfrac{\zeta(2)B_n}{2^n}$ is normally distributed with a mean $1$.

I do not have a closed form for the standard deviation in terms of well-known constants and functions. As an illustration, the histogram for $\frac{\zeta(2)B_n}{2^n}$ is given below. The blue dots are the actual distribution, while the red line represents a perfect normal distribution with the parameters $a$, $b$ and $c$ given below.

Update: Normality tests were done for $n \le 10^5$, and the observation is that as $n$ increases, the distribution fits a normal distribution better.
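The two normalized sequences in Claims 1 and 2 can be reproduced for small $n$ with a short script. The following is only a minimal sketch in Python (standard library only); the helper names `mobius_sieve` and `scaled_sums` are hypothetical and not part of the original post, and the values of $n$ in the loop are arbitrary.

```python
from math import comb, pi


def mobius_sieve(limit):
    """Return a list mu where mu[k] is the Moebius function of k (mu[0] is unused)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if is_prime[p]:
            # Each prime factor p flips the sign of mu on its multiples.
            for m in range(p, limit + 1, p):
                if m > p:
                    is_prime[m] = False
                mu[m] = -mu[m]
            # Multiples of p^2 are not square-free, so mu is 0 there.
            for m in range(p * p, limit + 1, p * p):
                mu[m] = 0
    return mu


def scaled_sums(n, mu):
    """Return (A_n / 2^n, zeta(2) * B_n / 2^n) as floats, using exact integer sums."""
    a = sum(mu[r] * comb(n, r) for r in range(1, n + 1))
    b = sum(mu[r] ** 2 * comb(n, r) for r in range(1, n + 1))
    zeta2 = pi ** 2 / 6
    return a / 2 ** n, zeta2 * b / 2 ** n


mu = mobius_sieve(200)
for n in (10, 50, 100, 200):  # arbitrary illustrative values of n
    print(n, *scaled_sums(n, mu))
```

The sums are accumulated as exact integers and divided by $2^n$ only at the end, so no precision is lost before the final conversion to a float.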

Best Answer

This is an extended comment rather than a complete answer.

I understand that you want to find the limiting distribution. Below are the results for a maximum $n$ of 10,000 (along with the associated Mathematica code):

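(* Generate data and moments *)
nMax = 10000;
(* \[Mu][[i]] is 1 when i is square-free and 0 otherwise *)
\[Mu] = Table[MoebiusMu[i]^2, {i, nMax}];
(* s[n] = Zeta[2] B_n/2^n; Binomial[n, i] vanishes for i > n, so the sum stops at n *)
s[n_] := Zeta[2] Sum[\[Mu][[i]] Binomial[n, i], {i, n}]/2^n
data = Table[{n, s[n]}, {n, 1, nMax}];
(* Running mean, standard deviation, skewness, and kurtosis of the first n values *)
values = N[data[[All, 2]]];
moments = Table[{n, Mean[values[[;; n]]],
    StandardDeviation[values[[;; n]]],
    Skewness[values[[;; n]]],
    Kurtosis[values[[;; n]]]}, {n, 2, nMax}];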

I've generated the mean, standard deviation, skewness, and kurtosis values for $n=2$ through $n=10000$. If the limiting (or approximating) distribution is normal, then the skewness should settle towards zero and the kurtosis towards 3. Here are the resulting figures:

ListPlot[{data, {{1, 1}, {nMax, 1}}}, Joined -> True, 
 AspectRatio -> 1/4,
 ImageSize -> 1000, Frame -> True, 
 FrameLabel -> (Style[#, Bold, 18] &) /@ {"n", 
    "\[Zeta](2)s(n)/\!\(\*SuperscriptBox[\(2\), \(n\)]\)"},
 PlotStyle -> Thickness[0.005], ImagePadding -> 50, PlotRange -> All]
plotIt[m_, label_, level_] := 
 ListPlot[{moments[[All, {1, m}]], {{2, level}, {nMax, level}}},
  Joined -> True, PlotRange -> All, Frame -> True, 
  FrameLabel -> (Style[#, Bold, 18] &) /@ {"n", label},
  AspectRatio -> 1/4, PlotStyle -> Thickness[0.005], 
  ImagePadding -> 50, ImageSize -> 1000]
plotIt[2, "Mean", 1]
plotIt[3, "Standard deviation", 0.01078]
plotIt[4, "Skewness", 0]
plotIt[5, "Kurtosis", 3]

The figures above do not rule out a normal distribution, and a normal distribution might still give a reasonable approximation to the proportion of values between any two specified bounds. However, the skewness does not seem to be approaching zero and the kurtosis is drifting farther away from 3, and neither observation supports a normal distribution as the limiting distribution. A slightly skewed and heavier-tailed distribution might be a better candidate for the limiting distribution.
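For readers without Mathematica, the skewness and kurtosis check can be sketched in Python. This is an illustrative sketch only: it assumes the moment-ratio definitions $m_3/m_2^{3/2}$ and $m_4/m_2^2$ built from central moments, the function name `skewness_kurtosis` is hypothetical, and the simulated samples stand in for the actual data.

```python
import random
from statistics import fmean


def skewness_kurtosis(values):
    """Return (skewness, kurtosis) as the ratios m3 / m2^1.5 and m4 / m2^2 of central moments."""
    values = list(values)
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    m4 = fmean([(v - mean) ** 4 for v in values])
    return m3 / m2 ** 1.5, m4 / m2 ** 2


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
uniform_sample = [rng.random() for _ in range(20000)]

# In theory a normal population has skewness 0 and kurtosis 3,
# while a uniform population has skewness 0 and kurtosis 1.8.
print(skewness_kurtosis(normal_sample))
print(skewness_kurtosis(uniform_sample))
```

Applying the same function to growing prefixes of the sequence $\zeta(2)B_n/2^n$ would give running curves comparable to the ones plotted above.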

From other posts I get the impression that you have values up to $n=44,000$. Figures like those above might also be suggestive for that larger data set.

Related Solutions

[Math] Normality test vs. Fitting a Gaussian curve

You say nothing about the sample size. It seems the dotted line in your plot is essentially a histogram with dots from the middle of the tops of the bars. To get such a smooth result the sample size must be large.

Small samples. For small samples, it can be very difficult to judge normality. A formal test, such as a Shapiro-Wilk or Anderson-Darling test, has very poor power for small samples. A p-value above 0.05 can be interpreted as 'consistent with normal', but a small sample might also be consistent with lots of other distributional models.

As an example, I generated a random sample of size $n = 20$ from $\mathsf{Unif}(0,1),$ and did a Shapiro-Wilk test of normality. The p-value was about $0.29 > .05,$ so this sample, known to be from a uniform population, is judged as 'consistent with normal'. [Code for this experiment in R statistical software follows.]

x = runif(20); shapiro.test(x)  

        Shapiro-Wilk normality test

data:  x
W = 0.94402, p-value = 0.2852

There is not much use making a histogram or a normal probability plot for such a small sample (except perhaps as a drill problem for homework in an elementary statistics course). Here is a stripchart of the 20 observations tested above.

stripchart(x, pch=19)

This was the first uniform sample of size 20 I tried. Was 'consistency' with normal a 'lucky' result that just happened to make my point? The answer is No: A simulation with 10,000 such samples of size 20 showed that about 80% 'pass' as 'consistent with normal'.

Why do we care whether the population from which we are sampling is normal? Often because we wonder whether it is OK to use normal-based inferential procedures, such as a t test or t interval. Unless there is marked evidence that a small sample is pretty clearly not normal (such as remarkable outliers or obvious skewness), most texts say it is OK to use t procedures.

A 95% t confidence interval for the mean of the data above is $(0.39, 0.65),$ which is hardly a 'sharp' interval, but does include the true population mean $\mu = .5.$ Of course, if these were real data (not simulated using known parameters), we would never know for sure that the CI contains the true value of $\mu.$ [A nonparametric Wilcoxon signed-rank 95% CI for the population median is $(.38, .66).$]

Large samples. For large samples, histograms and Q-Q plots are often useful. However, normality tests such as Shapiro-Wilk may too often reject a large sample that we believe must be normal as not 'consistent with normal'.

For example, here are results for a sample of size $n = 1000$ known to be normal.

y = rnorm(1000, 100, 15);  shapiro.test(y)

        Shapiro-Wilk normality test

data:  y
W = 0.99716, p-value = 0.07436

The p-value 0.07 is still above 0.05, but small enough to make one wonder if the data may not be normal (if we hadn't just simulated it to be normal). Here is a histogram with the best-fitting normal density curve (not exactly an ideal fit) and a normal probability plot (points not in quite as straight a line as one might prefer).

Why do we care about normality? If we are doing t procedures, the sample is large enough to expect very good results. For example, a 95% t confidence interval is $(99.6, 101.5),$ which is relatively short and contains the true mean $\mu = 100.$ However, if these are IQ scores of 1000 students, it may be worth noting that there are a few more students just below 100 than we might expect, and a few fewer just above 100.

Usually, samples of size 1000 generated to be normal are better behaved than the one in the example just above. I discarded three simulated samples in order to show in this Answer my fourth example that is not 'textbook perfect'.

Addendum per Comments: Consider a sample of size $n = 10,000$ from $\mathsf{Norm}(0,1).$

z = rnorm(10^4);  shapiro.test(z[1:5000])

        Shapiro-Wilk normality test

data:  z[1:5000]
W = 0.99962, p-value = 0.4764

In R, the Shapiro-Wilk test is limited to 5000 observations; here, the first half of the data are consistent with normal. The Shapiro-Wilk test uses some approximations; even so, in 10,000 tests on normal samples of size $n = 5000,$ the proportion of false rejections was about 4.3% (close to the nominal 5%).

pv = replicate(10^4, shapiro.test(rnorm(5000))$p.value)
mean(pv < .05)
## 0.0432

A 95% t confidence interval for the mean is $(-.003, .036),$ which is very short and contains the population mean $\mu = 0.$

In the figure below: The left panel shows a histogram of the sample along with the standard normal density (dashed red) and the kernel density estimate (solid dark green). For such a large sample, the population density and the KDE almost match. [Very roughly, you can think of KDE as a way to 'smooth' a histogram. You may want to google KDE and/or read Silverman's excellent book.] The center panel shows the empirical CDF (ECDF) of the sample along with the CDF of standard normal. Information is lost in reducing data to histogram bins, but not in making the ECDF, so the ECDF is generally a better match to the population CDF than the histogram is to the population density. [The ECDF sorts the data and jumps up by $1/n$ at each data value.] The right panel shows an (essentially linear) normal probability plot (Q-Q plot). Roughly speaking, a Q-Q plot is an ECDF with the 'theoretical quantile' scale distorted to give a (theoretically) linear plot for normal data.
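The ECDF described above (sort the data, then step up by $1/n$ at each observation) can be sketched in a few lines. This is a minimal Python illustration, not the code behind the figure; the function name `ecdf`, the seed, and the simulated sample are assumptions made for the example.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def ecdf(sample):
    """Return a step function F with F(x) = (number of observations <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def step(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return step


rng = random.Random(1)  # fixed seed, chosen arbitrarily
z = [rng.gauss(0.0, 1.0) for _ in range(10000)]
F = ecdf(z)
phi = NormalDist().cdf  # standard normal CDF

# Largest gap between the ECDF and the standard normal CDF, measured at the data points.
largest_gap = max(abs(F(x) - phi(x)) for x in z)
print(largest_gap)
```

For a sample of this size drawn from a standard normal population, the gap should be small, which is what the close agreement in the center panel reflects.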

[Math] How is salary a binomial distribution

Normal distributions can arise in other ways than as the limit of a binomial. In classes we are very prone to assume a normal distribution for some quantity because we have lots of theorems and z score tables that work with it. You should ignore the word salary and think "a random variable with given mean and variance" and prove or compute what you are asked for.

Salaries in particular do not follow a normal distribution. First, every normal distribution has some support below zero, but negative salaries are not realistic. Second, the tails are badly asymmetric. There is a small tail extending a huge number of standard deviations above the mean with many more events than the normal distribution predicts.
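To see how much probability a normal model places below zero, one can evaluate the normal CDF at 0. The sketch below uses Python's `statistics.NormalDist`; the mean and standard deviation are hypothetical numbers chosen only for illustration, not real salary data.

```python
from statistics import NormalDist

# Hypothetical parameters for illustration only (not taken from any data set).
mean_salary = 50_000.0
sd_salary = 20_000.0

# P(X < 0) under a normal model: small, but never exactly zero.
p_negative = NormalDist(mean_salary, sd_salary).cdf(0.0)
print(p_negative)
```

With these assumed values, zero lies 2.5 standard deviations below the mean, so the model assigns a small but positive probability to negative salaries.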
