Derivative and calculus over sets such as the rational numbers

Tags: derivatives, integration, measure theory, probability distributions, probability theory

I am interested in the derivative of a function defined on a subset $S$ of $[0, 1]$. The subset in question is dense in $[0, 1]$ but has Lebesgue measure zero. My actual question can be found at the bottom of this post.

There have been a few questions on the subject, but none leading to anything interesting as far as I am concerned. See here, here (the concept of arithmetic derivative) and also here (Minkowski's question mark function, related to the material discussed here).

Generally these discussions lead to some kind of nonsense math. Here it is the opposite: I have a framework that works as far as applications and computations are concerned, but I have a hard time putting it on a sound mathematical footing. It would have to be some kind of non-standard calculus.

Perhaps the simplest example (though I have plenty of other similar cases) is as follows. Let $Z$ be a random variable defined as follows:
$$Z = \sum_{k=1}^\infty \frac{X_k}{2^k}$$

where the $X_k$'s are independently and identically distributed with a Bernoulli$(p)$ distribution. Thus $P(X_k = 1) = p$ and $P(X_k = 0) = 1-p$, with $0 < p < 1$. In short, the $X_k$'s are the binary digits of the random number $Z$.
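For readers who want to experiment, here is a minimal simulation sketch (the function name and truncation depth are my choices; 53 bits exhausts double precision):

```python
import numpy as np

def sample_Z(p, n_bits=53, size=10**6, seed=None):
    """Draw samples of Z = sum_{k>=1} X_k / 2^k with X_k i.i.d. Bernoulli(p),
    truncated to n_bits binary digits."""
    rng = np.random.default_rng(seed)
    bits = (rng.random((size, n_bits)) < p).astype(float)  # the digits X_k
    return bits @ (0.5 ** np.arange(1, n_bits + 1))        # sum of X_k / 2^k

z = sample_Z(p=0.75)
print(z.mean())  # close to E(Z) = p = 0.75
```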

There are two cases.

Case $p=\frac{1}{2}$

In this case, $Z$ has a uniform distribution on $S$, where $S$ is the set of normal numbers in $[0, 1]$. It is known that $S$ has Lebesgue measure $1$, and that $S$ is dense in $[0, 1]$. Yet it is full of holes (no rational number is normal, due to the periodicity of its digits; thus the $X_k$'s are not independent for rational numbers).

This is the simplest case. One might wonder if the density $f_Z$ (the derivative of the distribution $F_Z$) exists. Yet $f_Z(z) = 1$ for $z \in S$ works perfectly well for all purposes. It can easily be extended to $f_Z(z) = 1$ for $z \in [0, 1]$; let us denote the extended function by $\tilde{f}_Z$. You can compute all the moments using the extended $\tilde{f}_Z$ and get the right answer. If $s$ is a positive real number, then
$$E(Z^s) = \int_0^1 z^s \tilde{f}_Z(z) dz = \frac{1}{s+1}.$$

You could argue that $\tilde{f}_Z$ (and thus $f_Z$) can be obtained by inverting the above moment formula, using some kind of Laplace transform. So it seems we can bypass the concept of derivative entirely.

Case $p\neq \frac{1}{2}$

Now we are dealing with a hard nut to crack, and a wildly chaotic system: the support of $Z$ is a set $S'$ consisting of non-normal numbers in $[0, 1]$. This set $S'$ now has Lebesgue measure zero, yet it is dense in $[0, 1]$. The distribution itself poses no problem: even discrete random variables have a distribution $F_Z$ defined for all real numbers, namely $F_Z(z) = P(Z \leq z)$.

The issue is with the density $f_Z = dF_Z/dz$. It seems it should either be zero everywhere or not exist. My guess is that you might be able to define a new, workable concept of density. In the neighborhood of every point $z \in S'$, it looks like $g(z,h) = (F_Z(z+h) - F_Z(z))/h$ oscillates infinitely many times with no limit as $h\rightarrow 0$, yet these oscillations remain bounded most of the time, suggesting that averaging $g(z, h)$ around $h = 0$, over smaller and smaller values of $h$, could provide a sound definition of the density $f_Z$.
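Here is a rough numerical experiment illustrating this idea, approximating $F_Z$ by the empirical CDF of simulated samples (all names and the point $z_0 = 0.7$ are my choices; whether the running averages truly converge is precisely the open question, and for very small $h$ sampling noise dominates):

```python
import numpy as np

p, n = 0.75, 10**6
rng = np.random.default_rng(0)
bits = (rng.random((n, 53)) < p).astype(float)
z_samples = np.sort(bits @ (0.5 ** np.arange(1, 54)))

def F(z):  # empirical CDF, a proxy for F_Z
    return np.searchsorted(z_samples, z, side="right") / n

z0 = 0.7
hs = 2.0 ** -np.arange(3, 14)  # shrinking increments h
g = np.array([(F(z0 + h) - F(z0)) / h for h in hs])
print(g)                                        # oscillates, no visible limit
print(np.cumsum(g) / np.arange(1, len(g) + 1))  # running averages are tamer
```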

Again, despite the chaotic nature of the system (see how the would-be density might look in the picture below), all the following quantities exist and can be computed exactly, then confirmed by empirical evidence, even though the integrals below may not make sense:

$$E(Z) = \int_{0}^{1} z f_Z(z) dz = p \\
E(Z^2) = \int_{0}^{1} z^2 f_Z(z) dz =\frac{p}{3}(1+2p)\\
E(Z^3) = \int_{0}^{1} z^3 f_Z(z) dz =\frac{p}{7}(1+4p+2p^2)\\
E(Z^4) = \int_{0}^{1} z^4 f_Z(z) dz =\frac{p}{105}(7+46p + 44p^2+8p^3)
$$

Indeed, a general formula for $E(Z^s) = \int_0^1 z^s f_Z(z)\,dz$ is available for $s \geq 0$, given implicitly by the following functional equation (see here):
$$E(Z^s) = \frac{p}{2^s-1+p}\cdot E((1+Z)^s) .$$

In other words, we would have, under some appropriate calculus theory with a sound definition of integral and derivative:
$$\int_{S'}z^s f_Z(z)dz = \frac{p}{2^s-1+p}\cdot\int_{S'}(1+z)^s f_Z(z) dz .$$
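For integer $s$, the functional equation can be solved explicitly: expanding $E((1+Z)^s)$ binomially and cancelling the $E(Z^s)$ term on both sides gives $(2^s - 1)\,E(Z^s) = p \sum_{k=0}^{s-1} \binom{s}{k} E(Z^k)$, a recursion that reproduces the moments listed above. A sketch in exact arithmetic (names are mine):

```python
from fractions import Fraction
from math import comb

def moments(p, s_max):
    """Exact moments E(Z^s) for s = 0..s_max via the recursion
    (2^s - 1) E(Z^s) = p * sum_{k<s} C(s,k) E(Z^k)."""
    m = [Fraction(1)]  # E(Z^0) = 1
    for s in range(1, s_max + 1):
        m.append(p * sum(comb(s, k) * m[k] for k in range(s)) / (2**s - 1))
    return m

print(moments(Fraction(3, 4), 4))
# E(Z) = p, E(Z^2) = p(1+2p)/3, etc., matching the list above
```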

Here is how the density $f_Z$, if properly defined, could look for $p=0.75$ (see here and here):

[Figure: simulated would-be density $f_Z$ for $p = 0.75$]

Below is the empirical percentile distribution, for this particular $Z$:

[Figure: empirical percentile distribution of $Z$ for $p = 0.75$]

Other related problems

If instead we consider the model $Z = X_1 + X_1 X_2 + X_1 X_2 X_3 + \cdots$ with the $X_k$'s i.i.d. Bernoulli$(p)$, then $Z$ has a geometric distribution with parameter $1-p$; see section 2.2 in this article. This system is also equivalent to the binary numeration system discussed so far: see section 5 in the same article. It results in $Z$ having a standard, well-known discrete distribution. But if this time $P(X=-0.5) = 0.5 = P(X=0.5)$, then $Z$ is uniform on a subset $S$ of $[-1, 1]$, with $S$ also full of holes.
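A quick sanity check of the geometric claim by simulation (the truncation at 200 factors is my choice):

```python
import numpy as np

p, size = 0.75, 10**6
rng = np.random.default_rng(1)
# Z = X_1 + X_1 X_2 + ... counts the leading run of 1s in the X sequence
bits = rng.random((size, 200)) < p  # truncated: P(run >= 200) = p^200 is negligible
z = np.argmin(bits, axis=1)        # position of the first 0 = length of leading run
for k in range(5):
    print(k, (z == k).mean(), p**k * (1 - p))  # empirical vs. geometric pmf p^k (1-p)
```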

Here is another interesting model:

$$Z=\sqrt{X_1+\sqrt{X_2+\sqrt{X_3+\cdots}}}$$

The distribution for $X$ is as follows:
$$P(X=0) = \frac{1}{2}, \quad P(X=1) = \frac{1}{1 + \sqrt{5}}, \quad P(X=2) = \frac{3 - \sqrt{5}}{4} \qquad (\star)$$

This corresponds to a different numeration system, and the choice of $X$'s distribution is not arbitrary (see here): in short, it makes the system smoother and possibly easier to solve. By contrast, $P(X=0) = P(X=1) = P(X=2) = \frac{1}{3}$ yields a far more chaotic system, and the case $P(X=0) = P(X=1) = \frac{1}{2}$ is so wild that the support of $Z$ has huge gaps, very visible to the naked eye.
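Here is a minimal sampler for this nested-radical system under the $(\star)$ distribution (names and truncation depth are mine; since the digits are i.i.d., drawing them from the inside out gives the same distribution):

```python
import numpy as np

def sample_nested_sqrt(probs, depth=50, size=10**6, seed=None):
    """Sample Z = sqrt(X_1 + sqrt(X_2 + ...)), truncated at `depth` levels,
    with i.i.d. digits X_k in {0, 1, 2} drawn with probabilities `probs`."""
    rng = np.random.default_rng(seed)
    z = np.ones(size)       # seed the truncated radical inside the support [1, 2]
    for _ in range(depth):  # digits are i.i.d., so inside-out evaluation is fine
        z = np.sqrt(rng.choice([0.0, 1.0, 2.0], size=size, p=probs) + z)
    return z

probs = np.array([0.5, 1 / (1 + np.sqrt(5)), (3 - np.sqrt(5)) / 4])
z = sample_nested_sqrt(probs / probs.sum())  # normalize to guard against rounding
print(z.min(), z.max())  # the support lies inside [1, 2]
```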

Normal numbers in the nested square root system are very different from normal numbers in the binary numeration system. The successive digits have a very specific auto-correlation structure, and the digits $0, 1, 2$ are not evenly distributed for normal numbers in that system. It is clear that if we assume that the $X_k$'s are i.i.d., then $Z$ is (almost surely) not a normal number in that system. Yet we get a very good approximation for $F_Z$, much smoother (at least visually) than in the binary numeration system investigated earlier with $p\neq \frac{1}{2}$. In particular, $F_Z$ is very well approximated by a log function; see the chart below.

[Figure: empirical distribution of $Z$ (blue) and its log approximation (red)]

Here the blue line is the empirical distribution for $Z$, and the red line is the log approximation. Below is a spectacular chart featuring the approximation error $\epsilon(z) = F_Z(z) - \log_2(z)$. It's a fractal! (Source: see section 2.2 in this article.) In short, it is no more differentiable than a Brownian motion path, and technically the derivative $f_Z$ does not exist. Yet all moments of $Z$ can be computed exactly from the functional equation attached to that system ($F_{Z^2}=F_{X+Z}$) and confirmed empirically. Even though the distribution looks smooth to the naked eye, we are dealing here with a very chaotic system in disguise. Again we need non-standard calculus to handle the density, whose support is a dense set of Lebesgue measure zero in $[1, 2]$.

Since this fractal error curve is nowhere differentiable, $f_Z$ does not exist in the classical sense. Yet one could imagine a "density" that would look like $f_Z(z)=\frac{1}{z\log 2}$ for $z\in [1,2]$, namely the derivative of the log approximation.

[Figure: the approximation error $\epsilon(z)$, a fractal curve]

My question

Is there an existing theory to handle this type of density-like-substance? Following some advice, I also posted this question on MathOverflow, here.

Best Answer

It really looks to me like you're after the concept of a generalized function, which allows you to manipulate things that are decidedly not functions as if they were functions. The basic premise of a generalized function is that you know how it is supposed to integrate against nice functions, even though it is clearly not a function itself - which is exactly the path you seem to be going down. You can also think of generalized functions as what happens if you keep differentiating a continuous function - much in the same way that "a measure" is a reasonably good answer to the question of what happens if you differentiate an increasing function.

More formally, to define a generalized function, we start by defining some functions which are very well behaved, known as test functions. A function $f:\mathbb R\rightarrow\mathbb R$ is a test function if and only if it is smooth and compactly supported. The set of test functions is denoted $T$. We also put a topology on $T$ by the rule that a sequence $f_1,f_2,\ldots$ converges to $f$ if and only if, for each $n$, the $n$th derivatives converge uniformly to the $n$th derivative of $f$ (strictly speaking, one also asks that all the supports sit inside one common compact set).

A generalized function is then just a continuous linear map from $T$ to $\mathbb R$. The idea is that we would associate to a continuous function $g:\mathbb R\rightarrow\mathbb R$ the map $$f\mapsto \int f(x)g(x)\,dx.$$ More generally, we can associate to each measure $\mu$ the generalized function $$f\mapsto \int f(x)\,d\mu.$$ The really neat thing about this definition is that we can use tricks to start to reason about distributions the same way we would reason about functions; let me focus on your second example, as it's a nice illustration of how this reasoning works out without too much difficulty.

So, first, we would like to define the derivative of a generalized function $g$. You can think about this either as a convenient extension of the map $g\mapsto g'$ from the subset of generalized functions which are actually differentiable to the entire set, or, noting that differentiation is linear, as the transpose of the map $f\mapsto f'$ on the set of test functions viewed as an inner product space - or you can just take integration by parts as justification. In any case, you should arrive at the following formula (where we abuse the integral sign $\int g' \cdot f$ to formally mean $g'(f)$ - avoiding this abuse of notation turns out to be more confusing than embracing it!): $$\int g'\cdot f = -\int g\cdot f'.$$ This is, of course, actually true if $g$ and $f$ are differentiable and their product is compactly supported - being a consequence of integration by parts - and one can (and should) use it to define the derivative of distributions in general.
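For a concrete sanity check: when $g$ happens to be an ordinary differentiable function, the formula is literal integration by parts. A quick numerical verification with a standard bump test function (the particular $f$ and $g$ here are, of course, just my choices):

```python
import numpy as np
from scipy.integrate import quad

def f(x):  # a test function: smooth, supported on [-1, 1]
    return np.exp(-1.0 / (1.0 - x**2)) if abs(x) < 1 else 0.0

def f_prime(x):  # its derivative, by the chain rule
    return f(x) * (-2.0 * x) / (1.0 - x**2) ** 2 if abs(x) < 1 else 0.0

g, g_prime = np.sin, np.cos  # an ordinary differentiable g

lhs = quad(lambda x: g_prime(x) * f(x), -1, 1)[0]
rhs = -quad(lambda x: g(x) * f_prime(x), -1, 1)[0]
print(lhs, rhs)  # the integral of g'*f equals minus the integral of g*f'
```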

You can also define compositions of distributions with nice functions in a similar manner; let's stick with linear functions for simplicity. In particular, if $g(x)$ is a distribution, let's define $g(2x)$ in the same way as before (abusing notation even more now - just remember that $g(x)$ and $g(2x)$ are not truly functions, but symbols in their own right): $$\int g(2x)\cdot f(x)\,dx = \frac{1}2\int g(x)\cdot f\left(\frac{x}2\right)\,dx,$$ which is essentially the result of a $u$-substitution. Similarly, we could define $$\int g(2x-1)\cdot f(x)\,dx = \frac{1}2\int g(x)\cdot f\left(\frac{x+1}2\right)\,dx.$$

So, even though we can't plot these things (though, if we're at least still within the realm of measure theory - which these density functions absolutely are - we can make histograms that look like what you've done), we can reason about them by analogy to functions, in a way that is rooted formally in a theory that lets us extract results via integration later. In particular, our theory lets us, without reservation, differentiate any increasing function - and thus pins down what your values $f_Z$ have to be.

So, let $G_Z$ be the cumulative distribution function given by randomly selecting the bits of a number in $[0,1]$ from a Bernoulli distribution with parameter $p$. While we could notice some properties of $G_Z$ directly and infer things from differentiation, let's instead look at some more interesting reasoning, enabled by this machinery, directly on its derivative $g_Z$.

So, as a first step, let $Z_n$ be a discrete random variable given as the base-two value $0.b_1b_2b_3\ldots b_n$, with bits chosen from the given Bernoulli distribution. Then $g_{Z_n}$ will represent our probability mass function, and we will take it to satisfy: $$\int g_{Z_n}(x) \cdot f(x) = \mathbb E[f(Z_n)].$$ This is, of course, just summing up some of the values of $f$. However, it is helpful, because to produce the distribution of $Z_{n+1}$, all we need to do is randomly pick the first bit, then add half of a value chosen as in $Z_n$. This works out to the following equation: $$g_{Z_{n+1}}(x) = 2(1-p)\cdot g_{Z_n}(2x) + 2p\cdot g_{Z_n}(2x-1),$$ which can be explicitly verified in terms of the expectations, if desired. However, the final distribution you want ought to be (and is, in pretty much any topology one might put on the generalized functions) the limit of $g_{Z_n}$ as $n$ goes to $\infty$ - and the operations here are continuous, so we get $$g_{Z}(x) = 2(1-p)\cdot g_Z(2x) + 2p\cdot g_Z(2x-1),$$ which is exactly the sort of algebraic equation that would lead to the structures you are seeing - and gives you the iterative formula behind the sort of density plots you've made. Of course, as a measure, this is supported on the set of numbers $x$ in $[0,1]$ for which the asymptotic density of the indices at which the binary digit of $x$ is $1$ equals $p$ - so it can't be converted to a function (even via tools like the Radon-Nikodym derivative, which converts measures to functions where possible) - but, nonetheless, these generalized functions can still provide some of the framework you seem to be after.
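In mass-function terms the recursion is easy to iterate exactly, and rescaling the masses by $2^n$ produces exactly the kind of density plot shown in the question. A sketch (names are mine):

```python
import numpy as np

def density_heights(p, n):
    """Iterate g_{Z_{m+1}}(x) = 2(1-p) g_{Z_m}(2x) + 2p g_{Z_m}(2x-1) in
    probability-mass form: Z_{m+1} = (bit + Z_m)/2 maps the masses of Z_m
    onto [0, 1/2) with weight 1-p and onto [1/2, 1) with weight p.
    Returns the grid j/2^n and the masses rescaled by 2^n ("density" heights)."""
    masses = np.array([1.0])
    for _ in range(n):
        masses = np.concatenate(((1 - p) * masses, p * masses))
    return np.arange(2**n) / 2**n, masses * 2**n

grid, heights = density_heights(p=0.75, n=12)
print(heights.min(), heights.max())  # wildly oscillating heights, as in the plots
```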


As an aside, you can work with measures this way too; if you let $\mu$ be the probability measure associated to the process we've been discussing, you can write $$\mu(S) = (1-p)\cdot\mu(2S) + p\cdot\mu(2S-1)$$ (where $2S = \{2x : x\in S\}$ and $\mu$ vanishes outside $[0,1]$), which has the same structure. If you're happy to work with measures and don't want to try differentiating the measure any further, then they're a good answer to your search as well.
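On dyadic intervals, $\mu$ is computable in closed form - the measure of $[j/2^n, (j+1)/2^n)$ is the product, over the $n$ binary digits of $j$, of $p$ for each $1$ and $1-p$ for each $0$ - so the self-similarity relation can be checked exactly. A sketch in exact arithmetic (the interval encoding is mine):

```python
from fractions import Fraction

def mu(j, n, p):
    """mu of the dyadic interval [j/2^n, (j+1)/2^n): the product over the n
    binary digits of j of p (digit 1) or 1-p (digit 0); zero outside [0, 1)."""
    if j < 0 or j >= 2**n:
        return Fraction(0)
    out = Fraction(1)
    for i in range(n):
        out *= p if (j >> (n - 1 - i)) & 1 else 1 - p
    return out

p = Fraction(3, 4)
for n in range(1, 6):
    for j in range(2**n):
        # for S = [j/2^n, (j+1)/2^n), both 2S and 2S-1 are dyadic at level n-1
        assert mu(j, n, p) == (1 - p) * mu(j, n - 1, p) + p * mu(j - 2**(n - 1), n - 1, p)
print("self-similarity verified on all dyadic intervals up to level 5")
```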


I might note that these objects absolutely cannot see the sort of holes that you are talking about - but then, every finite measure on $\mathbb R$ is carried by a totally disconnected set (only countably many points can be atoms, so deleting a countable dense set of non-atoms leaves a totally disconnected set of full measure). Generally, in probability, the "blurring" effect of using smooth test functions is actually desirable, because the measure of an interval is not the sum of the measures of its points. This is also why one rarely ever works with individual functions when dealing with measures, but instead works with $L^p$ spaces (whose members are not functions, but rather almost-everywhere equivalence classes of functions).

To say this more generally: a function is defined by the fact that it can be evaluated at any point. This is not really something you desire when talking about probability distributions, because the quantity is too spread out to register evaluation. A measure, more usefully, is defined by the fact that it can integrate against continuous functions (e.g. as in the Riesz representation theorem) and then a distribution is a generalization which can integrate against smooth functions. The latter two objects tend to be more useful in these situations.
