[Math] Conditional Expectation and the floor function

integrationmathematical modelingprobabilityrandom

I have a piece of code the produces random integers.

    int c = 250 - (int) floor(rand() * 50);
    while (c < 0) {
        c = 250 - (int) floor(rand() * 50);
    }

The floor() function returns the largest floating-point value less than or equal to the argument and that is equal to an integer. The rand() function returns floating-point values drawn from the standard normal distribution. Let's assume that both functions work as intended and that the pseudo random number generator works correctly, otherwise further discussion would be pointless.

The question is, what is the expectation of the integers generated by this piece of code?

Casting the code in mathematical terms, I want to calculate the conditional expectation
$\mathbb{E}\bigl[250-\lfloor50\cdot Z\rfloor\quad\vert\quad 250-\lfloor50\cdot Z\rfloor\ge0\bigr]$ where the random variable Z is standard normal distributed, that is $Z\sim N(0,1)$. The condition stems from the fact, that the code above throws away negative integers and draws again.

Note that $\lfloor250 – 50\cdot x\rfloor=250-\lfloor 50\cdot x\rfloor$ holds, so the placement of the floor function shouldn't make a difference.

The question at this point is, is this mathematical model correctly describing what is going on in the code? I think so, if you think otherwise please leave a comment.

The reason for the linear transformation $250-50\cdot x$ above is simply that via rand() only a standard normal distribution is available. And $X=\mu-\sigma\cdot Z$ just turns an $N(0,1)$ distribution into an $N(\mu,\sigma^2)$ distribution, here $N(250,50^2)$, so the standard normal distribution is shifted to the right and widened.

Rejecting negative integers corresponds to truncating $N(250, 50^2)$ at $0$ removing the lower tail. Neglecting the floor function for a moment, my intuition is, that the truncation of the distribution and the necessary rescaling should move the mean further to the right, that is $\mathbb{E}\bigl[X\bigr]>250$ when $X\sim N_{\ge0}(250,50^2)$. Indeed Mathematica confirms

N[Expectation[x,  x \[Distributed] TruncatedDistribution[{0, \[Infinity]}, 
TransformedDistribution[250 - 50*x, x \[Distributed] NormalDistribution[]]]], 30]

250.000074335997045245285622087

By the definition of the expectation for a continuous distribution we have
$$
\mathbb{E}\bigl[250-50\cdot Z \quad \vert \quad 250-50\cdot Z\ge0\bigr]=\mathbb{E}[X]=\int_{-\infty}^\infty x f(x)\;dx
$$
with $Z\sim N(0,1)$ and $X\sim N_{\ge0}(250,50^2)$. The pdf $f(x)$ of the truncated and transformed standard normal distribution is
$$
f(x)=\frac{g(x)}{1-\Phi\left(\frac{0-250}{50}\right)}\qquad g(x)=\begin{cases}0&x<0\\\phi\left(\frac{x-250}{50}\right)&x\ge0\end{cases}
$$
where $\phi(x)$ and $\Phi(x)$ are the pdf and cdf of the standard normal distribution. The scaling factor $1-\Phi\left(\frac{0-250}{50}\right)$ is just the probability mass that is left after truncation. Dividing by it ensures that we have a distribution. And indeed
$$
\int_0^\infty f(x)\;dx=1
$$
is checked easily.

Now the only way I know of incorporating the floor function, is by treating it as what it is, a piecewise constant function. The next goal is to turn this continuous distribution into a step function. Each step collects all the floating-point values that are mapped by the floor function onto the same integer. Therefore
$$
h(a)=\int_a^{a+1}f(x)\;dx
$$
is the height of the step $[a,a+1]$ where $a\in\mathbb{N_0}$. Checking that the step function is a probability distribution yields
$$
\sum_{a=0}^\infty h(a)=1
$$
Now the sought-after expectation should simply be
$$
\mathbb{E}\bigl[250-\lfloor50\cdot Z\rfloor \quad \vert \quad 250-\lfloor50\cdot Z\rfloor\ge0\bigr]=\sum_{a=0}^\infty a\cdot h(a)
$$
A quick computation with Mathematica evaluates this sum to be

249.5000743384745...

Which is puzzling me for two reasons.

  1. My intuition is that the mean should shift to the right, not left, despite the flooring.
  2. Actually executing the code, letting it produce a sample of random integers of size 317440000 yields a mean of

    250.507
    

Calculating $h(0)$ and $h(250)$ and comparing them to their respective relative frequencies from the sample yields
$$
\begin{align*}
h(0)&=0.0000000312698&h(250)&=0.00797832\\
h(1)&=0.0000000345445&h(251)&=0.00797513\\
\hat{h}(0)&=0.0000000346522&\hat{h}(250)&=0.00796824
\end{align*}
$$
A statistician may find it amusing to test whether the hypothesis that my derivation of the expectation is correct should be rejected. For me this hints an off-by-one error and it turns out that taking the right endpoint of the step and summing
$$
\sum_{a=0}^\infty (a+1)\cdot h(a)
$$
yields

250.50007433847...

Assuming that Mathematica works correctly and the generation of the sample and its processing are correct and that this derivation actually describes what the code is doing, this suggests that I have made an off-by-one error. Unfortunately I cannot pinpoint the error.

So the questions are:

  1. Can someone pinpoint an error in the modeling?
  2. Can someone pinpoint an off-by-one error?
  3. Can someone pinpoint another error?
  4. Perhaps my intuition is flawed once again and someone can confirm the expectation of $249.5+\epsilon$ by another method?
  5. Can someone confirm the expectation of $250.5+\epsilon$ by another method?

Please accept my apology for this comparatively long posting, but I didn't have time to write a shorter one.

Best Answer

After reading through metaMSE I'll answer the question myself, as the error in the original posting was quickly found, thanks again! And heavy editing of the original post destroys the context for the comments.

As $⌊250−50⋅x⌋\neq250−⌊50⋅x⌋$ contrary to what I wrote above, the derivation of the expectation proceeds as follows. We have $250-\lfloor50\cdot X\rfloor$ where $X\sim N(0, 1)$. Whenever $250-\lfloor50\cdot X\rfloor<0$ is true we forget the result and draw again. Due to $$250<\left\lfloor50\cdot \frac{251}{50}\right\rfloor=251$$ we accept only draws within the half-open interval $X\in[-\infty, 251/50)$, consequently the normal distribution is truncated to this interval. The pdf of this truncated normal distribution is $$ f(x)=\frac{g(x)}{\Phi(251/50)}\qquad g(x)=\begin{cases}0&x\ge \frac{251}{50}\\\phi(x)&x<\frac{251}{50}\end{cases} $$ Integrating $$ \int_{-\infty}^\infty f(x)dx=\frac{1}{\Phi(251/50)}\int_{-\infty}^\infty g(x)dx=\frac{1}{\Phi(251/50)}\int_{-\infty}^{\frac{251}{50}}\phi(x)dx=\frac{\Phi(251/50)}{\Phi(251/50)}=1 $$ confirms that we have a probability distribution. To calculate the expectation $$\mathbb{E}\bigl[250-\lfloor50\cdot X\rfloor\bigr]=250-\int_{-\infty}^\infty\lfloor50\cdot x\rfloor f(x) dx $$ we use the same approach as in the original posting and treat the floor function as what it is, a piecewise constant function. Thus creating a step function allows to collect all floating-point values that are mapped by the floor function to the same integer $a$. We have $$ \begin{align} 250-\lfloor50\cdot x\rfloor&=a\in\mathbb{N}_0\\ 250-a&=\lfloor50\cdot x\rfloor\Leftrightarrow x\in\left[\frac{250-a}{50},\frac{250-a+1}{50}\right) \end{align}$$ Thus integrating over this interval $$ h(a)=\int_{\frac{250-a}{50}}^{\frac{250-a+1}{50}}f(x)dx=\frac{1}{\Phi(251/50)}\int_{\frac{250-a}{50}}^{\frac{250-a+1}{50}}g(x)dx=\frac{\Phi\left(\frac{250-a+1}{50}\right)-\Phi\left(\frac{250-a}{50}\right)}{\Phi(251/50)} $$ gives just the probability for ending up in the interval. Checking that we actually have a distribution is easy the as the sum nicely telescopes. $$ \sum_{a=0}^{\infty}h(a)=\sum_{a=0}^{\infty}\frac{\Phi\left(\frac{250-a+1}{50}\right)-\Phi\left(\frac{250-a}{50}\right)}{\Phi(251/50)}=\frac{\sum_{a=0}^{\infty}\Phi\left(\frac{250-a+1}{50}\right)-\Phi\left(\frac{250-a}{50}\right)}{\Phi(251/50)}=\frac{\Phi\left(\frac{250-0+1}{50}\right)}{\Phi(251/50)}=1$$ Thus the expectation is simply $$ \begin{align} \sum_{a=0}^\infty a\cdot h(a)&=\sum_{a=1}^\infty a\cdot\frac{\Phi\left(\frac{250-a+1}{50}\right)-\Phi\left(\frac{250-a}{50}\right)}{\Phi(251/50)}\\ &=\frac{1}{\Phi(251/50)}\left(\sum_{a=1}^\infty a\cdot \Phi\left(\frac{251-a}{50}\right)-\sum_{a=1}^\infty a\cdot \Phi\left(\frac{250-a}{50}\right)\right)\\ &=\frac{1}{\Phi(251/50)}\left(\sum_{a=0}^\infty (a+1)\cdot \Phi\left(\frac{251-(a+1)}{50}\right)-\sum_{a=1}^\infty a\cdot \Phi\left(\frac{250-a}{50}\right)\right)\\ &=\frac{1}{\Phi(251/50)}\left(\Phi\left(\frac{250}{50}\right)+\sum_{a=1}^\infty (a+1)\cdot \Phi\left(\frac{250-a}{50}\right)-a\cdot \Phi\left(\frac{250-a}{50}\right)\right)\\ &=\frac{1}{\Phi(251/50)}\left(\Phi\left(\frac{250}{50}\right)+\sum_{a=1}^\infty \Phi\left(\frac{250-a}{50}\right)\right)\\ &=\frac{\sum_{a=0}^\infty \Phi\left(\frac{250-a}{50}\right)}{\Phi(251/50)}\approx250.500067... \end{align} $$ Any chance for a closed form of the sum? Probably not.

Related Question