Solved – How to sample from truncated distributions using scipy

Tags: distributions, numpy, python, sampling, scipy

I have min and max values for certain variables such as:

  1. Expenses
  2. Loss
  3. Growth

I'd like to add a distribution around them and plot a histogram in Python. The distribution could be beta or right-skewed gamma, since these variables tend to follow those shapes.

Based on the documentation for sampling from different distribution types, I need to define a mean and a sigma.

My function computes the mean of the two numbers that define the range and samples from a chosen distribution type.

import numpy as np
import matplotlib.pyplot as plt

def draw_samples(lo, hi):

    avg = (lo + hi) / 2
    print(avg)

    arr = []

    for i in range(200):

        # Simulate rent growth %; np.random.normal stands in for
        # whichever distribution type ends up being used
        r = np.random.normal(avg, .2, size=1)

        arr.append(r[0])

    # print("List of values", arr)
    print("Mean value", np.mean(arr))

    plt.hist(arr, density=True, bins=30)  # density=False would make counts
    plt.ylabel('Probability')
    plt.xlabel('Data')

# Call function
draw_samples(1, 6)

The values on the x-axis are not within the defined range (1, 6). How is the range of the sampled values determined?

Basically, I'd like to randomly sample n times following a certain distribution type and then compute the average value.

Say the growth variable can take values anywhere from -2% to 4%, but with a non-uniform distribution such as a right-skewed gamma, so there is only a small chance it is below 0. How do I define a truncated distribution and sample from it?

Another way of doing this could be to draw all n samples at once with r = np.random.disttype(mu, .2, size=n)?

Best Answer

Perhaps the simplest and most generic method is inverse transform sampling. All of the distributions mentioned in the OP are absolutely continuous, so it's very straightforward. Moreover, for these distributions, the key components are all implemented in scipy already, so we only need to do some algebra.

Suppose $X$ is distributed according to some CDF $G$. The truncated distribution $F$ describes how $X$ is distributed given that it's restricted to the interval $[a,b]$. This is just rescaling and shifting the CDF $G$, so for $y \in [a,b]$ we have $$ F(y) = \frac{G(y) - G(a)}{G(b)-G(a)}. $$
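As a quick sanity check of the formula, here is the truncated CDF $F$ built from a scipy CDF $G$ (the gamma parameters and interval are just illustrative choices, not from the question):

```python
import numpy as np
from scipy import stats

# Untruncated distribution G: a right-skewed gamma
g = stats.gamma(a=2.0, scale=1.5)
a, b = 1.0, 6.0  # truncation interval

def F(y):
    # Truncated CDF: rescale and shift G onto [a, b]
    return (g.cdf(y) - g.cdf(a)) / (g.cdf(b) - g.cdf(a))

# A valid CDF on [a, b]: 0 at the lower bound, 1 at the upper bound
print(F(a), F(b))
```

Any frozen continuous `scipy.stats` distribution could be substituted for the gamma here.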

Inverse transform sampling observes that, for a continuous random variable with CDF $F$, we can generate samples by applying the quantile function $F^{-1}$ to uniform draws: if $U \sim \mathrm{Uniform}(0,1)$, then $F^{-1}(U)$ has CDF $F$, because $F(F^{-1}(u))=u$.
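To see this mechanically for an untruncated distribution first (a standard normal is just a stand-in example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)   # U ~ Uniform(0, 1)
x = stats.norm.ppf(u)          # F^{-1}(U): samples with CDF F

# x is now (approximately) a standard normal sample:
# mean near 0, standard deviation near 1
print(x.mean(), x.std())
```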

Putting it all together,

$$ F^{-1}(u)=G^{-1}\left[u(G(b) - G(a)) +G(a) \right] $$

scipy implements the CDF $G$ (the cdf method) and the quantile function $G^{-1}$ (the ppf method) for all of the distributions you list, so the rest is just writing some (simple!) code:

  • draw a value of $u$ from a Uniform(0,1) distribution
  • compute $F^{-1}(u)$
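The two steps above can be sketched as follows. The helper name `sample_truncated` and the gamma parameters are my own choices; `dist` can be any frozen continuous `scipy.stats` distribution:

```python
import numpy as np
from scipy import stats

def sample_truncated(dist, a, b, size, rng=None):
    """Sample from `dist` truncated to [a, b] via inverse transform sampling.

    `dist` is a frozen scipy.stats continuous distribution:
    G = dist.cdf, G^{-1} = dist.ppf.
    """
    rng = np.random.default_rng() if rng is None else rng
    Ga, Gb = dist.cdf(a), dist.cdf(b)
    u = rng.uniform(size=size)            # draw u ~ Uniform(0, 1)
    return dist.ppf(u * (Gb - Ga) + Ga)   # F^{-1}(u) = G^{-1}[u(G(b)-G(a)) + G(a)]

# Example: a right-skewed gamma, truncated to the range [1, 6]
g = stats.gamma(a=2.0, scale=1.5)
samples = sample_truncated(g, a=1, b=6, size=200)
print(samples.mean())
```

Every sample lands inside $[1, 6]$ by construction, and the histogram of `samples` will follow the gamma's shape on that interval.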

There are lots of other algorithms to solve this problem (rejection sampling, the ziggurat algorithm, and many more), with various trade-offs among speed, precision, and implementation complexity. I think this one strikes a nice compromise.

If this seems too complicated, then you could do something even more naive and just reject any values that fall outside your desired bounds. This is inefficient, especially when the probability $G(b) - G(a)$ that the original distribution assigns to the interval $[a,b]$ is small.
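A minimal sketch of that naive rejection approach (the function name and parameters are again my own, not part of any library):

```python
import numpy as np
from scipy import stats

def sample_by_rejection(dist, a, b, size, rng=None):
    """Draw from `dist` and keep only values inside [a, b].

    Simple, but wasteful when [a, b] carries little probability mass.
    """
    rng = np.random.default_rng() if rng is None else rng
    kept = []
    while len(kept) < size:
        batch = dist.rvs(size=size, random_state=rng)
        kept.extend(batch[(batch >= a) & (batch <= b)])
    return np.asarray(kept[:size])

samples = sample_by_rejection(stats.gamma(a=2.0, scale=1.5), a=1, b=6, size=200)
```

Both approaches target the same truncated distribution; rejection just pays for the discarded draws.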