Solved – Generating sorted pseudo-random numbers in Stata

random-generation stata

Today I opened two Stata windows and ran the following commands in both:

set obs 100
gen x = rnormal()
sort x

(The only difference is that in the second window I named the variable y.) To sum up: I asked Stata for 100 pseudo-random numbers drawn from a standard normal distribution and then sorted them. To my surprise, the x and y vectors are identical! I did this at home and then at work, and as far as I can tell all of these vectors are the same. Is there an explanation for this (to me) strange behavior?

If this is a problem in Stata, does R have a better pseudo-random number generator?

A side question. I came across this "problem" because I was trying to generate two pseudo-random columns in Stata (x and y, say) and then sort them separately. But the two commands I know for sorting (sort and gsort) sort the whole dataset, not individual columns. Do you know of a Stata command that lets me sort one column while keeping the other columns fixed?

Best Answer

The help for set_seed states

The sequences these functions produce are determined by the seed, which is just a number and which is set to 123456789 every time Stata is launched.

Stata's philosophy emphasizes reproducibility, so this consistency is not surprising. Of course you can set the seed yourself. See the help page for more information.
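The same effect is easy to reproduce outside Stata. The following is a minimal Python sketch, assuming the standard `random` module as a stand-in for Stata's generator; only the seed value 123456789 comes from the help text quoted above, and the function name `sorted_normal_draws` is made up for illustration. Each call builds a fresh generator, which plays the role of a newly launched session.

```python
import random

def sorted_normal_draws(seed, n=100):
    """Draw n standard-normal values from a generator seeded with `seed`, then sort them."""
    rng = random.Random(seed)  # independent generator, like a freshly launched session
    return sorted(rng.gauss(0.0, 1.0) for _ in range(n))

# Two "sessions" that start from the same seed produce identical sorted vectors.
x = sorted_normal_draws(123456789)
y = sorted_normal_draws(123456789)
print(x == y)

# A different seed gives a different vector.
z = sorted_normal_draws(2024)
print(x == z)
```

Choosing a different seed in each session is what breaks the match, which is what setting the seed yourself achieves in Stata.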

One way to sort a column separately from all others is to preserve your data, keep only the column to sort, sort it, save the results in a temporary file, restore your data, and merge the temporary file:

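* draw a new pseudo-random column
gen y = rnormal()
* set the full dataset aside, then sort y on its own
preserve
keep y
sort y
* rename so that merge does not keep the unsorted y already in memory
rename y y_sorted
tempfile out
save `out'
* bring the full dataset back and attach the sorted column by row number
restore
merge 1:1 _n using `out', nogen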

Related Solutions

Solved – What’s the probability that from 25 random numbers between 1 and 100, the highest appears more than once

Let

$x$ be the top end of your range, $x=100$ in your case.
$n$ be the total number of draws, $n=25$ in your case.

For any number $y\le x$, the number of sequences of $n$ numbers with each number in the sequence $\le y$ is $y^n$. Of these sequences, the number containing no $y$s is $(y-1)^n$, and the number containing exactly one $y$ is $n(y-1)^{n-1}$. Hence the number of sequences with two or more $y$s is $$y^n - (y-1)^n - n(y-1)^{n-1}$$ Summing over every possible highest number $y=1,\dots,x$, the total number of sequences of $n$ numbers in which the highest number appears at least twice is \begin{align} \sum_{y=1}^x \left(y^n - (y-1)^n - n(y-1)^{n-1}\right) &= \sum_{y=1}^x y^n - \sum_{y=1}^x(y-1)^n - \sum_{y=1}^xn(y-1)^{n-1}\\ &= x^n - n\sum_{y=1}^x(y-1)^{n-1}\\ &= x^n - n\sum_{y=1}^{x-1}y^{n-1} \end{align} where the second line uses the fact that the first two sums telescope to $x^n$.

The total number of sequences is simply $x^n$. All sequences are equally likely and so the probability is $$ \frac{x^n - n\sum_{y=1}^{x-1}y^{n-1}}{x^n}$$

With $x=100,n=25$ I make the probability 0.120004212454.

I've tested this with the following Python program, which counts the matching sequences by brute force (for small $x,n$), runs a simulation, and evaluates the formula above.

import itertools
import numpy as np

def countinlist(x, n):
    """Brute force: enumerate all x**n sequences and count those whose maximum repeats."""
    count = 0
    total = 0
    for perm in itertools.product(range(1, x + 1), repeat=n):
        total += 1
        if perm.count(max(perm)) > 1:
            count += 1

    print("Counting: x", x, "n", n, "total", total, "count", count)

def simulate(x, n, N):
    """Monte Carlo: draw N random sequences and count those whose maximum repeats."""
    count = 0
    for i in range(N):
        perm = np.random.randint(1, x + 1, size=n)  # n integers from 1..x inclusive
        m = perm.max()
        if (perm == m).sum() > 1:
            count += 1
    print("Simulation: x", x, "n", n, "total", N, "count", count, "prob", count / N)

x = 100
n = 25
N = 1000000  # number of trials in simulation

# countinlist(x, n)  # only call this for reasonably small x and n!
simulate(x, n, N)
# Closed form; Python integers are arbitrary precision, so the count is exact.
formula = x**n - n * sum(i**(n - 1) for i in range(1, x))
print("Formula count", formula, "out of", x**n, "probability", "%.12f" % (formula / x**n))

This program outputted

Simulation: x 100 n 25 total 1000000 count 120071 prob 0.120071
Formula count 12000421245360277498241319178764675560017783666750 out of 100000000000000000000000000000000000000000000000000 probability 0.120004212454
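For small cases the closed form can also be compared with full enumeration in exact rational arithmetic, which avoids any floating-point doubt. This is a minimal sketch; the function names `prob_max_repeated` and `prob_by_enumeration` are made up for illustration and are not part of the program above.

```python
from fractions import Fraction
from itertools import product

def prob_max_repeated(x, n):
    """Exact probability from the closed form: (x^n - n * sum_{y=1}^{x-1} y^(n-1)) / x^n."""
    favourable = x**n - n * sum(y**(n - 1) for y in range(1, x))
    return Fraction(favourable, x**n)

def prob_by_enumeration(x, n):
    """Same probability by listing all x^n sequences (only feasible for small x and n)."""
    hits = sum(1 for seq in product(range(1, x + 1), repeat=n)
               if seq.count(max(seq)) > 1)
    return Fraction(hits, x**n)

# Smallest non-trivial case: x = 2, n = 2. Only (1, 1) and (2, 2) qualify.
print(prob_max_repeated(2, 2))
# The two methods should agree exactly on a slightly larger case.
print(prob_max_repeated(4, 3) == prob_by_enumeration(4, 3))
# The case from the question, converted to a float for display.
print(float(prob_max_repeated(100, 25)))
```

For $x=2, n=2$ the formula gives $4 - 2\cdot 1 = 2$ favourable sequences out of 4, which matches the two sequences found by hand.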

Solved – References and Best practices for setting seeds in pseudo-Random Number Generation

For what it's worth, this is based on experience and not on mathematical analysis:

I think that unless you're doing cryptography, where subtle patterns can be very bad, which seed you set doesn't make a difference, as long as you use accepted good PRNGs like Mersenne Twister and not old ones like linear congruential generators. As far as I know, there is no way to tell what random number will come out of a given seed without actually running the PRNG (assuming it's a decent one); otherwise you would just take that new algorithm and use it as your random number generator.

Another perspective: do you think that any subtle patterns in your Monte Carlo simulation are likely to be of a larger magnitude than all the measurement error, confounding, and error introduced by other modeling assumptions?

I would just use one random seed at the beginning for reproducibility, and not set one before each call, unless I'm doing debugging, where I need to make sure two different algorithms produce the same result for the exact same input data.
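To illustrate the difference between seeding once and re-seeding before each call, here is a minimal Python sketch using the standard `random` module (which is based on Mersenne Twister). The toy step `simulate_mean`, the two `run_study` variants, and the seed 42 are all made up for illustration.

```python
import random

def simulate_mean(rng, n=1000):
    """Toy Monte Carlo step: average of n uniform draws from the supplied generator."""
    return sum(rng.random() for _ in range(n)) / n

def run_study(seed, steps=3):
    """Seed once at the start, then let every step draw from the same stream."""
    rng = random.Random(seed)
    return [simulate_mean(rng) for _ in range(steps)]

def run_study_reseeding(seed, steps=3):
    """Re-seed before each call: every step sees the identical draws."""
    return [simulate_mean(random.Random(seed)) for _ in range(steps)]

once = run_study(42)
print(once == run_study(42))              # compare two runs that share a seed
print(len(set(once)))                     # distinct step results when seeding once
print(len(set(run_study_reseeding(42))))  # distinct step results when re-seeding
```

Seeding once keeps the whole run repeatable while still giving each step fresh draws. Re-seeding before each call makes the steps exact copies of one another, which is only useful in the debugging scenario described above.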

Disclaimer: if you are simulating nuclear reactors, missile control systems, or weather forecasts, it is best to consult domain experts; I take no responsibility in that case.
