I originally provided what I now believe is a sub-optimal answer; I have therefore edited my answer to start with a better suggestion.
Using the vine method
In this thread: How to efficiently generate random positive-semidefinite correlation matrices? -- I described and provided the code for two efficient algorithms for generating random correlation matrices. Both come from a paper by Lewandowski, Kurowicka, and Joe (2009).
Please see my answer there for many figures and the Matlab code. Here I would only like to say that the vine method allows one to generate random correlation matrices with any distribution of partial correlations (note the word "partial") and can be used to generate correlation matrices with large off-diagonal values. Here is the relevant figure from that thread:
The only thing that changes between subplots is one parameter that controls how concentrated the distribution of partial correlations is around $\pm 1$. As the OP was asking for an approximately normal distribution of the off-diagonal elements, here is the plot with histograms of the off-diagonal elements (for the same matrices as above):
I think these distributions are reasonably "normal", and one can see how the standard deviation gradually increases. I should add that the algorithm is very fast. See the linked thread for the details.
My original answer
A straightforward modification of your method might do the trick (depending on how close you want the distribution to be to normal). This answer was inspired by @cardinal's comments above and by @psarka's answer to my own question How to generate a large full-rank random correlation matrix with some strong correlations present?
The trick is to make the samples of your $\mathbf X$ correlated (not the features, but the samples). Here is an example: I generate a random matrix $\mathbf X$ of size $1000 \times 100$ (all elements from the standard normal distribution) and then add a random number from $[-a/2, a/2]$ to each column, for $a=0,1,2,5$. For $a=0$ the correlation matrix $\mathbf X^\top \mathbf X$ (after standardizing the features) will have off-diagonal elements approximately normally distributed with standard deviation $1/\sqrt{1000}$. For $a>0$, I compute the correlation matrix without centering the variables (this preserves the inserted correlations), and the standard deviation of the off-diagonal elements grows with $a$, as shown in this figure (rows correspond to $a=0,1,2,5$):
All these matrices are of course positive definite. Here is the Matlab code:
offsets = [0 1 2 5];
n = 1000;
p = 100;

rng(42)  %// random seed
figure
for offset = 1:length(offsets)
    X = randn(n,p);
    for i = 1:p
        X(:,i) = X(:,i) + (rand-0.5) * offsets(offset);
    end

    C = 1/(n-1) * transpose(X)*X;  %// covariance matrix (non-centred!)

    %// convert to correlation
    d = diag(C);
    C = diag(1./sqrt(d)) * C * diag(1./sqrt(d));

    %// display C
    subplot(length(offsets), 3, (offset-1)*3+1)
    imagesc(C, [-1 1])

    %// histogram of the off-diagonal elements
    subplot(length(offsets), 3, (offset-1)*3+2)
    offd = C(~eye(size(C)));  %// off-diagonal elements
    hist(offd)
    xlim([-1 1])

    %// QQ-plot to check normality
    subplot(length(offsets), 3, (offset-1)*3+3)
    qqplot(offd)

    %// eigenvalues
    eigv = eig(C);
    display([num2str(min(eigv),2) ' ... ' num2str(max(eigv),2)])
end
The output of this code (minimum and maximum eigenvalues) is:
0.51 ... 1.7
0.44 ... 8.6
0.32 ... 22
0.1 ... 48
In general, to make your sample mean and variance exactly equal to pre-specified values, you can appropriately shift and scale the variable. Specifically, if $X_1, X_2, \ldots, X_n$ is a sample, then the new variables
$$ Z_i = \sqrt{c_{1}} \left( \frac{X_i-\overline{X}}{s_{X}} \right) + c_{2} $$
where $\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the sample mean and $ s^{2}_{X} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2$ is the sample variance, are such that the sample mean of the $Z_{i}$'s is exactly $c_2$ and their sample variance is exactly $c_1$.
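As a quick Python/numpy sketch of this transformation (not from the original answer; the starting sample and the targets $c_1$, $c_2$ are arbitrary, and the $n-1$ variance denominator from above is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)          # arbitrary starting sample

c1, c2 = 4.0, 10.0               # target sample variance and sample mean

# Standardize (sample mean 0, sample variance 1 with ddof=1),
# then rescale and shift to hit the targets exactly.
z = np.sqrt(c1) * (x - x.mean()) / x.std(ddof=1) + c2

print(z.mean())       # exactly c2, up to floating point
print(z.var(ddof=1))  # exactly c1, up to floating point
```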
A similar construction can restrict the range:
$$ B_i = a + (b-a) \left( \frac{ X_i - \min (\{X_1, ..., X_n\}) }{\max (\{X_1, ..., X_n\}) - \min (\{X_1, ..., X_n\}) } \right) $$
will produce a data set $B_1, \ldots, B_n$ that is restricted to the interval $[a,b]$ (the sample minimum maps exactly to $a$ and the sample maximum to $b$).
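The same map in a Python/numpy sketch (not from the original answer; the sample and the target interval are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)        # arbitrary starting sample
a, b = 2.0, 5.0                # hypothetical target interval [a, b]

# Affine map sending min(x) to a and max(x) to b
B = a + (b - a) * (x - x.min()) / (x.max() - x.min())
print(B.min(), B.max())        # exactly a and b
```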
Note: These types of shifting/scaling will, in general, change the distributional family of the data, even if the original data comes from a location-scale family.
Within the context of the normal distribution, the mvrnorm function in R (in the MASS package) allows you to simulate normal (or multivariate normal) data with a pre-specified sample mean/covariance by setting empirical=TRUE. Specifically, this function simulates data from the conditional distribution of a normally distributed variable, given that the sample mean and (co)variance equal pre-specified values. Note that the resulting marginal distributions are not normal, as pointed out by @whuber in a comment to the main question.
Here is a simple univariate example where the sample mean (from a sample of $n=4$) is constrained to be 0 and the sample standard deviation is 1. We can see that the first element is far more similar to a uniform distribution than a normal distribution:
library(MASS)
z = rep(0, 10000)
for (i in 1:10000) {
    x = mvrnorm(n = 4, rep(0,1), 1, tol = 1e-6, empirical = TRUE)
    z[i] = x[1]
}
hist(z, col = "blue")
Best Answer
If you want the distribution on the range min to max and with a given population mean:
One common solution when trying to generate a distribution with a specified mean and endpoints is to use a shifted and scaled (location-scale transformed) beta distribution.
The usual beta distribution is on the range $(0,1)$ and has two parameters, $\alpha$ and $\beta$; its mean is $\frac{\alpha}{\alpha+\beta}$. If you multiply by $\text{max}-\text{min}$ and add $\text{min}$, you have something between $\text{min}$ and $\text{max}$ with mean $\text{min}+\frac{\alpha}{\alpha+\beta}(\text{max}-\text{min})$.
This suggests you should take
$\beta/\alpha = \frac{\text{max} - \text{mean}}{\text{mean}-\text{min}}$
or, equivalently,
$\alpha/\beta = \frac{\text{mean} - \text{min}}{\text{max}-\text{mean}}$
This leaves you with a free parameter: you can choose $\alpha$ or $\beta$ freely, and the other is then determined. You could choose the smaller of the two to be 1, or choose it to satisfy some other condition if you have one (such as a specified standard deviation). Larger $\alpha$ and $\beta$ make the distribution look more 'bell-shaped'.
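A Python/numpy sketch of this recipe (not from the original answer; the min/max/mean targets are hypothetical example values, and the smaller parameter is fixed at 1 as suggested above):

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi, mean = 10.0, 60.0, 30.0     # hypothetical targets

# alpha/beta = (mean - min) / (max - mean); fix the smaller parameter at 1
ratio = (mean - lo) / (hi - mean)   # alpha/beta = 2/3 here
if ratio < 1:
    alpha, beta = 1.0, 1.0 / ratio  # alpha = 1, beta = 1.5
else:
    alpha, beta = ratio, 1.0

# Scale and shift beta draws onto [lo, hi]
x = lo + (hi - lo) * rng.beta(alpha, beta, size=100_000)
print(x.mean())   # close to the target mean of 30 (population mean is exact)
```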
In Minitab: Calc $\to$ Random Data $\to$ Beta.
Alternatively, you could generate from a triangular distribution rather than a beta distribution. (Or any number of other choices!)
The triangular distribution is usually defined in terms of its min, max and mode, and its mean is the average of the min, max and mode. The triangular distribution is reasonably easy to generate from even if you don't have specialized routines for it. To get the mode from a given mean, use mode = 3$\,\times\,$mean - min - max. However, the mean is restricted to lie in the middle third of the range (which is easy to see from the fact that the mean is the average of the mode and the two endpoints).
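The triangular version can be sketched in Python/numpy (not from the original answer; the min/max/mean values are the same hypothetical targets, and the mean must lie in the middle third of the range as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi, mean = 10.0, 60.0, 30.0   # hypothetical targets; mean is in [26.7, 43.3]

# mode = 3*mean - min - max, so that (min + mode + max)/3 equals the target mean
mode = 3 * mean - lo - hi          # = 20 here
x = rng.triangular(lo, mode, hi, size=100_000)
print(x.mean())   # close to the target mean of 30 (population mean is exact)
```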
Below is a plot of the density functions for a beta (specifically, $\text{beta}(2,3)$) and a triangular distribution, both with mean 40% of the way between the min and the max:
On the other hand, if you want the sample to have a smallest value of exactly min, a largest value of exactly max, and a given sample mean, that's quite a different exercise. There are easy ways to do that, though some of them may look a bit odd.
One simple method is as follows. Let $p=\frac{\text{mean} - \text{min}}{\text{max}-\text{min}}$. Place $b=\lfloor p(n-1)\rfloor$ points at 1, and $n-1-b$ points at 0, giving an average of $b/(n-1)$ and a sum of $b$. To get the right average, we need the sum to be $np$, so we place the remaining point at $np-b$, and then multiply all the observations by ${\text{max}-\text{min}}$ and add $\text{min}$.
e.g. consider $n$ = 12, min = 10, max = 60, mean = 30, so $p$ = 0.4, and $b$ = 4. With seven (12-1-4) points at 0 and four at 1, the sum is 4. If we place the remaining point at 12$\,\times\,$0.4$\,$-$\,$4 = 0.8, the average is 0.4 ($p$). We then multiply all the values by ${\text{max}-\text{min}}$ (50) and add $\text{min}$ (10) giving a mean of 30. Then randomly sample the whole set of $n$ without replacement, (or equivalently, just randomly order them). You now have a random sample with the required mean and extremes, albeit one from a discrete distribution.
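The worked example above can be sketched in Python/numpy as follows (not from the original answer; the helper function name is my own, and the sketch assumes the implied "remaining point" $np-b$ lands inside $[0,1]$, which is not guaranteed for every mean):

```python
import numpy as np

def sample_with_exact_mean_range(n, lo, hi, mean, rng=None):
    """Sketch of the construction above: exact min, max and sample mean."""
    if rng is None:
        rng = np.random.default_rng()
    p = (mean - lo) / (hi - lo)
    b = int(np.floor(p * (n - 1)))             # number of points placed at 1
    pts = np.concatenate([np.ones(b),          # b points at 1
                          np.zeros(n - 1 - b), # n-1-b points at 0
                          [n * p - b]])        # remaining point fixes the sum at n*p
    vals = lo + (hi - lo) * pts                # rescale onto [lo, hi]
    rng.shuffle(vals)                          # random order
    return vals

x = sample_with_exact_mean_range(12, 10, 60, 30, np.random.default_rng(0))
print(x.min(), x.max(), x.mean())   # min 10, max 60, mean 30 (up to floating point)
```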