Solved – Build a (normal?) distribution from $n$, quartiles and mean


I have some data that is described by $n$, the quartiles (plus one additional quantile) and the mean. Is it possible to rebuild or model this distribution from these statistics? As the median and the mean are not the same, there is at least some skew, but otherwise I would assume the data to be approximately normal.

Edit: This was marked as a duplicate, but none of the other questions I found while searching included the mean as a data point for recreating the distribution. Because of that additional parameter, I wondered whether it made the estimation possible. In short, the effect of having the mean was not apparent from the other answers related to the question.

Best Answer

The answer is no, at least not exactly.

If you have two quartiles of a normal population then you can find $\mu$ and $\sigma.$ For example, the lower and upper quartiles of $\mathsf{Norm}(\mu = 100,\, \sigma = 10)$ are $93.255$ and $106.745,$ respectively.

 qnorm(c(.25, .75), 100, 10)
 [1]  93.2551 106.7449

Then $P\left(\frac{X-\mu}{\sigma} < -0.6745\right) = 0.25$ and $P\left(\frac{X-\mu}{\sigma} < 0.6745\right) = 0.75$ provide two equations that can be solved to find $\mu$ and $\sigma.$

qnorm(c(.25,.75))
[1] -0.6744898  0.6744898
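Written out, with $q_1$ and $q_3$ denoting the lower and upper population quartiles (symbols introduced here only for illustration), the two equations and their solution are:

$$
\begin{aligned}
q_1 &= \mu - 0.6745\,\sigma, & q_3 &= \mu + 0.6745\,\sigma,\\
\mu &= \frac{q_1 + q_3}{2}, & \sigma &= \frac{q_3 - q_1}{2(0.6745)}.
\end{aligned}
$$

For the example above, $\mu = (93.255 + 106.745)/2 = 100$ and $\sigma = (106.745 - 93.255)/1.349 = 10.$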

However, sample quartiles are not population quartiles. No finite sample from a normal population contains enough information to determine $\mu$ and $\sigma$ precisely; the sample quartiles only give estimates.

And you are not really sure your sample is from a normal population. If the population has mean $\mu$ and median $\eta,$ then the sample mean and median, respectively, are estimates of these two parameters. If the population is symmetrical, then $\mu = \eta,$ but you say the sample mean and median do not agree. So you cannot be sure the population is symmetrical, much less normal.

Related Solutions

Solved – Dividing and forecasting a normal distribution

The straight answer to Q1 is "yes": it is definitely possible to cut up an underlying normally distributed continuous variable into an ordinal variable with levels 1 to 10. You need a function that returns the cumulative distribution function (CDF) of a normal distribution with a given mean and variance (these two parameters fully characterise a normal distribution). Evaluate it at each bin cutoff and take the differences between successive values, because the CDF itself returns the cumulative probability of a value at or below the cutoff.

I'm sorry I don't use C#, but in R this would be something like the code below. This is a 10-point example in which the normal distribution you think is your underlying latent variable has a mean of 5 and a standard deviation of 2 (the third argument of pnorm is the standard deviation, not the variance), and the bins are minus infinity to 1.5, 1.5 to 2.5, 2.5 to 3.5, ... , 9.5 to infinity.

> options(digits=2)
> x <- pnorm(1:10+0.5, 5, 2)*100
> x[10] <- 100            # otherwise is just 9.5 to 10.5, not infinity
> x                       # ie cumulative prob (in %) to each bin
 [1]   4  11  23  40  60  77  89  96  99 100    
> c(x[1], diff(x))        # differences between the cumulative probs
 [1]  4.0  6.6 12.1 17.5 19.7 17.5 12.1  6.6  2.8  1.2
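For readers who are not using R, the same calculation can be sketched in Java. This is only an illustrative sketch: the class and method names are made up for this example, and because the Java standard library has no normal CDF, it assumes the Abramowitz-Stegun 7.1.26 approximation to erf (absolute error around 1.5e-7) is accurate enough.

```java
/** Sketch: probabilities of ordinal bins cut from a normal latent variable. */
final class NormalBinsSketch {

    /** erf(x) via the Abramowitz-Stegun 7.1.26 approximation (assumed accurate enough here). */
    static double erf(double x) {
        final double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        final double poly = t * (0.254829592 + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        final double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;   // erf is an odd function
    }

    /** Normal CDF with the given mean and standard deviation (not variance). */
    static double normalCdf(double x, double mean, double sd) {
        return 0.5 * (1.0 + erf((x - mean) / (sd * Math.sqrt(2.0))));
    }

    /**
     * Probability of each bin. The cutoffs are the interior bin edges in ascending order;
     * the first bin starts at minus infinity and the last bin ends at plus infinity.
     */
    static double[] binProbabilities(double[] cutoffs, double mean, double sd) {
        final double[] p = new double[cutoffs.length + 1];
        double previous = 0.0;                        // CDF at minus infinity
        for (int i = 0; i < cutoffs.length; i++) {
            final double cdf = normalCdf(cutoffs[i], mean, sd);
            p[i] = cdf - previous;                    // difference of successive CDF values
            previous = cdf;
        }
        p[cutoffs.length] = 1.0 - previous;           // open-ended top bin
        return p;
    }

    public static void main(String[] args) {
        final double[] cutoffs = {1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5};
        final double[] p = binProbabilities(cutoffs, 5.0, 2.0);
        for (int i = 0; i < p.length; i++) {
            System.out.printf("level %2d: %5.1f%%%n", i + 1, 100.0 * p[i]);
        }
    }
}
```

Passing the interior cutoffs only, and treating the two outer bins as open-ended, avoids the manual overwrite of the last cumulative value that the R snippet needs.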

Subsequently, the straight answer to Q2 is also "yes": there are definitely such methods, but they should be used with caution, and it is difficult to summarise here all the pros and cons of the different ways of doing this.

It's also worth knowing that there are other methods for analysing this sort of ordinal data.

Solved – Robust parameter estimation for shifted log normal distribution

In case anyone is still interested, I have managed to implement Aristizabal's formulae in Java. This is more proof-of-concept than the requested "robust" code, but it is a starting point.

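// Imports required by the methods below:
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/**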
 * Computes the point estimate of the shift offset (gamma) from the given sample. The sample array will be sorted by this method.<p>
 * Cf. Aristizabal section 2.2 ff.
 * @param sample {@code double[]}, will be sorted
 * @return gamma point estimate
 */
public static double pointEstimateOfGammaFromSample(double[] sample) {
    Arrays.sort(sample);
    DoubleUnaryOperator func = x->calculatePivotalOfSortedSample(sample, x)-1.0;
    double upperLimit = sample[0];
    double lowerLimit = 0;
    double gamma = bisect(func, lowerLimit, upperLimit);
    return gamma;
}

/**
 * Cf. Aristizabal's equation (2.3.1)
 * @param sample {@code double[]}, should be sorted in ascending order
 * @param gamma shift offset
 * @return pivotal value of sample
 */
private static double calculatePivotalOfSortedSample(final double[] sample, double gamma) {
    final int n=sample.length;
    final int n3=n/3;
    final double mid = avg(sample, gamma, n3+1, n-n3);
    final double low = avg(sample, gamma, 1, n3);
    final double upp = avg(sample, gamma, n-n3+1, n);
    final double result = (mid-low)/(upp-mid);
    return result;
}

/**
 * Computes average of sample values from {@code sample[l-1]} to {@code sample[u-1]}.
 * @param sample {@code double[]}, should be sorted in ascending order
 * @param gamma shift offset
 * @param l lower limit
 * @param u upper limit
 * @return average
 */
private static double avg(double[] sample, double gamma, int l, int u) {
    double sum = 0.0;
    for (int i=l-1;i<u;sum+=Math.log(sample[i++]-gamma));
    final int n = u-l+1;
    return sum/n;
}

/**
 * Naive bisection implementation. Should always complete if the given values actually straddle the root.
 * Will call {@link #secant(DoubleUnaryOperator, double, double)} if they do not, in which case the
 * call may not complete.
 * @param func Function whose root is sought
 * @param lowerLimit Some value for which the given function evaluates < 0
 * @param upperLimit Some value for which the given function evaluates > 0
 * @return x value, somewhere between the lower and upper limits, which evaluates close enough to zero
 */
private static double bisect(DoubleUnaryOperator func, double lowerLimit, double upperLimit) {
    final double eps = 0.000001;
    double low=lowerLimit;
    double valAtLow = func.applyAsDouble(low);
    double upp=upperLimit;
    double valAtUpp = func.applyAsDouble(upp);
    if (valAtLow*valAtUpp>0) {
        // Switch to secant method
        return secant(func, lowerLimit, upperLimit);
    }
    System.out.printf("bisect %f@%f -- %f@%f%n", valAtLow, low, valAtUpp, upp);
    double mid;
    while(true) {
        mid = (upp+low)/2;
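        // Mixed absolute/relative tolerance; avoids dividing by a zero or negative lower bound
        if (Math.abs(upp-low)<eps*Math.max(1.0, Math.abs(low)))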
            break;
        double val = func.applyAsDouble(mid);
        if (Math.abs(val)<eps)
            break;
        if (val<0)
            low=mid;
        else
            upp=mid;
    }
    return mid;
}

/**
 * Naive secant root solver implementation. May not complete if root not found.
 * @param f Function whose root is sought
 * @param a First starting point
 * @param b Second starting point
 * @return x value which evaluates close enough to zero
 */
static double secant(final DoubleUnaryOperator f, double a, double b) {
    double fa = f.applyAsDouble(a);
    if (fa==0)
        return a;
    double fb = f.applyAsDouble(b);
    if (fb==0)
        return b;
    System.out.printf("secant %f@%f -- %f@%f%n", fa, a, fb, b);
    if (fa*fb<0) {
        return bisect(f, a, b);
    }
    while ( Math.abs(b-a) > Math.abs(0.00001*a) ) {
          final double m = (a+b)/2;
          final double k = (fb-fa)/(b-a);
          final double fm = f.applyAsDouble(m);
          final double x = m-fm/k;
          if (Math.abs(fa)<Math.abs(fb)) {
              // |f(a)|<|f(b)|; keep a and replace b with x
              b=x;
              fb=f.applyAsDouble(b);
          } else {
              // |f(a)|>=|f(b)|; keep b and replace a with x
              a=x;
              fa=f.applyAsDouble(a);
          }
          if (fa==0)
              return a;
          if (fb==0)
              return b;
          if (fa*fb<0) {
              // Straddling root; switch to bisect method
              return bisect(f, a, b);
          }
      }
    return (a+b)/2;

}
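To see the idea at work without the rest of the class, here is a minimal self-contained sketch. It is not the code above: it uses the same pivotal (the ratio of log-mean differences between the thirds of the sorted sample) but a plain fixed-iteration bisection, and all names in it are hypothetical. The test data are an assumption made for illustration: deterministic values that are evenly spaced and symmetric on the log scale after subtracting the shift, so the pivotal equals 1 at the true shift. They are not a random log-normal sample.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Sketch: recover the shift of a shifted sample from a symmetry-based pivotal. */
final class GammaSketch {

    /** Mean of log(sorted[i] - gamma) for indices from (inclusive) to to (exclusive). */
    static double meanLog(double[] sorted, double gamma, int from, int to) {
        double sum = 0.0;
        for (int i = from; i < to; i++) {
            sum += Math.log(sorted[i] - gamma);
        }
        return sum / (to - from);
    }

    /** Ratio (mid - low) / (upp - mid) of the log-means of the three thirds of the sample. */
    static double pivotal(double[] sorted, double gamma) {
        final int n = sorted.length;
        final int n3 = n / 3;
        final double low = meanLog(sorted, gamma, 0, n3);
        final double mid = meanLog(sorted, gamma, n3, n - n3);
        final double upp = meanLog(sorted, gamma, n - n3, n);
        return (mid - low) / (upp - mid);
    }

    /** Bisection on pivotal(gamma) - 1 over the interval [lower, min(sample)]. */
    static double estimateGamma(double[] sample, double lower) {
        final double[] sorted = sample.clone();    // leave the caller's array untouched
        Arrays.sort(sorted);
        final DoubleUnaryOperator f = g -> pivotal(sorted, g) - 1.0;
        double lo = lower;
        double hi = sorted[0];                     // f is +Infinity here, since log(0) = -Infinity
        if (!(f.applyAsDouble(lo) < 0.0)) {
            throw new IllegalArgumentException("lower bound does not bracket the root");
        }
        for (int i = 0; i < 200; i++) {            // fixed number of halvings
            final double mid = 0.5 * (lo + hi);
            if (f.applyAsDouble(mid) < 0.0) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    /** Hypothetical test data: gamma + exp(mu + sigma*z), z evenly spaced on [-1, 1]; needs n >= 3. */
    static double[] symmetricSample(double gamma, double mu, double sigma, int n) {
        final double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            final double z = -1.0 + 2.0 * i / (n - 1);
            x[i] = gamma + Math.exp(mu + sigma * z);
        }
        return x;
    }

    public static void main(String[] args) {
        final double[] sample = symmetricSample(10.0, 0.0, 1.0, 9);
        System.out.println(estimateGamma(sample, 0.0));   // should be close to the true shift of 10
    }
}
```

With real data the pivotal only equals 1 approximately at the true shift, so the estimate carries sampling error that this toy example does not show.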
