How to Force a Set of Numbers to a Gaussian Bell-Curve

Tags: algorithms, normal-distribution

(This relates to my programming question on Stack Overflow: Bell Curve Gaussian Algorithm (Python and/or C#).)

On Answers.com, I found this simple example:

  1. Find the arithmetic mean (average)
    => Sum of all values in the set, divided by the number of elements in the set
  2. Find the sum of the squares of all values in the set
  3. Divide the output of (2) by the number of elements in the set
  4. Subtract the square of the mean (1) from the output of (3)
  5. Take the square root of the outcome of (4)

Example: Set A={1,3,4,5,7}

  1. (1+3+4+5+7)/5 = 4
  2. (1*1+3*3+4*4+5*5+7*7)=1+9+16+25+49 = 100
  3. 100 / 5 = 20
  4. 20 - 4*4 = 20 - 16 = 4
  5. SQRT(4) = 2

(This comes from a post on wiki.answers.com.)
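The five steps above can be sketched directly in Python (a hypothetical helper; note that this computes the *population* standard deviation, dividing by $n$ rather than $n-1$):

```python
def mean_and_sd(values):
    """Steps 1-5 above: arithmetic mean and population standard deviation."""
    n = len(values)
    mean = sum(values) / n                # step 1: arithmetic mean
    sum_sq = sum(v * v for v in values)   # step 2: sum of squares
    variance = sum_sq / n - mean ** 2     # steps 3-4
    return mean, variance ** 0.5          # step 5: square root

mean, sd = mean_and_sd([1, 3, 4, 5, 7])  # -> (4.0, 2.0)
```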

Now, given all that, how can I fit the above data to a bell curve (such as a credit score) ranging from 200 to 800? Obviously the number 5 in the above set would be 500. But then what is the formula for determining what 3 should be on the same scale? Even though the original set A={1,3,4,5,7} is not a bell curve, I want to force it into one.

Imagine these are scores of 5 people. Next month the scores might change as follows: Set A2={1,2,4,5,9} (one guy loses a point, and the top guy gains two more points; the rich get richer and the poor get poorer). Then perhaps a new guy comes into the set: Set A3={1,2,4,5,8,9}.

Best Answer

A scaled range, like 200 to 800 (for SATs, e.g.), is just a change of units of measurement. (It works exactly like changing temperatures in Fahrenheit to those in Celsius.)

The middle value of 500 is intended to correspond to the average of the data. The range is intended to correspond to about 99.7% of the data when the data do follow a Normal distribution ("Bell curve"). It is guaranteed to include 8/9 of the data (Chebyshev's Inequality).

In this case, steps 1-5 above compute the standard deviation of the data. This is simply a new unit of measurement for the original data, and it needs to correspond to 100 units on the new scale. Therefore, to convert an original value to the scaled value:

  • Subtract the average.

  • Divide by the standard deviation.

  • Multiply by 100.

  • Add 500.
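In symbols, these four steps amount to

$$\text{score} = 500 + 100 \cdot \frac{x - \bar{x}}{s},$$

where $\bar{x}$ is the mean of the data and $s$ is the standard deviation computed in steps 1-5.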

If the result lies beyond the range $[200, 800]$, you can either use it as-is or "clamp" it to the range by replacing values below 200 with 200 and values above 800 with 800.

In the example, using data $\{1,3,4,5,7\}$, the average is $4$ and the SD is $2$. Therefore, upon rescaling, $1$ becomes $(1 - 4)/2 * 100 + 500 = 350$. The entire rescaled dataset, computed similarly, is $\{350, 450, 500, 550, 650\}$.
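A minimal sketch of this linear rescaling in Python (a hypothetical `rescale` helper, with the optional clamping folded in):

```python
def rescale(values, lo=200, hi=800):
    """Rescale so the mean maps to 500 and one SD to 100 points,
    then clamp the results to [lo, hi]."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum(v * v for v in values) / n - mean ** 2) ** 0.5
    return [max(lo, min(hi, (v - mean) / sd * 100 + 500)) for v in values]

rescale([1, 3, 4, 5, 7])  # -> [350.0, 450.0, 500.0, 550.0, 650.0]
```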

When the original data are distributed in a distinctly non-normal way, you need another approach. You no longer compute an average or SD. Instead, put all the scores in order, from 1st (smallest) up to $n$th (largest). These are their ranks. Convert any rank $i$ into its percentage $(i-1/2)/n$. (In the example, $n=5$ and the data are already in rank order $i=1,2,3,4,5$. Therefore their percentages are $1/10, 3/10, 5/10, 7/10, 9/10$, often written equivalently as $10\%, 30\%$, etc.)

Corresponding to any percentage (necessarily between $0$ and $1$) is a normal quantile. It is computed with the normal quantile function, which is closely related to the error function. (Simple numerical approximations are straightforward to code.) Its values, which typically lie between $-3$ and $3$, have to be rescaled (just as before) to the range $[200, 800]$: first multiply the normal quantile by 100, then add 500.

The normal quantile function is available in many computing platforms, including spreadsheets (Excel's NORMSINV, for instance). For example, the normal quantiles (or "normal scores") for the data $\{1,3,4,5,7\}$ are $\{372, 448, 500, 552, 628\}$.
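The same normal scoring can be sketched in Python, where the standard normal quantile function is available as `statistics.NormalDist().inv_cdf` (Python 3.8+). This hypothetical helper assumes no tied values (ties would need a tie-breaking rule):

```python
from statistics import NormalDist

def normal_scores(values, center=500, spread=100):
    """Rank-based normal scoring: rank i -> percentage (i - 1/2)/n,
    mapped through the standard normal quantile function."""
    n = len(values)
    order = sorted(values)
    # order.index(v) gives the 0-based rank of v (first match if tied)
    return [center + spread * NormalDist().inv_cdf((order.index(v) + 0.5) / n)
            for v in values]

[round(s) for s in normal_scores([1, 3, 4, 5, 7])]  # -> [372, 448, 500, 552, 628]
```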

This "normal scoring" approach will always give scores between 200 and 800 when you have 370 or fewer values. When you have 1111 or fewer values, all but the highest and lowest will have scores between 200 and 800.
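These cutoffs can be checked numerically: the smallest of $n$ values gets percentage $(1/2)/n$, which stays at or above $\Phi(-3) \approx 0.00135$ (and so yields a score of at least 200) exactly when $n \le 370$. A quick sketch:

```python
from statistics import NormalDist

def lowest_score(n):
    """Normal score of the smallest of n values, at percentage (1 - 1/2)/n."""
    return 500 + 100 * NormalDist().inv_cdf(0.5 / n)

print(lowest_score(370))  # just above 200
print(lowest_score(371))  # just below 200
```

The 1111-value cutoff follows the same way from the second-smallest percentage, $(3/2)/n$.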