Solved – Process for Standardising and Normalising data

normalization, smoothing, standard deviation, standardization

This is a problem I've previously solved very naively. I'm looking to apply more standard statistical theory to it in an attempt to get more accurate results. Any help with validating (or rejecting) my approach, and any methods I could use, would be appreciated (even just the correct terms would help).

The problem is a commerce-related one: I have a collection of products (say 100) which have a variety of attributes. For this example I'll pick 3 of them:

  1. Discount Percentage (Current Price/RRP) * 100
  2. Views (Number of times viewed)
  3. Stock Level

Using these attributes I want to be able to configure various sorting rules that place the products into an order. The sorting rules are where the 'configuration' happens, but the statistical part of my problem comes before that.

In order to compare these attributes of products properly I need to:

Standardize & Normalize.

The values and distributions of these attributes are:

  1. Discount Percentage: 0 to 100, distributed roughly as 0 (50% of products), 20 (25%), 30 (10%) and 50 (15%)
  2. Views: 0 to 30000 with a fairly even distribution
  3. Stock Level: 0 to 40 with a fairly even distribution (but different from the distribution for views)
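
For concreteness, the later snippets can be tried against a toy data set simulated in R to roughly match these distributions (the product count, column names and seed below are purely illustrative, not taken from my real data):

```
set.seed(42)
n <- 100

products <- data.frame(
  # roughly 50% at 0, 25% at 20, 10% at 30, 15% at 50
  discount = sample(c(0, 20, 30, 50), n, replace = TRUE,
                    prob = c(0.50, 0.25, 0.10, 0.15)),
  views = round(runif(n, min = 0, max = 30000)),  # fairly even over 0 to 30000
  stock = round(runif(n, min = 0, max = 40))      # fairly even over 0 to 40
)

summary(products)
```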

I take the entire population for each attribute and, using the max and min values, 'reset' each value to place it on a scale between 0 and 10, where the maximum value maps to 0 and the minimum value maps to 10. For each member I then work out whether it falls in the 0-10th percentile, the 10-20th percentile, and so on, and give it a score from 1 to 10.
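
In R terms, the naive procedure looks something like this (a rough sketch of my own description; the direction of the decile score is just one reading of the step above, and `x` stands for any one attribute's values):

```
# Reversed min-max rescale: the maximum value maps to 0, the minimum to 10
minmax_reversed <- function(x) {
  10 * (max(x) - x) / (max(x) - min(x))
}

# Percentile-band score: band 1 for the lowest 10% of values, up to band 10 for the top 10%
decile_score <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
  # unique() guards against duplicated breaks when many values tie (e.g. lots of 0% discounts)
  cut(x, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

minmax_reversed(c(0, 20, 30, 50))   # 10  6  4  0
decile_score(c(0, 20, 30, 50))
```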

However, I feel this is a very naive solution, as it doesn't take the distribution into account. For some attributes I get 99 members of the population with a score of 10 and 1 member with a score of 0. I'd like a more even distribution, where outliers cannot change the end result so drastically (is this called smoothing?). I could do this after the step above, but I think too much data has been destroyed by that point to do a good job. Is there a technique for doing both these steps at the same time (and does it have a name)?
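
To make the bunching concrete, here is a hypothetical skewed attribute run through the reversed min-max rescaling:

```
set.seed(1)
x <- c(runif(99, min = 0, max = 1000), 30000)  # 99 ordinary values plus one extreme outlier

scaled <- 10 * (max(x) - x) / (max(x) - min(x))
table(round(scaled))  # nearly everything lands at 10; only the outlier gets 0
```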

Am I right in thinking that the normalization method for each attribute might end up being different? Should I just use trial and error on my data until I get something I deem satisfactory, or can I consistently achieve a more equitable distribution without having to manually tweak the process? I'd like the final scores to range between 0 and 1 (rather than 0 and 10).

I will probably end up using R to help me with this, as by all accounts it seems to be the best tool for the job. I'm not sure whether it's overkill for what I see as a fairly simple problem, but the complexity involved in the 'correct' normalisation process may be beyond what I can easily do in my scripting language. Before I jump in with that, though, I'd like to validate my assumptions/theory.

Weighting

Once the above steps have taken place, I allow domain knowledge experts to configure the weighting of the various attributes in order to determine the sort order. For example, they may want to say that the number of views is 3 times more important than the stock level, in which case I'll multiply the normalised 'views' figure by 3 and then add it to the others before sorting the entire population. I've read that this knowledge can sometimes be incorporated into the standardisation process itself, so that the figures already reflect their importance relative to the other attributes. I don't want to do this.
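
As a sketch of that weighted combination in R, assuming each attribute has already been reduced to a 0-1 score (the score values and weights below are made up for illustration):

```
# Hypothetical normalised scores, one row per product, each column already in [0, 1]
scores <- data.frame(discount = runif(100),
                     views    = runif(100),
                     stock    = runif(100))

# Weights supplied by the domain experts, e.g. views counts 3x as much as stock
weights <- c(discount = 1, views = 3, stock = 1)

combined   <- as.matrix(scores) %*% weights       # weighted sum per product
sort_order <- order(combined, decreasing = TRUE)  # highest combined score first
```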

In the example above I've talked about a population of 100. My real population is between 1000 and 50000, but I still think this is small enough to be able to work with the whole set of data rather than taking a sample.

Best Answer

This doesn't really seem to be a statistical question - & note "standardize" & "normalize" have specialized meanings in statistics. Anyway, why not just order the $n$ observations for each attribute from $1$ to $n$ & score each with $s(x_{(i)})=\left(\frac{i-1}{n-1}\right)$? If you wanted to map them to standard Normal variates you could use $s(x_{(i)})=\Phi^{-1}\left(\frac{i-0.3}{n+0.4}\right)$, where $\Phi$ is the Normal cumulative distribution function - but then they can range all the way from $-\infty$ to $+\infty$.

"Standardize" usually means subtracting the sample mean from each observation & then dividing by the sample standard deviation, but that doesn't stop standardized values from bunching together at high or low values, & it certainly doesn't constrain them to lie between $0$ and $1$.

"Normalize" I thought meant multiplying the kernel of a probability distribution by a constant to make it integrate or sum to unity; but see here. Basically it will give you values between $0$ & $1$, but it won't stop the bunching - very similar to the first method you tried.
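
In R, for instance, the two suggested scores might look like this (just a sketch; tied values are handled here with rank()'s averaging, which is one choice among several):

```
# s(x_(i)) = (i - 1) / (n - 1): rank-based score in [0, 1]
rank_score <- function(x) {
  (rank(x, ties.method = "average") - 1) / (length(x) - 1)
}

# Phi^{-1}((i - 0.3) / (n + 0.4)): normal scores, unbounded in both directions
normal_score <- function(x) {
  qnorm((rank(x, ties.method = "average") - 0.3) / (length(x) + 0.4))
}

x <- c(runif(99, 0, 1000), 30000)  # skewed example like the one in the question
summary(rank_score(x))             # spread evenly over [0, 1]; the outlier no longer dominates
```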