It is not wise to transform the variables individually, because they belong together (as you noticed), nor to do k-means, because the data are counts (you might, but k-means is better suited to continuous attributes such as length, for example).
In your place, I would compute the chi-square distance (perfect for counts) between every pair of customers, based on the variables containing counts. Then do hierarchical clustering (for example, average linkage or complete linkage - these methods do not compute centroids and therefore don't require Euclidean distance) or some other clustering that works with arbitrary distance matrices.
Copying example data from the question:
customer | count_red | count_blue | count_green
---------+-----------+------------+------------
c0       |        12 |          5 |           0
c1       |         3 |          4 |           0
c2       |         2 |         21 |           0
c3       |         4 |          8 |           1
Consider the pair c0 and c1 and compute the chi-square statistic for their $2 \times 3$ frequency table. Take the square root of it (like you take it when you compute the usual Euclidean distance). That is your distance. If the distance is close to 0, the two customers are similar.
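To make this concrete (using the zero-column rule mentioned further below): for c0 = (12, 5, 0) and c1 = (3, 4, 0), drop the all-zero green column, leaving a $2 \times 2$ table with $N = 24$, row totals 17 and 7, and column totals 15 and 9. The expected counts are 10.625, 6.375, 4.375 and 2.625, so $\chi^2 = \sum (O-E)^2/E \approx 1.627$ and the distance is $\sqrt{1.627} \approx 1.275$, the c0-c1 entry in the first matrix below.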
It may bother you that the row sums in your table differ, so that this affects the chi-square distance when you compare c0 with c1 vs. c0 with c2. In that case, compute the (root of the) Phi-square distance: $\Phi^2 = \chi^2/N$, where $N$ is the combined total count of the two rows (customers) currently considered. It is thus a distance normalized with respect to the overall counts.
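Continuing the c0-c1 example: $N = 24$, so $\Phi^2 = 1.627/24 \approx 0.0678$ and the distance is $\sqrt{0.0678} \approx 0.260$, the corresponding entry in the second matrix below.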
Here is the matrix of sqrt(Chi-sq) distances between your four customers:
       c0     c1     c2     c3
c0   .000  1.275  4.057  2.292
c1  1.275   .000  2.124   .862
c2  4.057  2.124   .000  2.261
c3  2.292   .862  2.261   .000
And here is the matrix of sqrt(Phi-sq) distances:
       c0     c1     c2     c3
c0   .000   .260   .641   .418
c1   .260   .000   .388   .193
c2   .641   .388   .000   .377
c3   .418   .193   .377   .000
So, the distance between any two rows of the data is the (square root of the) chi-square or phi-square statistic of their $2 \times p$ frequency table ($p$ is the number of columns in the data). If any column(s) in the current $2 \times p$ table are entirely zero, drop those columns and compute the distance from the remaining nonzero columns (this is legitimate, and it is what SPSS does, for example, when it computes this distance). The chi-square distance is actually a weighted Euclidean distance.
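For illustration, here is a minimal sketch of the whole procedure in base R. The helper name chisq_dist is mine; chisq.test(..., correct = FALSE) returns the plain Pearson statistic, and its small-expected-count warnings can be ignored here, since the statistic is used as a distance rather than as a test:
# sqrt(Chi-sq) and sqrt(Phi-sq) distance between two rows of counts
chisq_dist <- function(x, y) {
  tab <- rbind(x, y)
  tab <- tab[, colSums(tab) > 0, drop = FALSE]    # drop all-zero columns
  chi <- unname(chisq.test(tab, correct = FALSE)$statistic)
  c(chi = sqrt(chi), phi = sqrt(chi / sum(tab)))  # Phi-sq = Chi-sq / N
}

counts <- rbind(c0 = c(12, 5, 0), c1 = c(3, 4, 0),
                c2 = c(2, 21, 0), c3 = c(4, 8, 1))

# pairwise sqrt(Phi-sq) distances, reproducing the second matrix above
n <- nrow(counts)
D <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in 1:(n - 1)) for (j in (i + 1):n)
  D[i, j] <- D[j, i] <- chisq_dist(counts[i, ], counts[j, ])["phi"]

# hierarchical clustering on the precomputed distances (average linkage)
hc <- hclust(as.dist(D), method = "average")
plot(hc)
Because as.dist hands hclust a ready-made distance matrix, no centroids (and hence no Euclidean assumption) are ever needed.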
Generally you don't make the counts comparable by transforming them; rather, you take account of the different exposures when computing the expected values in the chi-squared test.
Under a null hypothesis of equal event rates (events per hour), the two periods can simply be combined to estimate the rate ... that is $275+129$ events in $120+48$ hours, so we estimate the rate as $\frac{275+129}{120+48}$ events per hour, and the expected count in period 1 is then $(275+129)\frac{120}{120+48}\approx 288.57$ and in period 2 is $(275+129)\frac{48}{120+48}\approx 115.43$.
With those expected values, the chi-square goodness of fit statistic, $\sum_i \frac{(O_i-E_i)^2}{E_i}$ is straightforward to calculate by hand; it has $k-1=1$ degree of freedom in this example. However, it's a pretty standard calculation - for example, here it is in R:
eventcounts  <- c(275, 129)   # observed events in each period
exposuretime <- c(120, 48)    # exposure in hours for each period
# rescale.p = TRUE rescales the exposures to the null proportions 120/168 and 48/168
chisq.test(eventcounts, p = exposuretime, rescale.p = TRUE)
Chi-squared test for given probabilities
data: eventcounts
X-squared = 2.2339, df = 1, p-value = 0.135
which is the same result as doing it by hand.
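The same number can be reproduced by spelling out the hand calculation above:
O <- c(275, 129)                       # observed counts
E <- sum(O) * c(120, 48) / (120 + 48)  # expected counts under equal rates: 288.57, 115.43
sum((O - E)^2 / E)                     # 2.2339, the X-squared value above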
Best Answer
Your transformation to 6 graded categories has already thrown away much of the information you have. What you propose seems a further step without a clear statistical rationale.
At best, variations such as those you report are interesting patterns that you will want to describe and explain. Some model for an ordered response, such as ordinal logit, might be helpful if you insist on using those grades.
But once you have those grades 0 to 5, their means and SDs are of doubtful use, if only because they depend on an arbitrary transformation. (Note that there is no sense in which those grades should be approximately normal, even after transformation, if that is what you are thinking.) So, one false step can't be corrected by another. Put otherwise, why would standardization make those variations easier to understand? Trying to explain differences in standardized values would be difficult, if not impossible, unless you reinserted the means and SDs.
If this were my problem I would use some kind of count model, possibly Poisson regression, to deal directly with the number of individuals as a response. An arbitrary degradation of the data to 6 categories would have no obvious scientific or statistical rationale or interest. I think you would have an uphill task to justify that convincingly in a report. If the counts seem too spiky (several zeros, some relatively high values) to handle easily, then an old-fashioned but still possibly useful method would be some transformation such as square roots.
A fuller answer would need more information on what data you have. At present the picture is of counts at various sites in different kinds of environment. With nothing else said, that points to an ANOVA on transformed counts (old way) or a Poisson or other count model (newer way). In such analysis, the predicted mean counts automatically give you the framework you desire, defining what is typical for an environment and hence what is not.
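To make that concrete, here is a minimal sketch of such a count model in R; the data and variable names (count, env) are made up purely for illustration:
# hypothetical example: counts of individuals at sites in three environments
dat <- data.frame(
  count = c(0, 3, 12, 5, 1, 9, 7, 0, 4),
  env   = factor(rep(c("forest", "urban", "wetland"), each = 3))
)

fit <- glm(count ~ env, family = poisson, data = dat)
summary(fit)

# predicted mean count per environment: the model's notion of what is typical
predict(fit, newdata = data.frame(env = levels(dat$env)), type = "response")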