Data Normalization – How to Standardize or Normalize Count Data Effectively

count-data, normalization, standardization

I have a dataset where I counted the number of individuals of a species in different environments and grouped the counts into categories ranging from 0 to 5 (0 = no occurrence; 5 = very high occurrence). All categories are well defined. I know that the environment has an important influence; however, there is a lot of variation between the different environments. To identify spots with high occurrence, I wanted to standardize my data within every environment so that I can compare environments and identify the samples with abnormally high/low occurrence.

E.g.:
For environment A the occurrence varies from 0 to 5, with 95% in group 0 or 1 and 5% in group 5. For environment B, 95% show occurrence 0 but 5% show occurrence 2. As environment A is very favorable for the species I am observing and B is not, an occurrence of 2 in environment B is already quite high for this species. I am therefore looking for a standardization that, for every environment, transforms the values to have the same mean. I want to find spots that show above-average occurrence for the environment they are in. I therefore need some kind of standardization/normalization for count data.

Can I use a z-score transformation ((x − mean)/SD), or are there more appropriate methods for this?
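For concreteness, here is a minimal sketch of what I have in mind, in Python with pandas (the data and column names are made up):

```python
import pandas as pd

# Hypothetical data: one row per sample, with its environment and graded occurrence (0-5)
df = pd.DataFrame({
    "env":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "grade": [0,   1,   0,   5,   0,   0,   2,   0],
})

# Per-environment z-score: (x - mean) / SD, computed within each environment
df["z"] = df.groupby("env")["grade"].transform(lambda g: (g - g.mean()) / g.std())

# Rank samples by how unusual they are for their own environment
print(df.sort_values("z", ascending=False))
```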

Best Answer

Your transformation to 6 graded categories has already thrown away much of the information you have. What you propose seems a further step without a clear statistical rationale.

At best, variations such as those you report, for example

For environment A the occurrence varies from 0 to 5, with 95% in group 0 or 1 and 5% in group 5. For environment B, 95% show occurrence 0 but 5% show occurrence 2.

are interesting patterns that you want to report and explain. Some model for an ordered response, such as an ordinal logit, might be helpful if you insist on using those grades.
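If you do stay with the grades, here is a minimal sketch of such an ordered-response model, assuming Python with statsmodels' OrderedModel; the data and the environment probabilities are purely illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
# Hypothetical graded data (0-5): environment A is more favorable than B
env = rng.choice(["A", "B"], size=200)
grade = np.where(
    env == "A",
    rng.choice(6, size=200, p=[0.30, 0.30, 0.15, 0.10, 0.10, 0.05]),
    rng.choice(6, size=200, p=[0.70, 0.15, 0.10, 0.03, 0.01, 0.01]),
)
df = pd.DataFrame({"env": env, "grade": grade})

# Ordinal logit: environment as a predictor of the ordered 0-5 response
# (no constant in exog; the model's thresholds play that role)
endog = df["grade"].astype(pd.CategoricalDtype(ordered=True))
exog = pd.get_dummies(df["env"], drop_first=True, dtype=float)
res = OrderedModel(endog, exog, distr="logit").fit(method="bfgs", disp=False)
print(res.summary())
```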

But once you have those grades 0 to 5, their means and SDs are of doubtful use, if only because they depend on an arbitrary transformation. (Note that there is no sense in which those grades should be approximately normal, even after transformation, if that is what you are thinking.) So one false step can't be corrected by another. Put otherwise, why would standardization make those variations easier to understand? Trying to explain differences in standardized values would be difficult or impossible unless you reinserted the means and SDs.

If this were my problem, I would use some kind of count model, possibly Poisson regression, to deal directly with the number of individuals as a response. An arbitrary degradation of the data to 6 categories has no obvious scientific or statistical rationale or interest, and I think you would have an uphill task justifying it convincingly in a report. If the counts seem too spiky (many zeros, some relatively high values) to handle easily, then an old-fashioned but still possibly useful method would be a transformation such as square roots.
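A minimal sketch of that count-model route, assuming the raw counts are still available (Python with statsmodels; the data are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Hypothetical raw counts per site; environment A is more favorable than B
df = pd.DataFrame({"env": rng.choice(["A", "B"], size=200)})
df["count"] = rng.poisson(np.where(df["env"] == "A", 4.0, 0.5))

# Poisson regression of counts on environment
X = sm.add_constant(pd.get_dummies(df["env"], drop_first=True, dtype=float))
fit = sm.GLM(df["count"], X, family=sm.families.Poisson()).fit()
print(fit.summary())

# Old-fashioned alternative for spiky counts: a square-root transform
df["sqrt_count"] = np.sqrt(df["count"])
```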

A fuller answer would need more information on what data you have. At present the picture is of counts at various sites in different kinds of environment. With nothing else said, that points to an ANOVA on transformed counts (the old way) or a Poisson or other count model (the newer way). In such an analysis, the predicted mean counts automatically give you the framework you desire, defining what is typical for an environment and hence what is not.
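To make that last point concrete, here is a sketch (again with made-up data, as in the earlier example) of how the fitted means and residuals single out unusual sites:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
# Hypothetical counts by environment, as in the earlier sketch
df = pd.DataFrame({"env": rng.choice(["A", "B"], size=200)})
df["count"] = rng.poisson(np.where(df["env"] == "A", 4.0, 0.5))

X = sm.add_constant(pd.get_dummies(df["env"], drop_first=True, dtype=float))
fit = sm.GLM(df["count"], X, family=sm.families.Poisson()).fit()

# The predicted mean defines what is "typical" for each site's environment
df["expected"] = fit.predict(X)
# Pearson residuals put every site on a comparable scale
df["pearson"] = (df["count"] - df["expected"]) / np.sqrt(df["expected"])

# Sites with abnormally high occurrence for their environment
print(df[df["pearson"] > 2].sort_values("pearson", ascending=False))
```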
