Solved – How best to normalize count data to compare two distributions

correspondence-analysis, discrete data, distributions, normalization

Say I have a vector of length 1000. At each position (1 … 1000) there is a count. I have two such vectors with different ranges of counts: in vector A the maximum count at a position is 30, whereas in vector B the maximum is, say, 200 (i.e., there are more counts in B than in A).

So essentially I have two discrete distributions and when I plot the two distributions they have peaks and troughs (some regions along the vector have higher counts than others). I am not interested in the differences in raw counts between A and B but instead wish to compare the shape of the curves in order to test whether the same regions (based on the index) in A and B have the higher counts relative to other regions.

My problem is that in order to compare the shapes I need to normalize the counts in each vector. I do not have much of a stats background, so any advice regarding an appropriate normalization procedure would be appreciated. The best solution I can think of, though probably not the best one available, is to transform the vectors to have the same mean.

Best Answer

So you have two vectors of counts (length 1000) and want to compare their shapes, irrespective of the absolute counts. You can put the two vectors together as a contingency table, with the individual vectors $A, B$ as rows; in R you could do `mytable <- rbind(A, B)`. Comparing the shapes, or profiles, of the rows is exactly what correspondence analysis does, and in this case that is more or less what the comment by @Michael Chernick amounts to. A related post is "Comparing two histograms using Chi-Square distance", where the concept of chi-square distance is explained. Another post, "How to assess the similarity of two histograms?", gives alternatives.
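As a rough illustration of the row-profile idea, here is a minimal R sketch. The counts are simulated purely for demonstration (the `shape` curve and the scaling constants are made up, not the asker's data), and the chi-square distance is computed by hand from its usual definition rather than taken from a correspondence analysis package.

```r
# Simulated counts for illustration only: both vectors follow the same
# hypothetical underlying shape, but B has far more counts than A.
set.seed(1)
n_pos <- 1000
shape <- dnorm(seq_len(n_pos), mean = 400, sd = 120) + 0.0005
A <- rpois(n_pos, lambda = 5000 * shape)
B <- rpois(n_pos, lambda = 40000 * shape)

# Contingency table with the two vectors as rows.
mytable <- rbind(A, B)

# Row profiles: each row divided by its own total, so each row sums to 1.
# This removes the difference in overall counts and keeps only the shape.
profiles <- prop.table(mytable, margin = 1)

# Column masses: the share of all counts that falls at each position.
col_mass <- colSums(mytable) / sum(mytable)

# Positions with no counts in either vector carry no information
# and would cause a division by zero, so they are dropped.
keep <- col_mass > 0

# Chi-square distance between the two row profiles: squared differences
# at each position, weighted by the inverse of the column mass.
chisq_dist <- sqrt(sum((profiles["A", keep] - profiles["B", keep])^2 / col_mass[keep]))
chisq_dist
```

A small distance suggests the two profiles have similar shapes. For a two-row table this distance is tied directly to the Pearson chi-square statistic of the table, so `chisq.test(mytable)` can be read as a test of whether the two profiles differ, with the usual caveat that many small expected counts make the chi-square approximation unreliable.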

If you don't know about correspondence analysis, you could just search this site for the correspondence-analysis tag.