Solved – Normalizing a 2D-histogram and getting the marginals

distributionshistogrammarginal-distributionMATLAB

I have a 2D-histogram for two vectors, s1 and s2, generated using the hist3 function in Matlab:

[hist2D, binC] = hist3([s1' s2']);

I am normalizing it by making its total volume equal to unity, in the following manner:

L = binC{1}(2) - binC{1}(1);
B = binC{2}(2) - binC{2}(1);
totalVolume = sum(sum(hist2D.*L*B));
prob2D = hist2D/totalVolume;

Question: Is this correct way to normalize a 2D-histogram?

I have also normalized the 1D-histograms for s1 and s2, as shown below.

[hist1, binCentres1] = hist(s1);
binWidth1 = binCentres1(2) - binCentres1(1);
prob1 = hist1 / (sum(hist1) * binWidth1);
%same for s2

Question: How do I get the marginal (1D-)histograms from the normalized 2D-histogram?

I have tried doing this in the following manner:

prob1M = sum(prob2D, 2); %extract marginal for s1
prob2M = sum(prob2D, 1); %extract marginal for s2

If I were doing this correctly, I'd expect prob1 to be equal to prob1M. I seem to be on the right track because the bar graphs (below) look similar, but scaled on the vertical axis. Maybe I'm doing the normalizations wrong?

Bar Graphs fro prob1 and prob1M

I tried normalizing prob1M as well, using:

prob1M = prob1M / (sum(prob1M) * binWidth1);

Question: Is normalization still necessary if you get the marginals from a normalized 2D-histogram? Why / Why not?

After normalization, the area of prob1 and prob1M is equal to 1, where area is calculated as:

area = sum(binWidth1.*prob1)
%same for prob1M

However, prob1 and prob1M (and prob2 and prob2M) are still slightly different:

prob1 = 0.9412 0.4412 0.3235 0.3235 0.2941 0.3235 0.3235 0.4118 0.4706 1.1471
prob1M = 0.9706 0.4706 0.3824 0.3235 0.3235 0.3235 0.3530 0.3824 0.4412 1.0295

Thank you for your suggestions.

Best Answer

Densities can be hard to work with. Whenever you can, calculate with the total probabilities instead.

Usually, histograms begin with point data, such as these 10,000 points:

Scatterplot

A general 2D histogram tessellates the domain of the two variables (here, the unit square) by a collection $P$ of non-overlapping polygons (usually rectangles or triangles). To each polygon $p$ it assigns a density (probability or relative frequency per unit area). This is computed as

$$\text{density}(p) = \frac{\text{count within}(p)}{\text{total count}} / \text{area}(p).$$

The $\frac{\text{count within}(p)}{\text{total count}}$ part estimates the probability of $p$; when it is divided by the area of $p$, you get the density.

2D histogram

^{In this 2D histogram, the unit square has been tessellated by rectangles of width $1/26$ and height $1/11$.}

2D histograms represent probability (or relative frequency) by means of volume: for each polygon $p$, the product of height and base, or density * area, returns $\frac{\text{count within}(p)}{\text{total count}}$. As a check, the total probability is obtained by summing the volumes over all polygons:

$$\eqalign{ \text{Total probability} &= \sum_{p \in P}\text{area}(p)\text{density}(p) \\ &= \sum_{p \in P}\frac{\text{count within}(p)}{\text{total count}} \\ &= \frac{1}{{\text{total count}}}\sum_{p \in P}\text{count within}(p) \\ &= \frac{\text{total count}}{\text{total count}}, }$$

which is equal to unity, as it should. (In the previous image, the histogram heights range from $0$ almost up to $3$; the total volume is $1$.)

To get a marginal density--say, along the x-axis--you slice that axis into bins at cutpoints $x_0 \lt x_1 \lt x_2 \lt \cdots \lt x_n$. (These are allowed to have unequal lengths.) Each bin $(x_i, x_{i+1}]$ determines a vertical slice of the 2D region (consisting of all points $(x,y)$ for which $x_i \lt x \le x_{i+1}$). Let's call this strip $S_i$. As with any (1D) histogram, compute (or estimate) the total probability within each bin and divide by the bin width to obtain the histogram value. The total probability is usually estimated as the the sum of probabilities in polygons intersecting that strip:

$$\Pr[x_i\lt x \le x_{i+1}] = \sum_{p \in P}\text{area}(p\cap S_i)\text{density}(p).$$

Dividing this value by $x_{i+1} - x_i$ gives the value for the histogram of the marginal distribution. Repeat for each bin.

Marginals

^{The x-marginal histogram is in blue and the y-marginal histogram is in red. Each has a total area of $1$.}

Related Solutions

Solved – Too big (?) histogram values when using normed histo options in SciPy and matplotlib

Those look okay to me.

The thing that you have to sum is the area of each histogram bar, base times height.

The bases there are about 5.723e-15

By eye, the height of a typical bar then is about 1.6e13, making a typical bar around 0.09 in area, give or take.

Which suggests that the total area will be close to 1

Solved – Python: “Normalizing” kde, so it always lines up with histogram

Small discrete sample. Here is a 'standard graphics' plot in R of your observations.

 x = c(0.0, 0.0, 0.0, 0.0, 0.5, -0.12500000000000003, 0.0, -0.4, 0.0, 0.25)

hist(x, br=10, prob=T, col = "skyblue2")
rug(x)

The argument prob=T makes a histogram on a density scale in which the total area of the histogram bars sums to unity.
The argument br=10 'suggests' approximately ten bins, to provide a more reasonable match to the values in this extremely small and discrete sample; the default would give five bins.
The command rug puts small tick marks along the horizontal axis to show locations of observations within bins.

 lines(density(x), lwd=2, col="brown")

The lines statement overlays the default kernel density estimator (KDE) of the density procedure onto the histogram. One can change the bandwidth of the KDE with an appropriate argument.

In my experience, the area under KDE curves, made with the default density in R, is very nearly unity. Thus KDE's are calibrated to facilitate easy comparison with density histogram.

Of course, such comparisons are more fruitful with relatively large samples from continuous distributions, than with small discrete samples.

Large, continuous sample. For example, here is a histogram of a sample of size $n = 5000$ from $\mathsf{Gamma}(6, 1).$ The broken brown curve is the default KDE and the thin black curve is the density of the population.

y = rgamma(5000, 6, 1)
hist(y, prob=T, col="skyblue2", main="Histogram of sample of 5000 from GAMMA(6, 1)")
curve(dgamma(x, 6, 1), 0, 20, add=T) # `curve` requres fcn of `x`
lines(density(y), lwd=2, col="brown", lty="dashed")

Summary: The point is not to show that 'R is better than Python', but my knowledge of R is greater than my knowledge of Python, so I am more likely to make useful statements about R.

The point is to highlight characteristics of samples, and specifications for histograms and KDE's that can lead to fruitful comparisons. (Presumably there are ways to control the appearance of histograms and KDE's in Python roughly to match what I have shown in R.)

Addendum: With enough adjustments of bin widths and centers of the histogram and the kernel shapes and bandwidth of the KDE, I suppose you can get the KDE to emulate the histogram of almost any data as you seem to be suggesting.

x = c(0.0, 0.0, 0.0, 0.0, 0.5, -0.12500000000000003, 0.0, -0.4, 0.0, 0.25)
cutp = seq(-.42, .6, by=.05)
hist(x, br=cutp, prob=T, ylim=c(0,17), col = "skyblue2");  rug(x)
lines(density(x, adjust=.1))

Best Answer

Related Solutions

Solved – Too big (?) histogram values when using normed histo options in SciPy and matplotlib

Solved – Python: “Normalizing” kde, so it always lines up with histogram

Related Question