Displaying frequency when using kernel density estimation

kernel-smoothing, stata

I am trying to plot a kernel density of a single variable in Stata where the y-axis is displayed as a frequency rather than the default density scale. For a histogram, this is trivial; the syntax is:

histogram x, frequency

Furthermore, if I wished to plot a histogram with a kdensity overlaid and maintain the frequency scaling, I would type:

histogram x, frequency kdensity area(n)

where n is the number of non-missing observations.

However, I cannot find a command to plot a simple kdensity on a frequency scale. I imagine a workaround could be to draw the histogram with the frequency option and set the colour to white (so it is invisible against the background) and overlay a scaled kdensity, but this seems a little cumbersome.

Best Answer

There is a statistical issue hidden inside this question, which is otherwise off-topic here given its focus on Stata code. The OP probably doesn't need this explanation, but it may help others and is added here to try to shift the question closer to this forum.

Consider histograms first and for simplicity focus on the case in which bins are of equal width. When histograms in any language show frequencies (counts) it is tacit that the scale is really frequency in bins of stated width, and familiar that as you increase or decrease the bin width, frequencies will typically increase or decrease accordingly.

On a plot showing (kernel or other) density estimates, the conventional scale becomes probability per unit of measurement, where the unit of measurement will be metres or US dollars or kg or whatever is used. Although it is estimated at discrete points, the density is shown as a continuous curve, and the principle is that the area under the curve is 1, the total probability in the distribution.

Nothing stops you using different units, but it is meaningless to ask for frequencies pure and simple: the scale could only be frequency per unit of measurement. How to show that depends on the language or environment you are using. In Stata, you would need to do the calculations off-stage, multiplying the densities shown as axis labels by the total frequency in the distribution, so that the axis labels the reader sees are frequencies per unit $=$ total frequency $\times$ probability per unit.
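To make the rescaling concrete, here is the same relation in symbols, with a small worked example. The symbols $\hat f$, $n$ and $h$ and the numbers are illustrative assumptions, not taken from the question.

$$
g(x) = n\,\hat f(x), \qquad \int g(x)\,dx = n \int \hat f(x)\,dx = n,
$$

where $\hat f(x)$ is the estimated density (probability per unit), $n$ is the total frequency, and $g(x)$ is frequency per unit. For comparison with a histogram of bin width $h$, the count expected in a bin centred near $x$ is approximately $n\,h\,\hat f(x)$. For instance, with $n = 74$ and $\hat f(x) = 0.05$ per unit, $g(x) = 74 \times 0.05 = 3.7$ observations per unit, and a bin of width $h = 5$ would hold roughly $3.7 \times 5 = 18.5$ observations. This is why the frequency scale of a histogram always depends on the bin width, whereas frequency per unit does not.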

Hint: mylabels from SSC is a Stata way to get axis labels shown on a different scale from that used by a plot command.
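As an alternative to relabelling the axis, the sketch below rescales the curve itself so that the area under it equals the number of observations. It relies on the area() option of twoway kdensity. The auto dataset and the variable mpg are stand-ins chosen for illustration, not part of the original question.

```stata
* Minimal sketch: the auto dataset and mpg are illustrative assumptions;
* substitute your own data and variable.
sysuse auto, clear

* Total frequency = number of non-missing observations.
quietly count if !missing(mpg)
local n = r(N)

* area() rescales the density so the area under the curve is n, not 1,
* so the y-axis reads frequency per unit of mpg.
twoway kdensity mpg, area(`n') ytitle("Frequency per unit of mpg")
```

Note that the resulting axis still shows frequency per unit of measurement, so its values will match histogram bar heights only for bins of width 1.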
