Densities can be hard to work with. Whenever you can, calculate with the total probabilities instead.
Usually, histograms begin with point data, such as these 10,000 points:
A general 2D histogram tessellates the domain of the two variables (here, the unit square) by a collection $P$ of non-overlapping polygons (usually rectangles or triangles). To each polygon $p$ it assigns a density (probability or relative frequency per unit area). This is computed as
$$\text{density}(p) = \frac{\text{count within}(p)}{\text{total count}} / \text{area}(p).$$
The $\frac{\text{count within}(p)}{\text{total count}}$ part estimates the probability of $p$; when it is divided by the area of $p$, you get the density.
In this 2D histogram, the unit square has been tessellated by rectangles of width $1/26$ and height $1/11$.
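As a rough sketch of the density formula above, here is how the per-rectangle densities could be computed in plain Python for a regular grid on the unit square. The sample points are simulated from arbitrary Beta distributions purely for illustration (they are not the data behind the figures), and `histogram2d_density` is a hypothetical helper name.

```python
import random

def histogram2d_density(points, nx, ny):
    """Density per cell for an nx-by-ny rectangular grid on the unit square."""
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        # min() keeps a point lying exactly on the upper edge in the last cell
        i = min(int(x * nx), nx - 1)
        j = min(int(y * ny), ny - 1)
        counts[i][j] += 1
    total = len(points)
    cell_area = (1.0 / nx) * (1.0 / ny)
    # density(p) = (count within p / total count) / area(p)
    return [[c / total / cell_area for c in row] for row in counts]

rng = random.Random(0)  # fixed seed; illustrative data only
points = [(rng.betavariate(2, 5), rng.betavariate(3, 3)) for _ in range(10_000)]
density = histogram2d_density(points, nx=26, ny=11)
```

Each entry `density[i][j]` is a probability per unit area, so it can exceed $1$ even though every cell's probability is at most $1$.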
2D histograms represent probability (or relative frequency) by means of volume: for each polygon $p$, the product of height and base, or density * area, returns $\frac{\text{count within}(p)}{\text{total count}}$. As a check, the total probability is obtained by summing the volumes over all polygons:
$$\eqalign{
\text{Total probability} &= \sum_{p \in P}\text{area}(p)\text{density}(p) \\
&= \sum_{p \in P}\frac{\text{count within}(p)}{\text{total count}} \\
&= \frac{1}{{\text{total count}}}\sum_{p \in P}\text{count within}(p) \\
&= \frac{\text{total count}}{\text{total count}},
}$$
which is equal to unity, as it should be. (In the previous image, the histogram heights range from $0$ almost up to $3$; the total volume is $1$.)
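The same check can be run numerically. The sketch below assumes a tessellation by rectangles with unequal sides (the cutpoints are made up for illustration), computes area and density for each rectangle, and sums the volumes. `rect_histogram` is a hypothetical helper, and the points are assumed to lie inside the domain.

```python
import bisect
import random

def rect_histogram(points, x_edges, y_edges):
    """Return (areas, densities) for the rectangles defined by the cutpoints."""
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        # bisect_left assigns x to the half-open bin (x_i, x_{i+1}];
        # the clamp puts a point exactly on the lowest edge into the first bin
        i = min(max(bisect.bisect_left(x_edges, x) - 1, 0), nx - 1)
        j = min(max(bisect.bisect_left(y_edges, y) - 1, 0), ny - 1)
        counts[i][j] += 1
    total = len(points)
    areas = [[(x_edges[i + 1] - x_edges[i]) * (y_edges[j + 1] - y_edges[j])
              for j in range(ny)] for i in range(nx)]
    densities = [[counts[i][j] / total / areas[i][j] for j in range(ny)]
                 for i in range(nx)]
    return areas, densities

rng = random.Random(1)  # illustrative uniform points on the unit square
points = [(rng.random(), rng.random()) for _ in range(5_000)]
x_edges = [0.0, 0.1, 0.3, 0.6, 1.0]  # unequal widths, chosen arbitrarily
y_edges = [0.0, 0.25, 1.0]
areas, densities = rect_histogram(points, x_edges, y_edges)

# Total probability = sum over rectangles of area * density
total_volume = sum(a * d for row_a, row_d in zip(areas, densities)
                   for a, d in zip(row_a, row_d))
```

Up to floating-point rounding, `total_volume` should come out as $1$ regardless of how the cutpoints are chosen.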
To get a marginal density--say, along the x-axis--you slice that axis into bins at cutpoints $x_0 \lt x_1 \lt x_2 \lt \cdots \lt x_n$. (These are allowed to have unequal lengths.) Each bin $(x_i, x_{i+1}]$ determines a vertical slice of the 2D region (consisting of all points $(x,y)$ for which $x_i \lt x \le x_{i+1}$). Let's call this strip $S_i$. As with any (1D) histogram, compute (or estimate) the total probability within each bin and divide by the bin width to obtain the histogram value. The total probability is usually estimated as the sum of probabilities in polygons intersecting that strip:
$$\Pr[x_i\lt x \le x_{i+1}] = \sum_{p \in P}\text{area}(p\cap S_i)\text{density}(p).$$
Dividing this value by $x_{i+1} - x_i$ gives the value for the histogram of the marginal distribution. Repeat for each bin.
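Here is a small sketch of that procedure, assuming the 2D histogram is built on a rectangular grid so that $\text{area}(p\cap S_i)$ is just the overlap in $x$ times the height of the rectangle. The grid, the densities, and the marginal cutpoints are toy values, and `x_marginal` is a hypothetical helper name.

```python
def x_marginal(x_edges, y_edges, density, cuts):
    """Marginal histogram heights along x for strips defined by `cuts`."""
    heights = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        prob = 0.0
        for i in range(len(x_edges) - 1):
            # width of the intersection of rectangle column i with the strip
            overlap = max(0.0, min(hi, x_edges[i + 1]) - max(lo, x_edges[i]))
            for j in range(len(y_edges) - 1):
                rect_height = y_edges[j + 1] - y_edges[j]
                # area(p intersect S_i) * density(p)
                prob += overlap * rect_height * density[i][j]
        heights.append(prob / (hi - lo))  # probability divided by bin width
    return heights

# Toy 2x2 histogram on the unit square; the four volumes sum to 1
x_edges = [0.0, 0.5, 1.0]
y_edges = [0.0, 0.5, 1.0]
density = [[0.4, 1.2],
           [1.6, 0.8]]

cuts = [0.0, 0.25, 1.0]  # marginal bins need not match the 2D grid
heights = x_marginal(x_edges, y_edges, density, cuts)
marginal_area = sum(h * (b - a) for h, a, b in zip(heights, cuts[:-1], cuts[1:]))
```

For these toy numbers the strip probabilities are $0.2$ and $0.8$, so the heights are $0.8$ and about $1.067$, and the marginal histogram has total area $1$.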
The x-marginal histogram is in blue and the y-marginal histogram is in red. Each has a total area of $1$.
Welcome to CV!
Do the # of classes only have to be the same?
Not necessarily.
Do the histograms both have to start at the same point (0 or 20)?
Highly recommended. The axes should also end at the same number, and it is better still if they have the same length.
More babbling: It depends on what you mean by "comparing." From just the two histograms I can definitely compare and contrast the skewness of the distributions. But beyond that, it's not easy to say anything visually about the frequencies at different ages because the class widths are different and, more crucially, the x-axes are different.
A panelled histogram that can successfully enhance visual comparison should have common y and x axes like this example:
If you allow the x- and y-axes to extend to wherever they want, it's difficult to compare across graphs.
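One way to get common axes is to derive a single set of bin edges from the pooled data and reuse it for every panel. The sketch below does this in plain Python with made-up ages for two groups; the data and the helper names `shared_edges` and `relative_frequencies` are hypothetical.

```python
def shared_edges(samples, n_bins):
    """Equal-width bin edges spanning the pooled range of all samples."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins + 1)]

def relative_frequencies(sample, edges):
    """Relative frequency per bin; bins are [a, b), the last one is [a, b]."""
    n_bins = len(edges) - 1
    lo = edges[0]
    width = (edges[-1] - edges[0]) / n_bins
    counts = [0] * n_bins
    for v in sample:
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return [c / len(sample) for c in counts]

# Made-up ages, for illustration only
males = [20, 23, 27, 31, 35, 38, 44, 52]
females = [21, 24, 26, 29, 33, 41, 47, 60]

edges = shared_edges([males, females], n_bins=4)  # 20, 30, 40, 50, 60
freq_m = relative_frequencies(males, edges)
freq_f = relative_frequencies(females, edges)
```

Because both groups use the same `edges`, bar `k` in one panel covers exactly the same ages as bar `k` in the other. Using relative frequencies rather than raw counts also puts the two y-axes on the same scale when group sizes differ.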
Are these histograms sufficient/correct?
By themselves they are correct, but for cross-gender comparison they are insufficient.
What if I said it is mandatory to use the formulas above to find the # of classes and class width?
I'd have a hard time not laughing. For many reasons:
The formula was probably meant for making a single histogram, not for making two or more to be put together for comparison. It sounds to me like a shoemaker who always wants to use up the whole piece of leather on each shoe, and can never sell a pair because no two of his shoes actually match.
A histogram has more parameters than just sample size: the min and the max, the distribution of the people, the nuances one would like to highlight, etc. It's unfair to overlook all this other important information.
When it comes to visualizing data, it's important to realize that we are there to communicate the data, not to package it and force it down the viewers' throats. I'd probably stay away from all these prescriptive, bossy "you have to do this and do that..." instructions. (Though I give such instructions as well, guilty as charged.)
But(!), if your professor/supervisor--who can decide whether you pass or fail the course--said you have to use this formula, then I'd suggest picking your battles wisely.
Should I use the same class start for each in the histograms?
Not necessarily, but it is recommended because this form is the easiest to perceive. There are histograms that use unequal bin sizes. If for some reason a certain age categorization is more important for females than for males, you may make them unequal.
For instance, if one country's legal drinking age is 18 and another's is 21, you may group everyone under 18 into one column for the first country and everyone under 21 into one column for the second. The question is whether this would be more meaningful, or whether you would see the same pattern if you binned by single year of age.
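If you do use unequal bins, bar heights should be densities (relative frequency divided by bin width) rather than raw counts, otherwise wide bins look artificially tall. A minimal sketch, with hypothetical age cutpoints and counts:

```python
def density_heights(counts, edges):
    """Bar heights for a histogram with unequal bins: frequency / width."""
    total = sum(counts)
    return [c / total / (b - a)
            for c, a, b in zip(counts, edges[:-1], edges[1:])]

# Hypothetical age bins with an under-18 group lumped into one column
edges = [0, 18, 30, 50, 90]
counts = [90, 120, 200, 90]  # made-up counts, 500 people in total

heights = density_heights(counts, edges)
bar_areas = [h * (b - a) for h, a, b in zip(heights, edges[:-1], edges[1:])]
```

Here the first and last bins hold the same number of people, yet the last bar is less than half as tall because it spans 40 years instead of 18. The bar areas, not the heights, carry the relative frequencies and sum to $1$.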
In a nutshell, you'll need to know what you want the viewers to know, and work backward from there. Avoid starting with recipes.
Best Answer
A recent paper that may be worth reading is:
Cao, Y. and Petzold, L. Accuracy limitations and the measurement of errors in the stochastic simulation of chemically reacting systems, 2006.
Although this paper's focus is on comparing stochastic simulation algorithms, the main idea is essentially how to compare two histograms.
You can access the pdf from the author's webpage.
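To give a flavor of what comparing two histograms numerically can look like, here is a sketch of a simple L1 distance between two histograms built on the same bins. This is a generic illustration and not necessarily the exact definition used in the paper; the relative frequencies are made up, and `l1_histogram_distance` is a hypothetical name.

```python
def l1_histogram_distance(freq_a, freq_b):
    """Sum of absolute differences between bin-wise relative frequencies."""
    if len(freq_a) != len(freq_b):
        raise ValueError("histograms must share the same bins")
    return sum(abs(a - b) for a, b in zip(freq_a, freq_b))

# Made-up relative frequencies on four shared bins (each list sums to 1)
freq_a = [0.375, 0.375, 0.125, 0.125]
freq_b = [0.5, 0.125, 0.25, 0.125]

distance = l1_histogram_distance(freq_a, freq_b)  # 0.5 for these values
```

The distance is $0$ for identical histograms and $2$ when they share no occupied bin. Any observed value also has to be judged against the sampling noise you would see between two histograms drawn from the same distribution.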