If you're interested in comparing means, once you transform you end up with a comparison of things that are not means. If the right assumptions hold you can still test for a difference, but the alternative won't be location-shift.
I didn't want the details to detract from the general point.
On the other - and more important - hand, if you omit essential details you'll be more likely to end up with less useful - or even potentially misleading - answers that you won't even realize aren't the answers you need.
By leaving out the fact that you were dealing with count data, you were risking exactly that. While leaving out unnecessary detail is probably useful, knowing it's count data is pretty much central to the problem.
There are techniques for comparing means that are suitable for count data. With some more information about the kind of analysis/information you were after (even if it's what you would have done if the data were normal), we may be able to guide you better.
Transformation is less useful than doing something suited to your actual data.
It's not clear from your question why you need to transform at all.
(What are you trying to achieve and why?)
As for why logs might make the appearance more symmetric in some cases and not others, not all distributions are the same - while a log transformation may sometimes make skewed data nearly symmetric, there's no guarantee that it always will.
Often other transformations do much better.
For example logs work very nicely on lognormal distributions, while cube roots do better on gamma. Below, $a$ is simulated from a lognormal distribution, and $b$ from a gamma distribution. They look vaguely similar, but the log-transform makes $a$ symmetric (in fact, normal), while making $b$ left-skewed. On the other hand a cube root transformation leaves $a$ still somewhat right skew, but makes $b$ very nearly symmetric (and pretty close to normal):
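A rough way to check this numerically is to compare sample skewness before and after each transformation. The sketch below is not the original simulation: the seed, the sample size, and the parameters (lognormal with log-scale sd 0.75, gamma with shape 2) are assumptions chosen only for illustration.

```python
import math
import random


def skewness(xs):
    """Sample skewness: third central moment divided by variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20_000

# Assumed parameters; the settings behind the original plots are not given.
a = [rng.lognormvariate(0.0, 0.75) for _ in range(n)]  # lognormal
b = [rng.gammavariate(2.0, 1.0) for _ in range(n)]     # gamma, shape 2

results = {}
for name, data in (("a", a), ("b", b)):
    results[name] = {
        "raw": skewness(data),
        "log": skewness([math.log(x) for x in data]),
        "cube root": skewness([x ** (1 / 3) for x in data]),
    }

for name, row in results.items():
    print(name, {label: round(value, 2) for label, value in row.items()})
```

With these settings the log of $a$ should have skewness near zero and the log of $b$ clearly negative skewness, while the cube root should leave $a$ positively skewed and bring $b$ close to zero.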
Other times there's simply no monotonic transformation that achieves approximate symmetry. For example, if your distribution is discrete and sufficiently skew, like a geometric(0.5) or a Poisson(0.5), no monotonic transformation can make it reasonably normal: a monotonic transformation moves the spikes but keeps their order and heights, so wherever you put them, the leftmost spike will always be taller than the next one.
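To see why, here is a small sketch for the Poisson(0.5) case. The two transformations (square root and `log1p`) are arbitrary increasing functions picked for illustration; any other strictly increasing function would behave the same way.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson variable with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


support = list(range(6))
pmf = [poisson_pmf(k, 0.5) for k in support]

# Strictly increasing transformations, chosen only as examples.
transforms = {
    "identity": lambda k: float(k),
    "sqrt": math.sqrt,
    "log1p": math.log1p,
}

transformed = {}
for name, g in transforms.items():
    # Each atom moves to g(k) but keeps its probability.
    transformed[name] = [(g(k), p) for k, p in zip(support, pmf)]
    print(name, [(round(x, 3), round(p, 3)) for x, p in transformed[name]])
```

In every row the probabilities are identical and in the same left-to-right order; only the positions change. The first atom carries about 0.61 of the mass and the second about 0.30, so the result stays strongly right-skewed however the positions are spaced.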
Incidentally, you might want to use more bars on your histograms, and maybe consider using other displays as well, to get a handle on the distributional shape. See my cautionary tale.
Best Answer
The design goals of the family of Box-Cox transformations of non-negative data were these:
1. The formulas should be simple, straightforward, well understood, and easy to calculate.
2. They should not change the middle of the data much, but affect the tails more.
3. The family should be rich enough to induce large changes in the skewness of the data if necessary: this means it should be able to contract or extend one tail of the data while extending or contracting the other, by arbitrary amounts.
Let's consider the implications of each in turn.
1. Simplicity
Linear transformations--those of the form $x\to \alpha x + \beta$ for constants $\alpha$ and $\beta$--merely change the scale and location of data; they cannot change the shape of their distribution. The next simplest formulas are the power transformations, of the form $x\to x^\lambda$ for (nonzero) constant $\lambda.$
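As a quick illustration, using sample skewness as one (imperfect) summary of shape, a linear transformation with positive $\alpha$ leaves the skewness untouched while powers change it. The batch and the constants below are made up for the example.

```python
def skewness(xs):
    """Sample skewness: third central moment divided by variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


batch = [1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 12.0]  # made-up, right-skewed

linear = [3.0 * x + 10.0 for x in batch]  # alpha = 3, beta = 10
squared = [x ** 2 for x in batch]         # lambda = 2
root = [x ** 0.5 for x in batch]          # lambda = 1/2

for label, data in (("original", batch), ("linear", linear),
                    ("squared", squared), ("square root", root)):
    print(f"{label:12s} skewness = {skewness(data):.3f}")
```

The first two lines should agree; squaring should increase the skewness of this batch and the square root should reduce it.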
2. Stability
A power transformation enjoys the nice property that rescaling the data results in rescaling their powers. That is, multiplying the data $x$ by some positive scale factor $\alpha$ results in multiplying $x^\lambda$ by $\alpha^\lambda.$ OK, it's not the same scale factor, but it is still just a rescaling.
In light of this, let's always standardize any batch of data $(x_1, x_2, \ldots, x_n)$ by rescaling it to place its center (perhaps its median) at $1.$ Specifically, this replaces each $x_i$ by $x_i$ divided by the middle value of all the $x$'s. This won't change the shape of the data distribution--it really amounts to choosing a suitable unit of measurement for expressing the values. For those who like formulas, let $\mu$ be the median of the batch. We will be studying the transformations
$$x \to \frac{(x/\mu)^\lambda - 1}{\lambda} = \frac{\mu^{-\lambda}}{\lambda}\,x^\lambda + \frac{-1}{\lambda} = \alpha\, x^\lambda + \beta$$
for various $\lambda.$ The effects of $\alpha$ and $\beta$ (which depend on $\lambda$ and $\mu$) on $x^\lambda$ do not change the shape of the distribution of the $x_i^\lambda.$ In this sense, the Box-Cox transformations of the standardized data really are just the power transformations.
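The identity is easy to confirm numerically. This sketch uses a made-up batch and a handful of nonzero $\lambda$ values; the function name `box_cox_standardized` is just a label for the left-hand side of the formula.

```python
from statistics import median


def box_cox_standardized(x, mu, lam):
    """Left-hand side: ((x / mu)**lam - 1) / lam, for nonzero lam."""
    return ((x / mu) ** lam - 1.0) / lam


batch = [0.8, 1.5, 2.0, 3.1, 4.7, 9.6]  # made-up positive data
mu = median(batch)

max_gap = 0.0
for lam in (-1.0, -0.5, 0.5, 2.0):
    alpha = mu ** (-lam) / lam  # coefficient of x**lam
    beta = -1.0 / lam           # constant shift
    for x in batch:
        lhs = box_cox_standardized(x, mu, lam)
        rhs = alpha * x ** lam + beta
        max_gap = max(max_gap, abs(lhs - rhs))

print("largest discrepancy:", max_gap)
```

The discrepancy should be at the level of floating-point rounding, confirming that standardizing by the median only contributes the constants $\alpha$ and $\beta.$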
Because we have made $1$ the central value of the batch, design criterion 2--"stability"--requires that different values of the power $\lambda$ have relatively little effect on values near $1.$
Let's look at this in a little more detail by examining what a power does to numbers near $1.$ According to the Binomial Theorem, if we write $x$ as $x=1+\epsilon$ (for fairly small $\epsilon$), then approximately
$$(1 + \epsilon)^\lambda = 1 + \lambda \epsilon + \text{Something}\times \epsilon^2.$$
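Ignoring $\epsilon^2$ as being truly tiny, this tells us that taking a power $\lambda$ of a number $x$ near $1$ is a nearly linear function that changes the distance between $x$ and $1$ by a factor of $\lambda:$
$$(1 + \epsilon)^\lambda - 1 \approx \lambda\,\epsilon.$$
In light of this, we can match the effects of different possible $\lambda$ by means of a compensating division of the distance by $\lambda.$ That is, we will use
$$\operatorname{BC}_\lambda(x) = \frac{x^\lambda - 1^\lambda}{\lambda} = \frac{x^\lambda - 1}{\lambda}.$$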
The numerator is the (signed) distance between the power transform of $x$ and the power transform of the middle of the data ($1$); the denominator adjusts for the expansion of $x-1$ by the factor $\lambda$ when taking the power. $\operatorname{BC}_\lambda$ is the Box-Cox transformation with parameter $\lambda.$
By means of this construction, we guarantee that when $x$ is close to a typical value of its batch of data, $\operatorname{BC}_\lambda(x)$ will approximately be the same value (and close to zero) no matter what $\lambda$ might be (within reason, of course: extreme values of $\lambda$ can do extreme things).
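A small numerical check of this claim, taking $\operatorname{BC}_\lambda(x) = (x^\lambda - 1)/\lambda$ for data already standardized to have middle $1,$ and using the natural log for $\lambda=0.$ The test points are arbitrary.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of standardized data; lam == 0 means natural log."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


lambdas = (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0)


def spread(x):
    """Range of BC_lambda(x) across the lambdas above."""
    values = [box_cox(x, lam) for lam in lambdas]
    return max(values) - min(values)


for x in (0.95, 1.0, 1.05, 3.0):
    values = [round(box_cox(x, lam), 4) for lam in lambdas]
    print(f"x = {x}: {values}  spread = {spread(x):.4f}")
```

Near $x=1$ every $\lambda$ gives almost the same value, close to $x-1.$ Out at $x=3$ the values differ by more than three units, which is where the choice of $\lambda$ starts to matter.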
3. Flexibility
We have many possible values of $\lambda$ to choose from. How do they differ?
This can be explored by graphing the Box-Cox transformations for various $\lambda.$ Here is a set of graphs for $\lambda \in \{-1,-1/2, 0, 1/2, 1, 2\}.$ (For the meaning of $\lambda=0,$ see Natural Log Approximation elsewhere on this site.)
The solid black line graphs the Box-Cox transformation for $\lambda=1,$ which is just $x\to x-1.$ It merely shifts the center of the batch to $0$ (as do all the Box-Cox transformations). The upward curving pink graph is for $\lambda=2.$ The downward curving graphs show, in order of increasing curvature, the smaller values of $\lambda$ down to $-1.$
The differing amounts and directions of curvature provide the desired flexibility to change the shape of a batch of data.
For instance, the upward curving graph for $\lambda=2$ exemplifies the effect of all Box-Cox transformations with $\lambda$ exceeding $1:$ values of $x$ above $1$ (that is, greater than the middle of the batch, and therefore out in its upper tail) are pulled further and further away from the new middle (at $0$). Values of $x$ below $1$ (less than the middle of the batch, and therefore out in its lower tail) are pushed closer to the new middle. This "skews" the data to the right, or high values (rather strongly, even for $\lambda=2$).
The downward curving graphs, for $\lambda \lt 1,$ have the opposite effect: they push the higher values in the batch towards the new middle and pull the lower values away from the new middle. This skews the data to the left (or lower values).
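The direction of the effect can be checked on a tiny batch that is symmetric about $1$ (values made up for the example), again taking $\operatorname{BC}_\lambda(x) = (x^\lambda - 1)/\lambda$ with the natural log at $\lambda=0$ and using sample skewness as the measure.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of standardized data; lam == 0 means natural log."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def skewness(xs):
    """Sample skewness: third central moment divided by variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


batch = [0.5, 0.75, 1.0, 1.25, 1.5]  # symmetric about its median, 1
lambdas = (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0)

skews = [skewness([box_cox(x, lam) for x in batch]) for lam in lambdas]
for lam, s in zip(lambdas, skews):
    print(f"lambda = {lam:4.1f}  skewness = {s:6.3f}")
```

For this batch the skewness is zero at $\lambda=1,$ positive at $\lambda=2,$ and increasingly negative as $\lambda$ drops towards $-1.$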
The coincidence of all the graphs near the point $(1,0)$ is a result of the previous standardizations: it constitutes visual verification that choice of $\lambda$ makes little difference for values near the middle of the batch.
Finally, let's look at what different Box-Cox transformations do to a small batch of data.
Transformed values are indicated by the horizontal positions. (The original data look just like the black dots, shown at $\lambda=1,$ but are located $+1$ units to the right.) The colors correspond to the ones used in the first figure. The underlying gray lines show what happens to the transformed values when $\lambda$ is smoothly varied from $-1$ to $+2.$ It's another way of appreciating the effects of these transformations in the tails of the data. (It also shows why the value of $\lambda=0$ makes sense: it corresponds to taking values of $\lambda$ arbitrarily close to $0.$)
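The limiting behavior near $\lambda=0$ can also be seen numerically. In this sketch the test point $x=2.5$ and the sequence of $\lambda$ values are arbitrary choices.

```python
import math


def box_cox(x, lam):
    """(x**lam - 1) / lam for nonzero lam."""
    return (x ** lam - 1.0) / lam


x = 2.5
small_lambdas = (0.1, 0.01, 0.001)

# Distance between the Box-Cox value and the natural log of x.
gaps = [abs(box_cox(x, lam) - math.log(x)) for lam in small_lambdas]

for lam, gap in zip(small_lambdas, gaps):
    print(f"lambda = {lam}: BC = {box_cox(x, lam):.6f}, |BC - ln x| = {gap:.6f}")
print(f"ln x = {math.log(x):.6f}")
```

The gap shrinks roughly in proportion to $\lambda,$ which is why defining the $\lambda=0$ member of the family as the natural logarithm fills the hole left by dividing by $\lambda.$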