Data Transformation – Log(x * Constant) Transformation

data transformation, data visualization, logarithm

I am a conservation biologist and I am creating a new index to classify species extinction risk. The index ranges from one to zero, with scores closer to one representing a higher extinction risk. Due to the distribution of the data, the calculated values are always very skewed: few species have a value close to one and most species have a value close to zero.

This pattern seems to be true, as a few species do have everything going wrong for them, which results in a disproportionate extinction risk, but this does not mean other species are not at risk as well. Moreover, this skewed distribution makes it hard to visualize any geographical or phylogenetic pattern in extinction risk.

To fix such issues, I decided to transform the data with the classic log(x+1) transformation, but that did not help a lot. I tried different transformations and what gave me the most satisfactory results was a log(x*constant) transformation, in which the constant is an arbitrarily large number, such as 10^100.

Trying to anticipate reviewers' objections, I looked for other examples of such a transformation being used in my field, but could not find any.

My questions are:

  1. Is this transformation something very unusual, or is it employed in any other fields?
  2. Is there any reason a reviewer could object to its use?
  3. Is there a more usual transformation that would yield similar results?

Edit, for clarification:

I am not doing a regression. I calculate an index based on several threats, such as deforestation and climate change and this index is the final product. If I want to display it on a map, for example, the skewness in the index will be a problem. The map will show only one region with a very high value (red) and the rest of the world with low values (blue), and I won't see any meaningful patterns there. Using log(x+1) shows a very similar map. Using log(x*constant) gives a more nuanced map, with different regions colored in different shades of red/blue, which is much more useful.

Best Answer

[Update]

Now that we know the question is about visualization and not analysis, the mathematical properties of the logarithm seem less relevant than how people perceive log-scaled data.

The goal of visualization is not to produce a colorful map but to highlight the message(s) you want the audience to take away. If there are a handful of species at extreme risk, a map with a few spots of concentrated red color might be an effective way to underline the urgency of the threat. So I challenge your claim that the skewness of risk indices is an issue to be fixed and that lots of colors in a plot make that plot more useful.

That comment aside, the original (0,1) scale and the log transformed scale both have advantages and disadvantages.

The original (0,1) scale:

  • Has a well-defined maximum and minimum value.
  • May be hard to interpret even though the numbers are between 0 and 1. What does risk = 0.5 tell us about species X? What about a risk difference of 0.1 between species X and species Y? Is difference(risk) = 0.9 - 0.8 = 0.1 (at the high risk end) comparable to difference(risk) = 0.2 - 0.1 = 0.1 (at the low risk end)?

The log transformed scale:

  • Doesn't have a well-defined minimum. Since log(0) is undefined, you have to choose an arbitrary strictly positive minimum risk, say 0.0001, so that the scale starts at log(0.0001).
  • The log transformation shows relative rather than absolute values. If the extinction risk index has no natural units, relative values might be more interpretable. For example, "species X is at twice the risk of species Y" is easy to understand. And you can use a species that's known to everyone as a kind of baseline.

Note: If you decide to log-transform, you should use $\log_2$ or $\log_{10}$ rather than the natural logarithm, $\log_e$, which is often the default. Then you can use the 2x or 10x interpretation; $e$x is harder to grasp.
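To make the 2x reading concrete, here is a minimal sketch. The two index values are made up for illustration and the helper name `fold_difference` is hypothetical; the point is only that a difference of one unit on the $\log_2$ scale corresponds to a doubling of risk.

```python
import math

def fold_difference(risk_a, risk_b):
    """Return (log2 difference, risk ratio) for two strictly positive risks."""
    log2_diff = math.log2(risk_a) - math.log2(risk_b)
    return log2_diff, 2 ** log2_diff

# Hypothetical index values for two species (not taken from the question).
diff, ratio = fold_difference(0.08, 0.02)

# A log2 difference of about 2 means two doublings, i.e. about 4x the risk.
print(f"log2 difference: {diff:.2f}, ratio: {ratio:.2f}")
```

With $\log_{10}$ the same reading works in factors of ten instead of doublings.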

I suggest you reconsider your choice of color palette. You mention red and blue colors which means you are using a divergent "blue-white-red" palette. Since no amount of extinction risk is good news for a species, it's more natural to use a sequential "white-red" palette. Keep in mind that in the caption you have to explain what the figure(s) say about extinction risk. So interpretability is more important than aesthetically pleasing colors.

Here is a simple demonstration of the difference between divergent and sequential colors. I plot extinction risk on the x-axis and log2(risk) on the y-axis. The risk values range from 0.001 to 1, to avoid taking the log of zero.

[Figure: extinction risk (x-axis) against log2(risk) (y-axis), colored with divergent and sequential palettes.]

In the two top rows, color is "unevenly" distributed because the log changes very quickly for small numbers and more slowly for large numbers. The divergent colors are harder to explain, particularly on the log scale where the white color corresponds to risk of $2^{-5}$ = 0.03125. Is there something special about risk of 0.03125 to use it to anchor the color scale? [In your own plot white might correspond to a different risk value — because of the log(const x risk) transformation — but probably just as arbitrary.]
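It may help to see what the constant actually does. Since log(constant × risk) = log(constant) + log(risk), the constant only shifts every value by the same amount. The sketch below assumes that the plotting software stretches the color scale from the smallest to the largest value in the data (check whether yours does); under that assumption the colors do not depend on the constant at all. The risk values and the helper `color_position` are hypothetical.

```python
import math

def color_position(value, lo, hi):
    """Map a value to [0, 1] along a color scale whose ends are lo and hi."""
    return (value - lo) / (hi - lo)

risks = [0.001, 0.01, 0.1, 0.5, 1.0]  # hypothetical index values
constant = 1e100                       # the arbitrary constant from the question

plain = [math.log10(r) for r in risks]
shifted = [math.log10(r * constant) for r in risks]

# Assumption: the color range is auto-scaled to the data minimum and maximum.
pos_plain = [color_position(v, min(plain), max(plain)) for v in plain]
pos_shifted = [color_position(v, min(shifted), max(shifted)) for v in shifted]

# The two sets of color positions agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(pos_plain, pos_shifted))
print(f"largest difference in color position: {max_gap:.1e}")

# The middle of a log color scale is the geometric mean of the two ends.
mid_risk = math.sqrt(min(risks) * max(risks))
print(f"risk at the middle of the scale: {mid_risk:.4f}")
```

For risks between 0.001 and 1 the middle of the scale falls at about 0.032, close to the $2^{-5}$ value above. So whatever made the map look more nuanced is the logarithm itself, not the size of the constant.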

All in all, the next steps might be to choose a meaningful palette and make sure the graphing software is not choosing the color range for you. And if you have multiple messages to convey in your paper, you might need multiple plots to express them.

Appendix

You also mention the log(x+1) transformation and that it shows a very similar map. Here is why:

[Figure: log2(risk + 1) plotted against risk.]

I've plotted risk and log2(risk+1) to show that the correspondence is approximately 1:1, except for a slight curvature most noticeable in the mid-range of risk values. Not only is the log2(risk+1) harder to interpret than either risk or log2(risk); it's also rather ineffective.
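A quick numerical check of the same point, as a sketch on an evenly spaced grid of risk values (the grid is my choice, not your data):

```python
import math

# Evenly spaced risk values from 0 to 1 (illustrative grid, not real data).
n = 1000
grid = [i / n for i in range(n + 1)]

# Gap between the transformed value and the untransformed risk.
gaps = [math.log2(1 + r) - r for r in grid]

max_gap = max(gaps)
worst = grid[gaps.index(max_gap)]
print(f"largest gap: {max_gap:.3f} at risk = {worst:.3f}")
```

The gap is zero at both ends of the scale and stays below 0.09 everywhere, peaking in the mid-range, so a map colored by log2(risk+1) can hardly differ from one colored by risk itself.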


[Original answer]

This question seems to say, in part: "I have complex data (an index on species extinction risk) and I want to find an effective and intuitive graphic to represent the richness of my data."

I suspect that the most effective visualization would be to create multiple plots, at different levels of detail (e.g., most threatened species, least threatened, somewhat in-between).

For inspiration I suggest the following references:

[1.] J. Schwabish. Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks. Columbia University Press, 2021.
[2.] Our World in Data. A quick search through the site came up with an article on Extinctions.
