Proportion Data Transformation – Beyond Arcsin Square Root

data transformation, generalized linear model, heteroscedasticity

Is there a (stronger?) alternative to the arcsin square root transformation for percentage/proportion data? In the data set I'm working on at the moment, marked heteroscedasticity remains after I apply this transformation, i.e. the plot of residuals vs. fitted values is still very much rhomboid.

Edited to respond to comments: the data are investment decisions by experimental participants who may invest 0-100% of an endowment in multiples of 10%. I have also looked at these data using ordinal logistic regression, but would like to see what a valid GLM would produce. Plus, I could see the answer being useful for future work, as arcsin square root seems to be used as a one-size-fits-all solution in my field and I haven't come across any alternatives being employed.

Best Answer

Sure. John Tukey describes a family of (increasing, one-to-one) transformations in his book Exploratory Data Analysis (EDA). It is based on these ideas:

  1. To be able to extend the tails (towards 0 and 1) as controlled by a parameter.

  2. Nevertheless, to match the original (untransformed) values near the middle ($1/2$), which makes the transformation easier to interpret.

  3. To make the re-expression symmetric about $1/2.$ That is, if $p$ is re-expressed as $f(p)$, then $1-p$ will be re-expressed as $-f(p)$.

If you begin with any increasing function $g: (0,1) \to \mathbb{R}$ that is differentiable at $1/2$ with $g'(1/2) \ne 0$, you can adjust it to meet the second and third criteria: just define

$$f(p) = \frac{g(p) - g(1-p)}{2g'(1/2)}.$$

The numerator is explicitly symmetric (criterion $(3)$), because swapping $p$ with $1-p$ reverses the subtraction, thereby negating it. To see that $(2)$ is satisfied, note that the denominator is precisely the factor needed to make $f^\prime(1/2)=1.$ Recall that the derivative approximates the local behavior of a function with a linear function; a slope of $1$ (that is, $1:1$) means that $f(p)\approx p - 1/2$ when $p$ is sufficiently close to $1/2.$ This is the sense in which the original values are "matched near the middle."
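The folding recipe can be checked numerically. The following Python sketch is only an illustration, not anything taken from Tukey: the helper name `fold` and the step size `h` are assumptions, and $g'(1/2)$ is estimated with a central difference rather than computed exactly.

```python
import math

def fold(g, h=1e-6):
    """Return f(p) = (g(p) - g(1 - p)) / (2 * g'(1/2)) for a function g on (0, 1)."""
    # Central-difference estimate of g'(1/2); the step h is an arbitrary choice.
    slope = (g(0.5 + h) - g(0.5 - h)) / (2 * h)

    def f(p):
        return (g(p) - g(1 - p)) / (2 * slope)

    return f

# Two increasing functions on (0, 1), chosen only as examples.
f_log = fold(math.log)
f_sqrt = fold(math.sqrt)

for name, f in [("log", f_log), ("sqrt", f_sqrt)]:
    d = 1e-3
    # Slope of the folded function near p = 1/2, again by central difference.
    local_slope = (f(0.5 + d) - f(0.5 - d)) / (2 * d)
    print(name, f(0.5), f(0.3) + f(0.7), local_slope)
```

For each choice of $g$, the printed row should show a value of zero at $p=1/2$, a sum $f(0.3)+f(0.7)$ that is zero up to rounding (criterion $(3)$), and a local slope very close to $1$ (criterion $(2)$).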

Tukey calls this the "folded" version of $g$. His family consists of the power and log transformations $g(p) = p^\lambda$ where, when $\lambda=0$, we consider $g(p) = \log(p)$.

Let's look at some examples. When $\lambda = 1/2$ we get the folded root, or "froot," $f(p) = \sqrt{1/2}\left(\sqrt{p} - \sqrt{1-p}\right)$. When $\lambda = 0$ we have the folded logarithm, or "flog," $f(p) = (\log(p) - \log(1-p))/4.$ Evidently this is just a constant multiple of the logit transformation, $\log(\frac{p}{1-p})$.
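Both examples are instances of one closed form. It follows directly from the folding formula; the notation $f_\lambda$ is introduced here only for convenience. For $\lambda \ne 0$, $g(p) = p^\lambda$ has $g'(1/2) = \lambda\, 2^{1-\lambda}$, so

$$f_\lambda(p) = \frac{p^\lambda - (1-p)^\lambda}{2\lambda\, 2^{1-\lambda}} = \frac{2^{\lambda-2}}{\lambda}\left(p^\lambda - (1-p)^\lambda\right).$$

Setting $\lambda = 1/2$ gives the constant $2^{-3/2}/(1/2) = \sqrt{1/2}$, which is the froot. As $\lambda \to 0$, $\left(p^\lambda - (1-p)^\lambda\right)/\lambda \to \log(p) - \log(1-p)$ and $2^{\lambda-2} \to 1/4$, so the flog is the limiting case of the folded powers. This is why the logarithm is the natural member of the family at $\lambda = 0$.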

Figure: graphs for $\lambda = 1$, $1/2$, $0$, and the arcsine transformation.

In this graph the blue line corresponds to $\lambda=1$, the intermediate red line to $\lambda=1/2$, and the extreme green line to $\lambda=0$. The dashed gold line is the arcsine transformation, $\arcsin(2p-1)/2 = \arcsin(\sqrt{p}) - \arcsin(\sqrt{1/2})$. The "matching" of slopes (criterion $(2)$) causes all the graphs to coincide near $p=1/2.$
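The equality of the two arcsine forms, and the fact that this transformation also has slope $1$ at $p=1/2$, can be confirmed numerically. This is a minimal Python sketch; the function names are illustrative, and the grid of multiples of 10% is borrowed from the data described in the question.

```python
import math

def arcsine_centered(p):
    """arcsin(2p - 1) / 2."""
    return math.asin(2 * p - 1) / 2

def arcsine_root_shifted(p):
    """arcsin(sqrt(p)) - arcsin(sqrt(1/2))."""
    return math.asin(math.sqrt(p)) - math.asin(math.sqrt(0.5))

# Proportions 0, 0.1, ..., 1.0.
grid = [i / 10 for i in range(11)]
max_diff = max(abs(arcsine_centered(p) - arcsine_root_shifted(p)) for p in grid)

# Central-difference slope at p = 1/2; the step d is an arbitrary choice.
d = 1e-4
slope_at_half = (arcsine_centered(0.5 + d) - arcsine_centered(0.5 - d)) / (2 * d)

print("largest difference between the two forms:", max_diff)
print("slope at p = 1/2:", slope_at_half)
```

The largest difference should be at the level of floating-point rounding, and the slope should be very close to $1$, consistent with the arcsine curve coinciding with the others near the middle of the graph.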

The most useful values of the parameter $\lambda$ lie between $0$ and $1$. (You can make the tails even heavier with negative values of $\lambda$, but this use is rare.) Setting $\lambda=1$ does nothing except recenter the values ($f(p) = p-1/2$). As $\lambda$ shrinks towards zero, the tails get pulled further towards $\pm \infty$. This satisfies criterion $(1)$. Thus, by choosing an appropriate value of $\lambda$, you can control the "strength" of this re-expression in the tails.
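To see the effect of $\lambda$ on the tails, the family can be evaluated at a few proportions. The sketch below is a hedged illustration in Python: the function name `folded_power` and the chosen values of $\lambda$ and $p$ are assumptions made for the example, and the scale factor is $2g'(1/2)$ worked out by hand for $g(p)=p^\lambda$.

```python
import math

def folded_power(p, lam):
    """Folded power transform of a proportion p; lam = 0 means the folded logarithm."""
    if lam == 0:
        # Not defined at p = 0 or p = 1 (math.log raises ValueError there).
        return (math.log(p) - math.log(1 - p)) / 4
    # 2 * g'(1/2) for g(p) = p**lam.
    scale = 2 * lam * 0.5 ** (lam - 1)
    return (p ** lam - (1 - p) ** lam) / scale

lambdas = [1, 0.5, 0, -0.5]
points = [0.5, 0.6, 0.9, 0.99]
for lam in lambdas:
    row = [round(folded_power(p, lam), 3) for p in points]
    print(f"lambda = {lam:>4}: {row}")
```

Every row should start at zero and stay close to $0.1$ at $p=0.6$, reflecting the matched slopes near the middle, while the value at $p=0.99$ grows as $\lambda$ decreases. Note that for $\lambda \le 0$ the transform is not finite at exactly $0$ or $1$, so proportions at those endpoints would need separate handling.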