Solved – Data transformations to increase variance

data transformation

Here's a brief snippet of my data (the first two columns are measured spectral data attributes, and the latter columns are indexes calculated from the first two):

enter image description here

The issue I'm running into is that the variance of the two measured attributes are drastically different (first column StdDev is <.5% of mean, second column StdDev is about 9% of mean). Consequently, when I calculate any index attributes based on these values, the result is essentially completely determined by the second column–the first column is almost a constant.

There are common techniques for reducing high variance in data (log or sqrt transformations…) but I'm not sure what's a legitimate approach to increasing variance in an attribute? This must be a common issue though, right? I thought about standardizing the attributes, but that results in nonsense when I calculate indexes, because if the denominator is an "average" data point, the Z-score is near zero, resulting in huge values.

Other things I've considered, but am not sure about the mathematical legitimacy of:

  1. Using a negative log transformation, or a negative power transformation. This seems mathematically reasonable, but I haven't seen it done.

  2. Simply subtracting a constant from the column with low variance. For example, the attribute I'm concerned about basically has only values between 246 and 248.I could just subtract 240 from all the values, and that would dramatically increase the variance. But it also would be problematic from a mathematical perspective, right? This data indicates the strength of an electrical signal from a sensor instrument. It is quantitatively connected to a real phenomena in the real world (spectral properties of tree needles) so the values are meaningful, in the sense that a 1% increase in the value = a 1% increase in signal from the sensor. So, if I remove a constant to increase the variance, I think I'd be loosing important quantitative information?

That's all I've come up with so far. Anybody have mathematical support (or criticism) for either of these suggestions? Or another approach for dealing with this issue?

Best Answer

First, since these are physical values with real meaning then you may want the behavior you are trying to eliminate. If all the items are essentially the same on NIR then why should it play much of a role in creating an index?

I think this is why our approach of standardizing the scores yields nonsense.

Second, if you decide that you do want them both to contribute equally, you'd have to say exactly what you mean by "contribute equally". Do you mean they should have equal variance? If you square the values in the first column, the sd will increase, but not by much as a proportion of the mean:

set.seed(12345)

x <- rnorm(1000, 200, 1)
sd(x)/mean(x)  #0.004
x2 <- x*x
sd(x2)/mean(x2)  #0.009

so you'd have to make some absurd transformation to get the proportion the same as the other variable.