Solved – Data transformations to increase variance


Here's a brief snippet of my data (the first two columns are measured spectral data attributes, and the latter columns are indexes calculated from the first two):

The issue I'm running into is that the variances of the two measured attributes are drastically different (the first column's StdDev is <0.5% of its mean, the second column's StdDev is about 9% of its mean). Consequently, when I calculate any index attributes based on these values, the result is almost completely determined by the second column; the first column is almost a constant.

There are common techniques for reducing high variance in data (log or sqrt transformations…), but I'm not sure what a legitimate approach to increasing the variance of an attribute would be. This must be a common issue though, right? I thought about standardizing the attributes, but that results in nonsense when I calculate indexes: if the denominator is an "average" data point, its Z-score is near zero, resulting in huge values.

Other things I've considered, but am not sure about the mathematical legitimacy of:

Using a negative log transformation, or a negative power transformation. This seems mathematically reasonable, but I haven't seen it done.
Simply subtracting a constant from the column with low variance. For example, the attribute I'm concerned about basically has only values between 246 and 248. I could just subtract 240 from all the values, and that would dramatically increase the variance. But it also would be problematic from a mathematical perspective, right? This data indicates the strength of an electrical signal from a sensor instrument. It is quantitatively connected to a real phenomenon in the real world (spectral properties of tree needles), so the values are meaningful, in the sense that a 1% increase in the value = a 1% increase in signal from the sensor. So, if I remove a constant to increase the variance, I think I'd be losing important quantitative information?

That's all I've come up with so far. Anybody have mathematical support (or criticism) for either of these suggestions? Or another approach for dealing with this issue?

Best Answer

First, since these are physical values with real meaning, you may want the behavior you are trying to eliminate. If all the items are essentially the same on NIR, why should it play much of a role in creating an index?

I think this is why your approach of standardizing the scores yields nonsense.
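To make the standardization problem concrete, here is a minimal sketch with simulated data. The two columns `a` and `b` and the plain ratio index are hypothetical stand-ins for the measured attributes (the real data and index formulas are not shown in the question), so treat this as an illustration only:

```r
set.seed(1)

# Hypothetical stand-ins for the two measured columns
a <- rnorm(1000, mean = 247, sd = 1)   # sd is well under 1% of the mean
b <- rnorm(1000, mean = 100, sd = 9)   # sd is about 9% of the mean

# A simple ratio index on the raw values
ratio_raw <- b / a

# z-scores are centered on zero, so dividing by them is unstable
za <- as.numeric(scale(a))
zb <- as.numeric(scale(b))
ratio_z <- zb / za

range(ratio_raw)    # narrow range
range(ratio_z)      # extreme values wherever za is close to zero
cor(ratio_raw, b)   # the raw index tracks the high-variance column closely
```

The raw ratio stays in a narrow band and is almost a rescaled copy of `b`, while the ratio of z-scores blows up whenever an "average" value lands in the denominator.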

Second, if you decide that you do want them both to contribute equally, you'd have to say exactly what you mean by "contribute equally". Do you mean they should have equal variance? If you square the values in the first column, the sd will increase, but not by much as a proportion of the mean:

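set.seed(12345)

# Simulate a low-variance attribute: mean 200, sd 1
x <- rnorm(1000, 200, 1)
sd(x) / mean(x)    # 0.004 (sd as a proportion of the mean)

# Square the values and recompute the same proportion
x2 <- x * x
sd(x2) / mean(x2)  # 0.009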

so you'd have to make some absurd transformation to get the proportion the same as the other variable.
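The subtract-a-constant idea from the question can be checked the same way. This is a sketch on simulated values (the means, sds, and the ratio index are assumptions, not the real data): shifting leaves the standard deviation untouched and only changes the sd relative to the mean, and a ratio index built on the shifted column is no longer the same index.

```r
set.seed(12345)

# Hypothetical low-variance column, mostly between 246 and 248
x <- rnorm(1000, 247, 0.5)
x_shift <- x - 240

sd(x)
sd(x_shift)                   # same spread: a shift does not change the sd

sd(x) / mean(x)
sd(x_shift) / mean(x_shift)   # larger only because the mean shrank

# A ratio index is not invariant to the shift
y <- rnorm(1000, 100, 9)      # hypothetical second column
cor(y / x, y / x_shift)
```

So whether the shift "increases the variance" depends on which measure of spread you mean, which is the point about defining "contribute equally" first.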

Related Solutions

Solved – Standardization of compositional data in PCA versus using real data

First, whether you use the covariance matrix, or the correlation matrix (equivalent to standardizing each variable before carrying out PCA on the covariance matrix), or transform the data in any other way before carrying out PCA, the results of the PCA apply to that transformed data. So you should not be surprised to see different eigensystems using different transformations; any interpretations you make may of course be different, but they are not conflicting. If they seem to conflict, you must be misinterpreting them.
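A small sketch of the covariance-versus-correlation point, using simulated variables on deliberately different scales (the data frame `d` and its columns are hypothetical):

```r
set.seed(42)

# Hypothetical data: three independent variables on very different scales
d <- data.frame(
  a = rnorm(200, 0, 1),
  b = rnorm(200, 0, 10),
  c = rnorm(200, 0, 100)
)

pca_cov <- prcomp(d, center = TRUE, scale. = FALSE)  # PCA on the covariance matrix
pca_cor <- prcomp(d, center = TRUE, scale. = TRUE)   # PCA on the correlation matrix

round(pca_cov$rotation, 2)   # first component is dominated by the large-scale variable
round(pca_cor$rotation, 2)   # loadings after standardizing each variable

# Component variances: original units versus standardized units
pca_cov$sdev^2
pca_cor$sdev^2
```

The two eigensystems differ, but each one describes the data it was actually given, so there is nothing to reconcile between them.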

Second, whether it's more meaningful to express each variable as a fraction of the sum of variables for each individual is for you to decide, before thinking about principal components. If it is more meaningful, PCA on the data thus transformed may not be what you want: any one variable is expressible in terms of the other two, which are still constrained not to exceed unity in total. A scatterplot would be an obvious method to look at three variables, using barycentric co-ordinates if you like. If you still need PCA for something, Aitchison (1983), Biometrika 70 (1) discusses the issues and gives useful transformations to use for vectors of proportions, and you may be interested in the R packages compositions and robCompositions.
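As an illustration of what such a transformation can look like, here is a base-R sketch of the centered log-ratio, one commonly used log-ratio transformation for proportions. I am assuming this is the kind of transformation meant; the data are simulated, and the packages mentioned above provide their own implementations.

```r
set.seed(7)

# Hypothetical three-part compositions: positive parts, each row sums to 1
raw <- matrix(rexp(300), ncol = 3, dimnames = list(NULL, c("p1", "p2", "p3")))
comp <- raw / rowSums(raw)

# Centered log-ratio: log of each part minus the row mean of the logs
clr <- log(comp) - rowMeans(log(comp))

# PCA on the transformed data
pca_clr <- prcomp(clr, center = TRUE, scale. = FALSE)
pca_clr$sdev   # the last value is numerically zero: clr rows sum to zero
```

The vanishing last component reflects the constraint described above: with three proportions there are only two free dimensions.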

Solved – What to do with non-normality and heterogeneous variances in two-way ANOVA when transformations do not work

Thanks for posting the data. Posting shows that the box plots concealed, although not intentionally, the sample sizes and important detail too. Whenever I see skewness on a positive response, my first instinct is to reach for logarithms, as they so often work well. Here, however, logarithms drastically over-transform, and plotting everything shows up a small surprise, namely that the two lowest values need care and attention.

The graph here is a quantile-box plot in which the original data points are plotted in order on scales consistent with the box idea (i.e. about half the points are inside the box and about half outside, the "about" being a side-effect of sample sizes like 11).

A more cautious square root transformation seems about right.

Personally I regard preliminary tests for normality and so forth as over-rated stuff left over from the 1960s. I feel far too queasy about forking paths of the form: pass the test OK, fail the test do something quite different, particularly with small sample sizes. Once you have a scale on which you have approximate symmetry and approximate equality of variances, linear models will work well.

Similarly, skewness and kurtosis from small samples can hardly be trusted. (Actually, skewness and kurtosis from large samples can hardly be trusted.) For some of the reasons see e.g. this paper

Indeed, some fits with generalised linear models with cohort and gender as indicator predictor variables show that results seem consistent over identity, root and log links, even despite the evidence of the first graph. If this were my problem I would push forward with a square root link function. In other words, although transformations are informative about the best scale to work on, you let the link function of a generalised linear model do the work.

Campaign slogan: Conventional box plots with a few groups leave out detail that could easily be interesting or useful and don't make full use of the space available. Use graphs that show more!

EDIT:

Here is token output: predicted values using generalised linear model, root link, normal family, interaction between cohort and females:

  +--------------------------------------+
  | cohort   females   predicted   Freq. |
  |--------------------------------------|
  |      1     males       2.056      12 |
  |      1   females       5.024      12 |
  |      2     males      12.712      11 |
  |      2   females      15.348      11 |
  +--------------------------------------+
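For anyone working in R, a rough sketch of the same kind of comparison follows. Everything here is an assumption for illustration: the data are simulated, the variable names `cohort`, `sex`, and `y` are hypothetical, and `quasi()` with constant variance is used to obtain a normal-type fit with a square root link, because R's `gaussian()` family does not accept that link by name.

```r
set.seed(2024)

# Hypothetical data: two cohorts by two sexes, 12 observations per cell
d <- expand.grid(rep = 1:12, cohort = factor(1:2),
                 sex = factor(c("male", "female")))
mu <- c(2, 13)[d$cohort] + ifelse(d$sex == "female", 3, 0)
d$y <- (sqrt(mu) + rnorm(nrow(d), 0, 0.3))^2   # positive, right-skewed response

# Fit the same interaction model under a given link, constant variance
fit_link <- function(link_name) {
  glm(y ~ cohort * sex, data = d,
      family = quasi(link = link_name, variance = "constant"))
}
fits <- lapply(c(identity = "identity", sqrt = "sqrt", log = "log"), fit_link)

# Predicted cell means on the response scale under each link
cells <- expand.grid(cohort = factor(1:2), sex = factor(c("male", "female")))
preds <- sapply(fits, predict, newdata = cells, type = "response")
cbind(cells, round(preds, 3))
```

In this sketch the model with the interaction has one parameter per cell, so the predicted cell means agree across links; the choice of link would matter more for a model without the interaction, or for anything computed on the link scale.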