Solved – Non-normally distributed data – Box-Cox transformation

count-data, data-transformation, repeated-measures, spss

I have data that is not normally distributed. The problem seems to be that there are too many of one value relative to other values.

What I have tried to make data normal:

  • I have tried a log transformation after adding 1 to all values (the most frequent value is 0, and I understand one cannot take the log of 0), but this did not fix the data and did not change the Shapiro-Wilk significance from p < .001 even slightly (a sketch of this kind of check follows this list).
  • I also read that ANOVA does not require the dependent variable itself to be normally distributed, only the residuals. However, the residuals were still significantly non-normal according to another Shapiro-Wilk test.
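
To make the checks above concrete, here is a minimal sketch of the log(x + 1) transformation followed by a Shapiro-Wilk test (Python with NumPy/SciPy; I actually worked in SPSS, and the counts below are placeholder values rather than my data):

```python
import numpy as np
from scipy import stats

# Placeholder counts of intrusions (0-4), NOT the real data
counts = np.array([0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0,
                   0, 4, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0])

# log(x + 1) transformation, since the raw counts contain zeros
log_counts = np.log1p(counts)

# Shapiro-Wilk test of normality on the transformed values
stat, p = stats.shapiro(log_counts)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.4f}")  # still far from normal
```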

Could a Box-Cox transformation make data normally distributed?

  • One source (page 27) suggests a Box-Cox transformation as another possible solution when a log transformation has not worked. The problem, however, is that as far as I am aware no Box-Cox function is available in SPSS.
  • I have not found working SPSS syntax for this transformation; I am not familiar with R, which I understand can perform a Box-Cox transformation, and with a deadline to submit my report I do not have time to learn it; and I am not in a position to purchase Excel statistical add-ons.

The data

  • Histograms of the data are shown below to give a good idea of what it looks like.

[Histograms: Low Information Load Intrusions Condition; High Information Load Intrusions Condition]

  • It is a repeated-measures design in which each participant completed both the low and high load conditions referred to in the histograms.
  • As predicted by our hypothesis, the high load condition should lead to none or very few "images" (i.e., intrusions), and the data support this. It therefore seems strange that data expected to behave in a certain way are problematic in terms of their normality, since in some cases they are expected not to be normal.
  • Perhaps I am misunderstanding something or have carried out an incorrect test; if so, it would be good to know.

Why not do nonparametric analyses?

  • I prefer to carry out parametric tests, which is why I want to make the data normally distributed; this may be an incorrect view, but from reading many articles I very rarely see papers that use nonparametric analyses when reporting data.
  • Only if it turns out that the data cannot be made suitable for parametric analysis will I resort to a nonparametric test; first I want to try the available options for transforming the data.

Questions

1) Is it a good idea to use a Box-Cox transformation to try to fix the data, or are there better ideas that I haven't tried?

2) If a Box-Cox transformation is in fact a good idea to try out, are there any suggestions of how I could do this (e.g., SPSS syntax, an Excel formula, etc.)?
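
(For reference, a sketch of what a Box-Cox transformation can look like outside SPSS, in Python with SciPy; the counts below are placeholders rather than my data. Box-Cox needs strictly positive values, so the counts are shifted by 1; the related Yeo-Johnson transformation accepts zeros directly.)

```python
import numpy as np
from scipy import stats

counts = np.array([0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0, 0])  # placeholder counts with zeros

# Box-Cox requires strictly positive data, so shift by 1 first;
# lambda is chosen by maximum likelihood when not supplied
bc_values, bc_lambda = stats.boxcox(counts + 1)

# Yeo-Johnson is a related transformation that handles zeros directly
yj_values, yj_lambda = stats.yeojohnson(counts)

print("Box-Cox lambda:", bc_lambda)
print("Yeo-Johnson lambda:", yj_lambda)
```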

Information from answers:

  • This is count data
  • It has a Poisson distribution and so an analysis with such a distribution would be ideal
  • There are many zeroes in the data with a small range (i.e., values from 0 to 4)
  • Good links are provided in the comments below by Scortchi for more information about transforming non-negative data with zeros, a test for significant difference between two repeated measures count variables, regression models for count data when counts are often zero, and using Poisson Repeated Measures ANOVA.
  • The Poisson RM ANOVA link has very good information on the idea and theory behind using such a distribution for count data with many zeroes (a rough sketch of such a model follows this list).
  • As my next step was to get an understanding of the specific area of carrying out a Poisson distribution analysis in SPSS for repeated measures data, I have asked a new question here.
  • Although the aim will be to carry out an analysis using a Poisson distribution, according to one source (half way through page, under Reason 6: Data Follows a Different Distribution) it seems that the data will still remain nonparametric.
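
As I understand it, in software that supports generalized linear models this kind of Poisson repeated-measures analysis might look something like the sketch below (Python with statsmodels; the data frame layout, variable names, and values are hypothetical, not my data):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per participant per load condition
df = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "load":        ["low", "high"] * 6,
    "intrusions":  [3, 0, 2, 1, 4, 0, 1, 0, 2, 0, 3, 1],
})

# Poisson GEE: handles the repeated measures by treating observations
# from the same participant as correlated
model = smf.gee("intrusions ~ load", groups="participant", data=df,
                family=sm.families.Poisson(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())  # exp(coefficient for load) is the estimated rate ratio
```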

(If, however, I should not have asked a separate but related question, please let me know and I will change this; I'm new, so I'm still getting a feel for the conventions for posting here, but I'm happy to learn and get feedback on how I post to make it clearer for all.)

Best Answer

The data are highly skewed & take just a few discrete values: the within-pair differences must consist of predominantly noughts & ones; no transformation will make them look much like normal variates. This is typical of count data where counts are fairly low.

If you assume that counts for each individual $j$ follow a different Poisson distribution, & that the change from low to high load condition has the same multiplicative effect on the rate parameter of each, you can extend the idea in significance of difference between two counts to a matched-pair design by conditioning on the total count for each pair, $n_j$:

$$ \sum_{j=1}^m X_{1j} \sim \mathrm{Bin} \left(\sum_{j=1}^m n_j,\ \theta\right)$$

where $m$ is the number of pairs. So the analysis reduces to inference about the Bernoulli parameter $\theta$ in a binomial experiment: 7 "successes" out of 24 trials, if I read your graphs right.
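
For illustration only (the arithmetic is simple enough to do by hand, and this is not part of the analysis itself), an exact binomial test of $\theta = 0.5$ on those figures takes a couple of lines in Python with SciPy (1.7 or later for binomtest):

```python
from scipy.stats import binomtest

# 7 "successes" out of 24 trials, as read from the graphs; under no effect
# of load each count would fall in either condition with theta = 0.5
result = binomtest(k=7, n=24, p=0.5, alternative="two-sided")
print("exact p-value:", result.pvalue)
print("95% CI for theta:", result.proportion_ci(confidence_level=0.95))
```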

Check the homogeneity of proportions across pairs, and note that if they're too homogeneous it might indicate under-dispersion (relative to a Poisson) of the original count variables.

Note that this approach is equivalent to the generalized linear model suggested for Poisson Repeated Measures ANOVA: while it tells you nothing about the nuisance parameters, point & interval estimates for the parameter of interest can be worked out on the back of a fag packet† (so you don't need to worry about software requirements).

† Parametrize your model with the log odds $\zeta=\log_\mathrm{e} \frac{\theta}{1-\theta}$: then the maximum-likelihood estimator is $$\hat\zeta=\log_\mathrm{e}\frac{\sum x_{1j}}{\sum n_j - \sum x_{1j}}=\log_\mathrm{e}\frac{7}{24-7}\approx -0.887$$ with standard error $$\sqrt\frac{\sum n_j}{\sum x_{1j}(\sum n_j-\sum x_{1j})}=\sqrt\frac{24}{7\cdot(24-7)}\approx 0.449$$ for Wald tests & confidence intervals. If you want to adjust for over-/under-dispersion (i.e. use "quasi-Poisson" regression), estimate the dispersion parameter as Pearson's chi-squared statistic (for association) divided by its degrees of freedom (22) & multiply the standard error by its square root.
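
A quick numeric check of those figures, sketched in Python purely to reproduce the arithmetic above (only the standard library is needed):

```python
import math

x1, n = 7, 24  # "successes" & total trials from the graphs
zeta_hat = math.log(x1 / (n - x1))    # MLE of the log odds, about -0.887
se = math.sqrt(n / (x1 * (n - x1)))   # Wald standard error, about 0.449

# 95% Wald confidence interval on the log-odds scale, then back-transformed;
# theta/(1-theta) is the ratio of the two Poisson rates under the model above
lo, hi = zeta_hat - 1.96 * se, zeta_hat + 1.96 * se
print(f"zeta_hat = {zeta_hat:.3f}, se = {se:.3f}")
print(f"95% CI for the rate ratio: ({math.exp(lo):.2f}, {math.exp(hi):.2f})")
```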