Crisis: ANOVA? Or: how to analyse non-normal, heteroscedastic data with unequal group sizes

anova, heteroscedasticity, normal distribution

I'm somewhat new to the world of statistics, or at least it has been years since I last used it, and the only program I broadly know how to work with is SPSS. However, for my Master's thesis I need to analyse a dataset I made, so there's no escaping it now.

My research is about whether there is a difference in the provision of expertise to the European Commission between interest groups that represent diffuse interests and those that represent business interests. I gather my data from written submissions to the Commission. I have created two indicators (my dependent variables) for expertise: the number of Total References (compiled from five subsections based on the type of reference made) and the number of words in the document dedicated to 'Causal Stories'. I also have a number of control variables. In total I have coded 120 written submissions. I'm using SPSS.

What I have is the following data:
Independent Variable:
Interest group (coded as 0 for business and 1 for diffuse interest groups)

Dependent Variables:
The number of References in the document (when adding up all the subsections, the Total References range from 0 to 93) and the number of words dedicated to telling a Causal Story (ranging from 0 to 3002).

I want to control for certain things, namely:
The number of Staff (ranging from 1 to 1500 persons)
The amount of Money they spend on lobbying (which I recoded into categories 1-35, where 1 is spending €50.000 and 35 is spending over 5 million Euro)
And interest in various Issue Areas (coded as no/yes, 0/1)

I was planning on using AN(C)OVA however…

According to the Kolmogorov-Smirnov and Shapiro-Wilk tests, neither References nor Causal Stories is normally distributed. I've tried log-transforming them, but although the curve does look more bell-shaped in the case of Causal Stories, both variables still come out with a significance value of .000. Additionally, taking logs caused the observations that score 0 to suddenly go System Missing, while they should remain in the dataset; otherwise my number of observations for the References DV goes down. I know I can prevent this by either adding 1 to every value before logging or recoding the System Missing values back into 0's, but I have no idea whether this is 'legal' statistically.

According to Levene's Test, my data (References and Causal Stories) also violate the assumption of homogeneity of variance.

My statistics book tells me that the latter might not cause a problem for ANOVA as long as the group sizes are equal. However, mine are not: I have 30 diffuse interest groups and 90 business groups. According to the book (Field, Discovering Statistics) I can still use Welch's F in this case, which I found is not significant for either References (though it is for some subsections of References) or Causal Stories. However, this does not include any of the control variables.

I have looked at other tests like the Kruskal-Wallis test which is nonparametric, but it doesn't allow me to work with more than one explanatory variable, so I can't measure the effect of control variables such as Lobbying Expenditure or Staff.

So I'm somewhat at a loss here. I asked my thesis supervisor to help me, but I'm not sure that what he is doing is any better than what I've tried so far. He logged the dependent variables and, without checking whether there was an improvement, ran a regression in Stata with only 46 observations (for one of the subsections of my DV References, a lot of interest groups scored zero, which became System Missing) and said these were acceptable results. Now, I'm not an expert on statistics and I've never worked with Stata, but I have a feeling there is a lot wrong with those results.

I hope this has made sense. I've checked this site multiple times the past week in order to try to solve the problems myself, but so far I haven't been able to. Any help would be greatly appreciated!

Justin

Best Answer

Your dependent variables (number of references or words in the causal story) seem to be counts, which would call for, e.g., Poisson or negative binomial regression. See "Why is Poisson regression used for count data?" for a bit more background. As an alternative to these regression methods, you could use some sort of transformation.

Using a log(references + 1) transformation is legitimate in the sense that it is at least often used (and sometimes even defensibly so, as in the case of a log-normally distributed dependent variable). As an alternative to the log you might also consider a square root transformation, sqrt(references). However, transformations often perform poorly compared to Poisson or negative binomial regression.
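To make the zero problem concrete, here is a small Python sketch (the counts are made up for illustration, not taken from the question's dataset). It shows that a plain log drops every zero, which is what produces the System Missing values, while log(x + 1) and sqrt(x) keep all observations.

```python
import math

# Made-up reference counts, including zeros (illustration only).
references = [0, 0, 1, 3, 7, 12, 40, 93]

def safe_log(x):
    """Plain natural log; returns None for 0, mirroring a 'system missing' value."""
    return math.log(x) if x > 0 else None

plain_log = [safe_log(x) for x in references]
log_plus_one = [math.log1p(x) for x in references]  # log(x + 1), defined at 0
square_root = [math.sqrt(x) for x in references]    # also defined at 0

# Number of observations lost to the plain log.
n_missing = sum(value is None for value in plain_log)
print("lost to plain log:", n_missing)
print("kept by log(x + 1):", len(log_plus_one))
```

Recoding the missing values back to 0 after a plain log would not be equivalent: it would place the zeros at the same transformed value as a count of 1, since log(1) = 0.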

Poisson and negative binomial regression are available in SPSS, but they require at least the advanced statistics add-on. See http://www.ats.ucla.edu/stat/dae/ for examples of the analyses.
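For readers who want to see what such a model estimates, below is a minimal, dependency-free Python sketch of Poisson regression with a log link, fitted by Newton-Raphson. Everything in it is an assumption for illustration: the toy data, the choice of log(staff) as the control variable, and the helper names `solve` and `fit_poisson`. In practice you would use the routine in your statistics package rather than this hand-rolled version.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_poisson(X, y, iterations=50, tol=1e-10):
    """Newton-Raphson for a Poisson GLM with log link.

    Assumes the first column of X is the intercept (all ones).
    """
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start at the log of the overall mean
    for _ in range(iterations):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy data (made up): group 0 = business, 1 = diffuse.
group = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
staff = [2, 5, 10, 40, 100, 300, 3, 8, 20, 60]
references = [0, 1, 3, 4, 9, 15, 2, 5, 6, 14]

# Design matrix: intercept, group indicator, log(staff) as a control.
X = [[1.0, float(g), math.log(s)] for g, s in zip(group, staff)]
beta = fit_poisson(X, references)
fitted = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]

# exp(coefficient) is a multiplicative effect on the expected count.
rate_ratio = math.exp(beta[1])
print("coefficients:", beta)
print("rate ratio, diffuse vs business, holding staff fixed:", rate_ratio)
```

Note that the zeros stay in the model as ordinary observations, so no cases are lost, and the group effect is estimated while controlling for the covariate.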

A word of warning, though. There are potential problems in count data analysis that you might encounter, the most prominent probably being over-dispersion and an excess of zeros in the dependent variables.
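A rough first screen for both problems can be done by hand. The Python sketch below uses made-up counts and compares the sample variance with the mean (under a Poisson distribution they are equal) and the observed share of zeros with the share a Poisson distribution with the same mean would predict. It ignores the predictors, so treat it as a quick look rather than a formal test; in a fitted regression, dispersion is usually judged conditional on the covariates.

```python
import math
import statistics

# Made-up counts with many zeros and a long right tail (illustration only).
counts = [0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 13, 40]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 denominator)

# Under a Poisson distribution the variance equals the mean, so a ratio
# well above 1 hints at over-dispersion.
dispersion_ratio = variance / mean

# Share of zeros observed vs. P(Y = 0) = exp(-mean) under a Poisson
# distribution with the same mean.
observed_zero_share = counts.count(0) / len(counts)
poisson_zero_share = math.exp(-mean)

print("variance / mean:", dispersion_ratio)
print("observed share of zeros:", observed_zero_share)
print("Poisson-implied share of zeros:", poisson_zero_share)
```

If the ratio is far above 1, negative binomial regression is the usual next thing to try, since it adds a parameter for the extra variance.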
