Solved – Categorical variables and continuous response

categorical data

Within three distinct groups (three soil types), I have three categorical predictor variables (pH, ionic strength, and SAR). My response variable is continuous (% optical transmission). What statistical analysis do I need to check for interactions between the predictor variables and significant differences in the response? I also want to compare between soil types, and how the effects of the predictor variables on the response variable vary among soil types.
Thanks.

Best Answer

Sounds like a simple case for multiple regression. The comment is correct: the predictors you mention are only categorical if you've discretized them for whatever reason. If you have access to the un-discretized data, you might consider some semi-parametric estimators.

One complication that you might face is that your response is undefined above 1 or below 0, since it is a proportion. I know of three ways of dealing with this:

  1. Just run an OLS regression $y=\alpha+X'\beta + \epsilon$, where $X$ is a matrix of your three variables and any multiplicative interaction terms that you deem important (e.g. pH $\times$ SAR). Check whether any of your predicted values $\hat{y}$ come close to or exceed 0 or 1, and whether the predictions plus or minus their standard errors do. If not, you can probably get away with just running OLS, even though you are violating the OLS assumption of a normally-distributed error term. Note, though, that the even more important assumption of a linear relationship might not make physical sense.

  2. GLM: use a logit link function. The coefficients then estimate the marginal change in the dependent variable on the logit (log-odds) scale. The advantage is that the predicted values cannot fall outside the physically possible range, although that in itself does not guarantee that this is the best model. The details here require care, but see http://www.stata-journal.com/sjpdf.html?articlenum=st0147 for a concise introduction.

  3. Beta regression. The beta distribution is a flexible family of distributions defined between zero and one; its density is occasionally symmetric but usually skewed. I haven't used this myself, but it is designed for problems like the one you're describing.
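
To make option 1 concrete, here is a minimal Python sketch (standard library only) of OLS with dummy-coded predictors and one interaction term, followed by the check on fitted values. Everything in it is an assumption for illustration: the two-level coding of pH, ionic strength, and SAR, the made-up transmission values, and the helper names `solve`, `ols`, and `design_row`. With real data you would use a regression routine from a statistics package rather than hand-rolled normal equations.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y.

    Adequate for a tiny, well-conditioned design like this one.
    """
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def design_row(ph, ion, sar):
    """Intercept, three dummy-coded main effects, and the pH x SAR interaction."""
    return [1.0, ph, ion, sar, ph * sar]


# Hypothetical data: one transmission value (as a proportion) per
# (pH, ionic strength, SAR) cell, each factor coded 0 = low, 1 = high.
transmission = {
    (0, 0, 0): 0.92, (0, 0, 1): 0.85, (0, 1, 0): 0.80, (0, 1, 1): 0.70,
    (1, 0, 0): 0.75, (1, 0, 1): 0.40, (1, 1, 0): 0.62, (1, 1, 1): 0.25,
}

cells = sorted(transmission)
X = [design_row(*cell) for cell in cells]
y = [transmission[cell] for cell in cells]

beta = ols(X, y)
fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]

# The check from option 1: do any fitted values come close to 0 or 1?
# The 0.05 margin is an arbitrary choice for this sketch.
near_bounds = [(cell, round(f, 3)) for cell, f in zip(cells, fitted)
               if f < 0.05 or f > 0.95]

names = ["intercept", "pH", "ionic", "SAR", "pH:SAR"]
for name, b in zip(names, beta):
    print(f"{name:>9}: {b: .4f}")
print("fitted values near 0 or 1:", near_bounds)
```

The `pH:SAR` coefficient is the interaction: it measures how much the effect of SAR changes when pH moves from low to high. Comparing soil types would mean adding soil-type dummies and their products with the other predictors to `design_row` in the same way.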

Probably the best way forward is to run all three and check whether the choice of model specification changes your results. If it does, you need to pick the one that makes the most sense. If it doesn't, you're good.

If you've got the original, non-discretized data, consider semi-parametrics, such as can be found in mgcv in R. This could mitigate some of the functional form worries about running a logistic regression -- if a variable causes a linear response, the response will be non-linear on the logit scale. Allowing the functional form to be arbitrary will reduce mis-specification bias.
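
The point about scales can be checked numerically. The sketch below is plain Python; the values and the helper names `logit` and `inv_logit` are made up for illustration. It takes a response that rises linearly in a predictor, shows that its steps on the logit scale are not constant, and shows that the inverse logit maps any linear predictor back into the interval (0, 1).

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit; maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# A response that is exactly linear in x on the raw (proportion) scale.
x = [0, 1, 2, 3, 4]
p = [0.1 + 0.2 * xi for xi in x]
z = [logit(pi) for pi in p]

# Successive differences: constant on the raw scale, not on the logit scale.
steps_raw = [b - a for a, b in zip(p, p[1:])]
steps_logit = [b - a for a, b in zip(z, z[1:])]
print("raw steps:  ", [round(s, 3) for s in steps_raw])
print("logit steps:", [round(s, 3) for s in steps_logit])

# Conversely, any value of a linear predictor on the logit scale
# comes back as a proportion strictly inside (0, 1).
linear_predictor = [-6.0, -1.5, 0.0, 2.0, 6.0]
back = [inv_logit(v) for v in linear_predictor]
print("back-transformed:", [round(b, 4) for b in back])
```

So a model that is linear on the logit scale cannot reproduce a response that is linear on the raw scale, which is why a flexible functional form can help.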