GLM – Should Numerical Variables be Transformed to Categorical Variables in GLM?

Tags: biostatistics, categorical data, generalized linear model, r

I'm building a GLM with a couple of variables and I have a problem with how to organize my data. I have both numerical and categorical data, and I'm struggling with how to structure 3 of the variables.

I am building a model to see whether a toxin in fish is influenced by environmental and biological data. All biological variables such as sex, growth, condition etc. are measured per sample. However, I also want to include environmental variables that have the same value for all individuals (samples) in each population. Here is an example:

Lake   toxin     V1         V2        V3        V7  PC1           PC2           PC3
Lake1  6.642662  -24.40677  8.175274  6.626065  2   -4.391045706  -0.709131522  1.115037248
Lake1  6.237877  -23.35143  7.214446  7.598336  2   -4.391045706  -0.709131522  1.115037248
Lake2  7.131938  -25.206    9.587     4.296624  1   -3.052061784  -0.634795567  0.615332691
Lake2  7.106172  -24.677    9.998     6.047108  2   -3.052061784  -0.634795567  0.615332691
Lake2  7.634661  -25.758    10.095    8.383605  1   -3.052061784  -0.634795567  0.615332691
Lake3  8.066581  -26.906    10.433    3.988736  2   -3.104092579  0.303914076   0.271016783
Lake4  6.217926  -29.099    6.499     5.39643   2   -2.723297999  -0.068871926  -1.89359307

The table presents 7 samples from 4 lakes. I want to test whether the toxin is influenced by all the factors in the table (columns V1 to PC3). Variables V1, V2 and V3 are numerical (e.g. growth, condition). Variable V7 (sex) is categorical. All 4 biological variables are measured per sample. In addition, I have PCA scores that summarize the environmental variables: PC1 describes the climatic gradient, PC2 the catchment properties and PC3 the lake characteristics. These scores are unique per lake, so the values repeat for all samples in each population (lake). The numbers themselves don't matter; what matters is the distance between them, which describes e.g. how much warmer and more productive Lake1 is than Lake2 on a scale spanning all lakes.

The question is: can I enter PC1, PC2 and PC3 in the GLM as numeric variables (as.numeric()), even though they are not truly continuous but take a single value per lake, so the samples are grouped? Or should I convert PC1, PC2 and PC3 with as.factor()? If I treat all the PCs as factors, then the ordering matters, because the lowest value of PC1 is in Lake1, but the lowest value of PC2 is in Lake4.

Or perhaps both are wrong and there is a different approach?
I'd appreciate any help.

Best Answer

There's no problem with a "continuous" predictor taking only a fixed set of values. That happens all the time in designed experiments. What you won't be able to do is include both Lake and your PCs as predictors: once you know the Lake you also know its set of PCs, so the PC columns are exact linear combinations of the Lake indicator columns and the predictors aren't linearly independent.
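To make the linear-dependence point concrete, here is a minimal sketch using the seven example rows from the question (V2, V3 and V7 are left out only to keep the tiny example from being saturated). The Gaussian family is glm()'s default and is an assumption here, not something the question specifies.

```r
# Example rows from the question: 7 fish from 4 lakes.
d <- data.frame(
  Lake  = c("Lake1", "Lake1", "Lake2", "Lake2", "Lake2", "Lake3", "Lake4"),
  toxin = c(6.642662, 6.237877, 7.131938, 7.106172, 7.634661, 8.066581, 6.217926),
  V1    = c(-24.40677, -23.35143, -25.206, -24.677, -25.758, -26.906, -29.099),
  PC1   = c(-4.391045706, -4.391045706, -3.052061784, -3.052061784,
            -3.052061784, -3.104092579, -2.723297999),
  PC2   = c(-0.709131522, -0.709131522, -0.634795567, -0.634795567,
            -0.634795567, 0.303914076, -0.068871926),
  PC3   = c(1.115037248, 1.115037248, 0.615332691, 0.615332691,
            0.615332691, 0.271016783, -1.89359307)
)

# Lake-level PCs entered as ordinary numeric predictors:
# every coefficient is estimable even though each PC repeats within a lake.
fit_pc <- glm(toxin ~ V1 + PC1 + PC2 + PC3, data = d)
coef(fit_pc)

# Lake and the PCs together: the Lake dummies already determine each PC,
# so the PC columns are aliased and R reports NA for their coefficients.
fit_both <- glm(toxin ~ V1 + Lake + PC1 + PC2 + PC3, data = d)
coef(fit_both)

# Both designs span the same column space here, so the fitted values agree.
all.equal(fitted(fit_pc), fitted(fit_both))
```

With only four lakes, three PCs plus an intercept already use up all the between-lake degrees of freedom, which is why the two fits coincide in this toy example; with more lakes than PCs, the PC model would be the more constrained of the two.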

That said, I also agree with the interpretability issue raised in a comment by @danlooo. Principal components can be hard to interpret, although I appreciate that PCA is often used in this type of study. Consider, perhaps, a model that uses ridge regression on the environmental variables while keeping the other predictors unpenalized. That's basically a weighted principal-component regression rather than an all-or-none selection of components, and would more directly provide (penalized) coefficients for each of the environmental variables.
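A rough sketch of that idea, in base R and for the Gaussian (identity-link) case only: ridge-penalize the environmental coefficients while leaving the intercept and the fish-level predictors unpenalized. The data are simulated, and the names growth, condition, env1 to env5, the helper ridge_fit() and the value lambda = 20 are all hypothetical choices for illustration. In practice one would more likely use a package such as glmnet (its penalty.factor argument sets per-variable penalties) and choose the penalty by cross-validation.

```r
set.seed(1)
n <- 60

# Simulated stand-ins (hypothetical): 2 fish-level predictors and
# 5 correlated environmental variables sharing a common gradient z.
bio <- matrix(rnorm(n * 2), n, 2,
              dimnames = list(NULL, c("growth", "condition")))
z   <- rnorm(n)
env <- sapply(1:5, function(j) z + rnorm(n, sd = 0.5))
colnames(env) <- paste0("env", 1:5)
y <- 1 + 0.5 * bio[, "growth"] + 0.3 * z + rnorm(n)

# Standardize the environmental variables so the penalty treats them alike.
env <- scale(env)
X   <- cbind(Intercept = 1, bio, env)

# Penalty weights: 0 = unpenalized (intercept, fish-level), 1 = ridge-penalized.
pen <- c(0, rep(0, ncol(bio)), rep(1, ncol(env)))

# Solve the penalized normal equations (X'X + lambda * diag(pen)) beta = X'y.
ridge_fit <- function(X, y, lambda, pen) {
  drop(solve(crossprod(X) + lambda * diag(pen), crossprod(X, y)))
}

b_ols   <- ridge_fit(X, y, lambda = 0,  pen)  # no penalty: ordinary least squares
b_ridge <- ridge_fit(X, y, lambda = 20, pen)  # environmental coefficients shrunk
round(cbind(ols = b_ols, ridge = b_ridge), 3)
```

Each environmental variable keeps its own (shrunken) coefficient, which is what makes the result easier to read than coefficients on principal components. For a non-Gaussian GLM the same penalty would have to go inside the iterative fitting, which this sketch does not attempt.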

With respect to coding Sex (V7): if the 1/2 coding you show enters the model as a number, it can cause confusion when interpreting the intercept, which is then the predicted outcome at Sex = 0, a value no fish has. The model and its predictions will still be OK, but you and your readers might be less confused if you use 0/1 coding for dichotomous predictors, so that the intercept refers to the group coded 0.
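A small sketch of the coding issue, using the toxin and V7 values from the question's example rows. The factor labels "sex1" and "sex2" are placeholders, since the question does not say which code corresponds to which sex.

```r
d <- data.frame(
  toxin = c(6.642662, 6.237877, 7.131938, 7.106172, 7.634661, 8.066581, 6.217926),
  V7    = c(2, 2, 1, 2, 1, 2, 2)
)

# 1/2 coding entered as a number: the intercept is the prediction at V7 = 0.
fit_12 <- glm(toxin ~ V7, data = d)

# 0/1 coding: the intercept is the mean prediction for the group coded 0.
d$sex01 <- d$V7 - 1
fit_01  <- glm(toxin ~ sex01, data = d)

# Declaring a factor lets R build the 0/1 indicator itself.
d$sexF <- factor(d$V7, levels = c(1, 2), labels = c("sex1", "sex2"))
fit_f  <- glm(toxin ~ sexF, data = d)

# Same slope and same fitted values; only the intercept's meaning changes.
rbind(coding_12 = coef(fit_12), coding_01 = coef(fit_01), factor = coef(fit_f))
all.equal(fitted(fit_12), fitted(fit_01))
```

The three fits are the same model under different parameterizations: the 0/1 intercept equals the 1/2 intercept plus the slope.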