Solved – Multiple regression analysis with spatial data as independent variable

categorical data, multiple regression, nonlinear regression

In my PhD thesis I am working on spatial modeling of different chemical parameters in groundwater, and as part of that spatial modeling I am also using multiple regression.
I have a question about multiple regression analysis. (Or would it be better to use polynomial regression?)

The equation for spatial regression modeling is: $Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

For my dependent variable, I have concentrations of calcium in groundwater, which were measured at different sampling points across the entire research area. For the independent variables, I chose spatial data that influence the distribution of calcium in groundwater: lithology, vegetation, slope, climatic conditions (temperature, precipitation), depth of soil, …

The problem is that lithology and vegetation are categorical data. Lithology has 3 categories coded 1 to 3 (1 = clastic rocks, 2 = carbonate rocks, 3 = metamorphic and igneous rocks), and vegetation has 4 categories (1 = bare rocks, 2 = agricultural land, 3 = grassland, 4 = forests). All other variables are numerical and continuous.

Do you have any idea how to solve the problem with categorical data in multiple regression analysis? Might it be better to use some other method?
Best regards and thank you very much for your help.

Best Answer

It's a little unclear what your objectives are, so other methods might be preferable depending on those. Polynomial regression may suit continuous variables, but wouldn't make sense for categorical ones. You can add higher-order terms for the continuous variables alongside categorical predictors though.
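As a rough sketch of what such a combined model could look like, here is one continuous predictor $x_1$ entered with a squared term next to two lithology indicators. Choosing slope as $x_1$ and stopping at a quadratic term are assumptions made only for illustration; the $D$ variables are 0/1 dummy codes of the kind described in the next paragraph.

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 D_{\rm Carbonate} + \beta_4 D_{\rm Metamorphic/Igneous} + \varepsilon$$

The polynomial part applies only to $x_1$; squaring a 0/1 indicator would just reproduce the same indicator, which is why higher-order terms add nothing for the categorical predictors.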

Nominal predictors can be added to a multiple regression model using dummy codes. In your case, you could enter lithology as two dummy variables: using clastic rocks as the reference group (for example; not necessarily the one you want to choose), you could create one binary variable indicating whether a case involves carbonate rocks (1 if so, 0 if not), and another equivalent one for metamorphic/igneous rocks. The same process of dummy coding can work for any number of levels, and won't be too much harder to interpret for vegetation. Here's an example dataset:

$$\begin{array}{c|cccccc}
\rm Case&\rm Carbonate&\rm Metamorphic/Igneous&\rm Farm&\rm Grass&\rm Forest&...\\\hline
\small\rm Clastic\ bare&0&0&0&0&0&...\\
\small\rm Carbonate\ bare&1&0&0&0&0&...\\
\small\rm Igneous\ bare&0&1&0&0&0&...\\
\small\rm Clastic\ farm&0&0&1&0&0&...\\
\small\rm Carbonate\ farm&1&0&1&0&0&...\\
\small\rm Igneous\ forest&0&1&0&0&1&...\\
...&...&...&...&...&...&...
\end{array}$$

See how that works? (I can elaborate if not.) Just enter these binary predictors like any other. The corresponding $\beta$s represent the differences between given groups and the reference group.
E.g., $\beta_{\rm Carbonate}$ represents the expected difference in calcium concentration between carbonate and clastic cases in the above example, with the other predictors held constant.
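If it helps to see the coding step spelled out, here is a minimal Python sketch that builds the binary columns from the example dataset. The names `LITHOLOGY_LEVELS`, `VEGETATION_LEVELS`, `dummy_code` and `encode_case` are hypothetical helpers invented for this illustration, and treating the first level of each list as the reference group is the same arbitrary choice as in the table.

```python
# Category levels from the question. The first level in each list is treated
# as the reference group (an arbitrary choice, matching the example table).
LITHOLOGY_LEVELS = ["clastic", "carbonate", "metamorphic/igneous"]
VEGETATION_LEVELS = ["bare", "farm", "grass", "forest"]


def dummy_code(value, levels):
    """Return 0/1 indicators for every level except the reference (first) one."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def encode_case(lithology, vegetation):
    """Build one row of binary predictors: 2 lithology + 3 vegetation columns."""
    return dummy_code(lithology, LITHOLOGY_LEVELS) + dummy_code(
        vegetation, VEGETATION_LEVELS
    )


# The six cases shown in the example dataset.
cases = [
    ("clastic", "bare"),
    ("carbonate", "bare"),
    ("metamorphic/igneous", "bare"),
    ("clastic", "farm"),
    ("carbonate", "farm"),
    ("metamorphic/igneous", "forest"),
]

for lithology, vegetation in cases:
    # Column order: Carbonate, Metamorphic/Igneous, Farm, Grass, Forest
    print(lithology, vegetation, encode_case(lithology, vegetation))
```

A reference-group case gets all zeros, so a variable with k levels needs only k - 1 columns. Most statistics packages will do this coding for you once a variable is declared categorical, so writing it by hand is mainly useful for checking which group ended up as the reference.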