Solved – Multiple regression with categorical and numeric predictors

generalized linear model, r

I am relatively new to R, and I am trying to fit a model to data that consists of a categorical column and a numeric (integer) column. The dependent variable is a continuous number.

The data has the following format:

predCateg, predIntNum, ResponseVar

The data looks something like this:

ranking, age_in_years, wealth_indicator
category_A, 99, 1234.56
category_A, 21, 12.34
category_A, 42, 234.56
....
category_N, 105, 77.27

How would I model this (presumably, using a GLM), in R?

[[Edit]]

It has just occurred to me (after analysing the data more thoroughly), that the categorical independent variable is in fact ordered. I have therefore modified the answer provided earlier as follows:

> fit2 <- glm(wealth_indicator ~ ordered(ranking) + age_in_years, data=amort2)
> 
> fit2

Call:  glm(formula = wealth_indicator ~ ordered(ranking) + age_in_years, 
    data = amort2)

Coefficients:
      (Intercept)  ordered(ranking).L  ordered(ranking).Q  ordered(ranking).C      age_in_years  
        0.0578500         -0.0055454         -0.0013000          0.0007603          0.0036818  

Degrees of Freedom: 39 Total (i.e. Null);  35 Residual
Null Deviance:      0.004924 
Residual Deviance: 0.00012      AIC: -383.2
> 
> fit3 <- glm(wealth_indicator ~ ordered(ranking) + age_in_years + ordered(ranking)*age_in_years, data=amort2)
> fit3

Call:  glm(formula = wealth_indicator ~ ordered(ranking) + age_in_years + 
    ordered(ranking) * age_in_years, data = amort2)

Coefficients:
                    (Intercept)                ordered(ranking).L                ordered(ranking).Q  
                      0.0578500                       -0.0018932                       -0.0039667  
              ordered(ranking).C                    age_in_years  ordered(ranking).L:age_in_years  
                      0.0021019                        0.0036818                       -0.0006640  
ordered(ranking).Q:age_in_years  ordered(ranking).C:age_in_years  
                      0.0004848                       -0.0002439  

Degrees of Freedom: 39 Total (i.e. Null);  32 Residual
Null Deviance:      0.004924 
Residual Deviance: 5.931e-05    AIC: -405.4

I am a bit confused by what ordered(ranking).C, ordered(ranking).Q and ordered(ranking).L mean in the output, and would appreciate some help in understanding this output, and how to use it to predict the response variable.

Best Answer

Try this:

fit <- glm(wealth_indicator ~ 
           factor(ranking) + age_in_years + factor(ranking) * age_in_years,
           data = amort2)  # amort2 is the data frame name used in the question

The factor() function tells R to treat your variable as categorical. This matters most when your categories are coded as integers: without factor(), glm would interpret the variable as continuous and fit a single slope for it.
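To make the difference concrete, here is a minimal sketch on made-up data. The data frame toy, its values, and the integer coding 1 to 4 are assumptions for illustration only and are not the asker's data.

```r
# Made-up data: 'ranking' is a category that happens to be stored as integers 1..4.
set.seed(1)
toy <- data.frame(
  ranking      = rep(1:4, each = 10),
  age_in_years = sample(20:80, 40, replace = TRUE)
)
toy$wealth_indicator <- 5 + c(0, 3, 1, 6)[toy$ranking] +
  0.1 * toy$age_in_years + rnorm(40, sd = 0.5)

# Without factor(): ranking is treated as one continuous predictor (a single slope).
fit_numeric <- glm(wealth_indicator ~ ranking + age_in_years, data = toy)

# With factor(): one coefficient per non-reference category.
fit_factor <- glm(wealth_indicator ~ factor(ranking) + age_in_years, data = toy)

names(coef(fit_numeric))
names(coef(fit_factor))
```

The first model has a single ranking coefficient, while the second has one coefficient for each category other than the reference level. Since glm defaults to family = gaussian, both fits are ordinary least-squares regressions, the same as lm() would give.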

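The factor(ranking) * age_in_years term tells R to include the interaction. In an R formula, a * b expands to a + b + a:b, so this term already covers both main effects as well as the interaction; listing the main effects separately is harmless but redundant.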
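The sketch below checks the formula expansion on made-up data, and also touches on the ordered(ranking).L, .Q and .C terms from the question's edit. The data frame toy, the level names A to D, and new_obs are hypothetical. Four levels are assumed because the question's output shows three contrast columns.

```r
# Made-up data with a four-level ranking (hypothetical levels A < B < C < D).
set.seed(2)
toy <- data.frame(
  ranking      = factor(rep(c("A", "B", "C", "D"), each = 10)),
  age_in_years = sample(20:80, 40, replace = TRUE)
)
toy$wealth_indicator <- 2 + as.integer(toy$ranking) +
  0.05 * toy$age_in_years + rnorm(40, sd = 0.3)

# a * b is shorthand for a + b + a:b, so these two fits are identical.
fit_star <- glm(wealth_indicator ~ factor(ranking) * age_in_years, data = toy)
fit_long <- glm(wealth_indicator ~ factor(ranking) + age_in_years +
                  factor(ranking):age_in_years, data = toy)
all.equal(coef(fit_star), coef(fit_long))

# ordered() switches the coding to orthogonal polynomial contrasts:
# .L = linear, .Q = quadratic, .C = cubic trend across the ordered levels.
contr.poly(4)
fit_ord <- glm(wealth_indicator ~ ordered(ranking) * age_in_years, data = toy)

# Both codings span the same model, so the fitted values agree;
# only the meaning of the individual coefficients changes.
all.equal(fitted(fit_star), fitted(fit_ord))

# Prediction goes through predict(); newdata uses the original column names.
new_obs <- data.frame(ranking = "B", age_in_years = 42)
predict(fit_ord, newdata = new_obs)
```

Because predict() rebuilds the contrast columns itself, there is no need to compute the .L, .Q and .C values by hand when predicting the response.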