Missing Data – Handling Missing Data with Splines

generalized linear model, generalized-additive-model, missing data, splines

I am modeling data with a gamma response. Two continuous variables in my data set have nonlinear relationships with the response and a large number of nulls. One option I see is to bin/discretize the variables, put the nulls in their own bin, and fit a GLM. I am not a proponent of binning continuous data because of the loss of information. I would rather impute the missing data and fit a spline with a generalized additive model. However, for business reasons, I cannot impute the missing data for this model. Does anyone have ideas on how I could include the nulls as null values with a spline?

Best Answer

One strategy that can sometimes work is to replace the missing values with a constant, such as the mean or median of the observed values, and to add a flag indicating whether the value was missing. For example, the following:

Record id   Value
1           8
2           4
3           3
4           missing
5           5
6           9
7           missing
8           1
9           missing
10          7
11          2
12          missing
13          10
14          6

becomes

Record id   Value   Missing flag
1           8       0
2           4       0
3           3       0
4           5.5     1
5           5       0
6           9       0
7           5.5     1
8           1       0
9           5.5     1
10          7       0
11          2       0
12          5.5     1
13          10      0
14          6       0
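
As a minimal sketch of how the second table can be derived from the first, the following uses only the Python standard library. The variable names are made up for illustration, and the median is used as the fill value (for these data the mean of the observed values is also 5.5).

```python
from statistics import median

# Values from the example table; None marks a missing entry.
values = [8, 4, 3, None, 5, 9, None, 1, None, 7, 2, None, 10, 6]

# Fill value computed from the observed entries only.
observed = [v for v in values if v is not None]
fill = median(observed)

# Indicator column: 1 where the original value was missing, else 0.
missing_flag = [1 if v is None else 0 for v in values]

# Filled column: keep observed values, substitute the fill value otherwise.
filled = [fill if v is None else v for v in values]

for record_id, (v, flag) in enumerate(zip(filled, missing_flag), start=1):
    print(record_id, v, flag)
```

Both columns, the filled value and the flag, would then go into the model: the filled value through the spline term and the flag as an ordinary 0/1 covariate.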

Depending on your model class, this comes with a number of assumptions and limitations. For example, if you use an additive model without interactions, you are forcing every record with a missing value to take the spline's value at 5.5, and the coefficient on the missing flag can only shift that up or down by a fixed amount. If you think that other variables tell you a lot about these missing values, that may not be ideal, and you could try adding interactions between those variables and the missingness indicator.
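
To make the limitation concrete, here is a rough sketch of the design-matrix rows such a model would see. It is illustrative only: the piecewise-linear hinge basis is a stand-in for whatever spline basis a GAM package would actually build, and the knot locations and the second covariate `z` are hypothetical.

```python
def hinge_basis(x, knots):
    """Piecewise-linear spline basis: x plus one hinge term max(0, x - k) per knot."""
    return [x] + [max(0.0, x - k) for k in knots]


def design_row(x, flag, z, knots, interact=False):
    """One design-matrix row: intercept, spline basis of x, missing flag, covariate z.

    With interact=True an extra flag * z column lets the effect of z differ
    for records whose x was missing.
    """
    row = [1.0] + hinge_basis(x, knots) + [float(flag), z]
    if interact:
        row.append(flag * z)
    return row


KNOTS = [3.0, 7.0]  # hypothetical knot locations
FILL = 5.5          # fill value from the example table

# (filled x, missing flag, hypothetical second covariate z)
records = [(8.0, 0, 0.2), (FILL, 1, 0.2), (FILL, 1, 1.5)]

additive = [design_row(x, f, z, KNOTS) for x, f, z in records]
with_interaction = [design_row(x, f, z, KNOTS, interact=True) for x, f, z in records]

for row in additive:
    print(row)
for row in with_interaction:
    print(row)
```

In the additive rows, the intercept, spline, and flag columns are identical for every record with a missing value, so together they contribute one constant to the linear predictor for all such records. Only the `flag * z` column in the second set of rows lets another variable change what the model predicts specifically for the records with missing values.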

I find it hard to understand what business reasons would rule out a more sophisticated imputation. An off-the-shelf multiple imputation could be a reasonable option. Most approaches do not allow for non-linearities by default; the imputation models can be expanded to cover them, but that becomes a rather bespoke solution that can be tough to implement.

Technically, you could also fit a model that jointly models the covariates and can therefore handle missingness in them (e.g. the brms R package handles this quite nicely from a Bayesian perspective), but this is on the more complex and computationally challenging side.