Solved – Handling NAs in a regression – Data Flags

Tags: missing-data, regression

I am currently working with a large data set of about 30 variables. Almost every row has a missing value in at least one of the variables. I would like to run a regression with several of the variables. From my understanding, R (or any other stats program) will drop every observation that has an NA in at least one of those variables. Is there a way to stop R from doing that? I mean, is it possible to have R ignore the missing values but still run the regression on the remaining ones?
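
To illustrate that default behaviour with a toy example of my own: `lm()` silently drops incomplete rows because R's default `na.action` is `na.omit`.

```r
df <- data.frame(y = c(1, 2, 3, 4),
                 x = c(1, NA, 3, 4))
fit <- lm(y ~ x, data = df)
nobs(fit)  # 3, not 4: the row with the NA was silently dropped
```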

One of my professors once told me that it is possible to use "data flags", that is, to create dummies that equal 1 when the value is NA and 0 otherwise. I would create such a flag for every variable with NAs and then set the NAs to zero; afterwards I can just include the flags in the regression. That's what I was told, if I remember correctly. I wanted to google this procedure but could not find anything. Is this a legitimate approach? Are there any risks or other problems?
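
If I understood the suggestion correctly, it would look something like this in R (a minimal sketch on simulated data; all names are made up):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
x2[sample(n, 40)] <- NA                  # knock out some values of x2

df <- data.frame(y, x1, x2)
df$x2_flag <- as.integer(is.na(df$x2))   # dummy: 1 when x2 is NA, 0 otherwise
df$x2[is.na(df$x2)] <- 0                 # then set the NAs to zero
fit <- lm(y ~ x1 + x2 + x2_flag, data = df)
summary(fit)
```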

If so, is there another solution? I know about imputation and interpolation, which I can use for some of my variables, but not for all.

Just to make that clear: I do not have any NAs in my dependent variable.

Best Answer

The "flagging method"—often called the "dummy variable method" or "indicator variable method"—is used mostly to encode predictors with not applicable values. It can be used to encode predictors with missing values; when you're interested in making predictions for new data-sets rather than inferences about parameters, & when the missingness mechanism is presumed to be the same in the samples for which you're making predictions.

The problem† is that you're fitting a different model, one in which the slopes don't equate to the "true" slopes of the model in which all predictors are observed. See e.g. Jones (1996), "Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression", JASA, 91, 433. (An exception is experimental studies in which the predictors are orthogonal by design.)
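
A quick simulation (my own sketch, not from the paper) makes the bias visible when the predictors are correlated & some values of $x_2$ are missing completely at random:

```r
set.seed(42)
n  <- 1e4
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)              # predictors are correlated
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true slopes: 2 & 3
miss <- runif(n) < 0.3                 # 30% of x2 missing completely at random

x2_flag <- as.integer(miss)
x2_fill <- ifelse(miss, 0, x2)         # flag method: zero-fill + indicator
coef(lm(y ~ x1 + x2))                  # full data: recovers 2 & 3
coef(lm(y ~ x1 + x2_fill + x2_flag))   # flag method: slope of x1 is biased upwards
```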

Note that for maximum-likelihood procedures you can set the missing values to any arbitrary constant, not just zero: the coefficient of the flag adjusts to absorb whatever constant you choose.
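
Continuing the simulation above: changing the fill-in constant only re-parametrizes the flag's coefficient, so the fit itself is unchanged.

```r
x2_fill0 <- ifelse(miss, 0, x2)          # fill NAs with 0
x2_fill9 <- ifelse(miss, 9, x2)          # ... or with any other constant
fit0 <- lm(y ~ x1 + x2_fill0 + x2_flag)
fit9 <- lm(y ~ x1 + x2_fill9 + x2_flag)
all.equal(fitted(fit0), fitted(fit9))    # TRUE: identical fitted values
coef(fit0)[["x1"]] - coef(fit9)[["x1"]]  # essentially 0: same slope for x1
```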

† Suppose the model of interest is

$$\eta=\beta_0 + \beta_1 x_1 + \beta_2 x_2$$ where $\eta$ is the linear predictor. Now you introduce $x_3$ as an indicator for missingness in $x_2$: the model becomes

$$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 x_2 + \beta'_3 x_3$$

When $x_2$ is not missing you set $x_3$ to $0$: $$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 x_2$$

When $x_2$ is missing you set $x_3$ to $1$ & $x_2$ to an arbitrary constant $c$: $$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 c + \beta'_3$$

Clearly, when $x_2$ is missing, the slope of $x_1$ is no longer conditional on $x_2$; overall, $\beta'_1$ is an average of the conditional & marginal slopes of $x_1$, & in general $\beta'_1 \neq \beta_1$.
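
In the simulation above you can check this numerically: the conditional slope of $x_1$ is $2$, the marginal slope is about $2 + 3 \times 0.7 = 4.1$, & the flag-method estimate lands somewhere in between.

```r
coef(lm(y ~ x1))[["x1"]]                      # marginal slope, roughly 4.1
coef(lm(y ~ x1 + x2))[["x1"]]                 # conditional slope, roughly 2
coef(lm(y ~ x1 + x2_fill + x2_flag))[["x1"]]  # in between: the biased beta'_1
```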