Binary Logistic Regression in SAS – Assigning Values to Missing Data

Tags: logistic, missing-data, modeling, regression, sas

Many of the variables in the data I use on a daily basis have blank fields, some of which carry meaning. For example, take a variable holding the ratio of satisfactory accounts to total accounts: a blank in this column means the individual does not have any accounts, whereas a value of 0 means the individual has accounts but none of them are satisfactory.

Currently, these records are excluded from logistic regression analyses because they have missing values in one or more fields. Is there a way to include these records in a logistic regression model?

I am aware that I can fill these blank fields with a value outside the range of the data (e.g. for the ratio variable above, 9999 or -1, since a ratio always lies between 0 and 1). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!

Best Answer

In general, dealing with missing input values is always problematic. To the best of my knowledge, none of the existing methods can handle them without introducing some bias into the model, so you have to keep this in mind during your research. There are at least a few possible options:

  • ignore data with missing values (which I believe is what you do now); this is the "safest" option, but it can leave too little data to train a good model
  • fill missing values with a summary statistic of the observed data, for example:
    • mean value of the particular feature/dimension (for real-valued variables)
    • mode, i.e. the most frequent value, of the particular feature/dimension (for categorical variables; the median is only defined when the categories are ordered)
  • train a separate model to predict each missing value: if the data live in $X^k$ and any of the $k$ dimensions can have missing inputs, you can create $k$ models $M_i$, each predicting the $i$th dimension from the remaining ones, so $M_i : X^{k-1} \rightarrow X$, and use them to preprocess your data
  • use a generative model that can fill in missing values by itself; one possibility is a Restricted Boltzmann Machine
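The summary-statistic option can be sketched in a few lines. This is only a minimal illustration in plain Python, not SAS code; the helper names `impute_mean` and `impute_mode` and the toy columns are assumptions made up for the example, with `None` standing in for a blank field.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns: one real-valued, one categorical.
income = [40.0, None, 60.0, 50.0]
region = ["north", "south", None, "north"]

income_filled = impute_mean(income)
region_filled = impute_mode(region)
```

Note that filling with a single constant shrinks the variance of the column, which is one source of the bias mentioned above.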

As stated above, each method introduces some bias into the analysis (this has been shown in many papers, for many models), but it can also help you build a better model; everything depends on your data.
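To make the "separate model per feature" option more concrete, here is a rough sketch that uses a one-predictor least-squares line as the model $M_i$. The choice of a linear model, the function names, and the `age`/`balance` columns are all assumptions for illustration; in practice $M_i$ could be any regressor or classifier trained on the complete rows.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_with_model(rows, target, predictor):
    """Fill missing `target` values using a line fitted on the complete rows."""
    complete = [r for r in rows
                if r[target] is not None and r[predictor] is not None]
    intercept, slope = fit_line([r[predictor] for r in complete],
                                [r[target] for r in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the original rows are left untouched
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
        filled.append(r)
    return filled

# Hypothetical toy data: the last row is missing `balance`.
rows = [
    {"age": 20.0, "balance": 100.0},
    {"age": 30.0, "balance": 200.0},
    {"age": 40.0, "balance": 300.0},
    {"age": 50.0, "balance": None},
]
filled = impute_with_model(rows, "balance", "age")
```

The imputed values lie exactly on the fitted line, so they carry less noise than real observations would; this is again a form of bias to keep in mind.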

EDIT (after clarification)

A missing value of some $i$th feature/dimension $f_i \in X$ is a lack of observation/knowledge about which particular value $x \in X$ it has. One can imagine a situation where we ask people to fill out a multi-page survey, and after collecting all the data it turns out that one person's page is lost. We do not know what his/her response was, but we are quite sure there was one. On the other hand, a person could leave a question blank (without an answer) or write something like "I will not answer this question". This is not missing information; in fact, it is as informative as selecting one of the predefined boxes.

In such a scenario we simply have a categorical feature $f'_i \in X \cup \{ \emptyset \}$. We can either express it as a multi-valued feature, or encode it in unary form by replacing $f'_i$ with $|X|+1$ new binary features $f''_{ij}$, one for each $j \in X \cup \{ \emptyset \}$, such that $f''_{ij} = 1 \iff f'_i = j$. The choice between these methods is model- and data-dependent.
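The unary encoding described above could look roughly like the sketch below, where `None` plays the role of $\emptyset$. The second helper applies the same idea to a numeric variable such as the ratio from the question, by pairing a "blank" indicator with a placeholder value; that extension, the helper names, and the toy data are assumptions for illustration rather than part of the original answer.

```python
def one_hot_with_missing(values, categories):
    """Encode each value as |X|+1 binary features; the last one flags a blank."""
    levels = list(categories) + [None]
    return [[1 if v == level else 0 for level in levels] for v in values]

def ratio_with_indicator(ratios, fill=0.0):
    """Return (is_blank, value) pairs; blanks get a flag and a placeholder value."""
    return [(1, fill) if r is None else (0, r) for r in ratios]

# Hypothetical survey answers: the second respondent left the question blank.
answers = ["yes", None, "no"]
encoded = one_hot_with_missing(answers, ["yes", "no"])

# Hypothetical ratio column: blank means "no accounts", 0.0 means
# "has accounts, none satisfactory".
ratios = [0.0, None, 0.75]
pairs = ratio_with_indicator(ratios)
```

With the indicator in the model, the coefficient on the flag absorbs the effect of "no accounts", so the placeholder value no longer has to be a meaningful ratio. When the model includes an intercept, one of the binary columns is usually dropped to avoid perfect collinearity.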