Many of the variables in the data I use on a daily basis have blank fields, some of which carry meaning. For example, take a variable holding the ratio of satisfactory accounts to total accounts: a blank in this column means the individual has no accounts at all, whereas a value of 0 means the individual has accounts but none of them are satisfactory.
Currently, these records are excluded from logistic regression analyses because they have missing values in one or more fields. Is there a way to include these records in a logistic regression model?
I am aware that I can assign these blank fields a value outside the range of the data (e.g. for the ratio variable above, 9999 or -1, since a ratio lies between 0 and 1). I am just curious to know if there is a more appropriate way of going about this. Any help is greatly appreciated! Thanks!
Best Answer
In general, dealing with missing input values is always problematic. To the best of my knowledge, none of the existing methods can deal with it without introducing some bias into the model, so you have to account for this in your research. There are at least a few possible options:
As was previously stated, each method introduces some bias to the analysis (which has been proven in many papers, for many models), but it can also help you build a better model: everything depends on your data.
EDIT (after clarification)
A missing value of the $i$th feature/dimension $f_i \in X$ is a lack of observation or knowledge about which particular value $x \in X$ it takes. One can imagine a situation where we ask people to fill out a multi-page survey, and after collecting all the data it turns out that one of a respondent's pages is missing. We do not know what his/her response was, but we are quite sure there was one. On the other hand, a person could leave a question blank (without an answer) or write something like "I will not answer this question", which is not missing information; in fact, it is as informative as selecting one of the predefined boxes. In such a scenario we simply have a categorical feature, $f'_i \in X \cup \{ \emptyset \}$. We can either express it as a multi-valued feature, or encode it in unary form by replacing $f'_i$ with $|X|+1$ new binary features $f''_{ij}$, one for each $j \in X \cup \{ \emptyset \}$, such that $f''_{ij} = 1 \iff f'_i = j$. The choice between these methods is model- and data-dependent.
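The unary encoding above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function name `one_hot_with_blank`, the use of `None` to stand for $\emptyset$, and the toy yes/no responses are all hypothetical and not taken from the question's data.

```python
# Hypothetical sketch: None plays the role of the "blank" category (the empty-set symbol above).
BLANK = None


def one_hot_with_blank(values, categories):
    """Encode each value as |X| + 1 binary indicators, one per level in X plus one for blank."""
    levels = list(categories) + [BLANK]
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unexpected value: {value!r}")
        # Exactly one indicator is 1: the one whose level equals the value.
        rows.append([1 if value == level else 0 for level in levels])
    return levels, rows


# Toy survey answers (made up for illustration); None is a deliberately blank answer.
responses = ["yes", None, "no", "yes"]
levels, encoded = one_hot_with_blank(responses, ["yes", "no"])

for response, row in zip(responses, encoded):
    print(response, row)
```

Each row contains exactly one 1, so the blank answer is treated as just another level rather than as missing data. In a regression with an intercept one would typically drop one of the indicator columns to avoid perfect collinearity.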