R – Best Ways to Handle NA Values in Regression Predictor Variable

censoring, missing data, multiple regression, r, regression

I am running a multinomial logistic regression model in R (using the multinom function from the nnet package) with a set of 12 predictor variables. Some of these variables measure the time from a reference point until a particular event occurs in experimental trials. Where the specified event did not occur, an NA has been entered into the table. This affects a significant number of observations.

My understanding is that NA values should be treated appropriately before running a regression model. But in this case it is not clear to me how best to handle the "missing" data. Setting them to zero implies that the event happened instantaneously, when in reality it did not happen at all. Replacing NA values with the mean or median is also unsatisfactory, as the model should capture non-occurrences of specific events. Removing these observations from the model excludes too many data points.

What would be the "best" way to proceed in this case?

Best Answer

I agree with Tim that there is no "one size fits all" approach. One must consider the implications of a "not applicable" response for the variable, one's population of interest, and one's research questions. For example, in some cases an N/A response may indicate that the respondent is not a member of the population from which one wishes to sample. In other cases, an N/A may indicate either that an event will happen but hasn't been measured yet, or that it will never happen. The right solution depends on how one answers such questions.

That said, I've seen some posts (like this one or this one) where the following strategy (or a very similar strategy) is recommended for managing right-censored predictors:

  1. Create a variable (CENSORED) and code it as 1 if the event did not occur, and 0 if the event did occur.

  2. Recode the time variable to indicate the amount of time between when the event occurred and when the measurement period ended (TIME2END). Thus, if the event occurred 100 seconds before the measurement period ended, the person would get a score of 100. If the event didn't occur before the period ended, the person would get a score of 0.

These two variables would then both be included in your regression, and should provide reasonably complete info about your predictor.
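
As a rough illustration, the two steps above might look like this in R. This is only a sketch under assumptions: the helper `recode_censored`, the column names `outcome` and `event_time`, and the 300-second measurement period are all made up for the example, and the toy data is far too small for a meaningful fit.

```r
# Hypothetical helper: recode a time-to-event predictor that is NA when the
# event never occurred within the measurement period.
recode_censored <- function(event_time, period_length) {
  # Step 1: 1 if the event did not occur, 0 if it did
  censored <- as.integer(is.na(event_time))
  # Step 2: time between the event and the end of the period; 0 if no event
  time2end <- ifelse(is.na(event_time), 0, period_length - event_time)
  data.frame(CENSORED = censored, TIME2END = time2end)
}

# Toy data (assumed): each trial lasts 300 seconds; NA means the event never occurred.
trials <- data.frame(
  outcome    = factor(c("a", "b", "c", "a", "b", "c")),
  event_time = c(200, NA, 50, NA, 120, 300)
)
trials <- cbind(trials, recode_censored(trials$event_time, period_length = 300))

# Both recoded variables enter the model together in place of the raw time.
if (requireNamespace("nnet", quietly = TRUE)) {
  fit <- nnet::multinom(outcome ~ CENSORED + TIME2END, data = trials, trace = FALSE)
}
```

Note that the last trial (event at exactly 300 seconds) and the censored trials all get TIME2END = 0; only CENSORED tells them apart, which is why both variables need to be in the model.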

Jon