Solved – Regression Using Continuous Variable with Nulls

categorical-data · continuous-data · logistic · regression

I'm in a bit of a quandary with a logistic model I'm working on. As one of the explanatory variables, I want to include "Days since last visit" (or some transformation of it); however, about 20% of the population I'm predicting for has never made a visit, so their value for this would be Null.

Options I'm aware of:

  • Setting Nulls to +/- ∞. I'd prefer to avoid this, since it essentially writes off the prediction for 20% of my population as 0 or 1.

  • Adding a categorical variable for "Is Null" and regressing with that as well. This seems the most intuitive to me, but I'm still not sure how I'd transform the model to capture the continuous element (i.e. the difference between 1 day since visit vs. 4 days since visit). My gut says there's an elegant way to do this with an interaction variable, but I'm not sure what it is.

  • Transforming the continuous values into categorical groups (i.e. No Visit, 0-10 days, 11-20 days, etc.). Probably what I'm going with for now. I'd like the ability to demarcate granular propensities with a continuous variable, but this seems the most straightforward to me, and I think the data would be too thin to treat days since last visit directly as categorical (a rough sketch of the bucketing is below).
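To show what I mean by that third option, here's a minimal sketch in Python with pandas; the frame and column names (`df`, `days_since_visit`) are hypothetical, just to make the idea concrete:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: days_since_visit is NaN for the ~20% who never visited.
df = pd.DataFrame({"days_since_visit": [np.nan, 5, np.nan, 7, 12, 25]})

# Bucket the continuous values; include_lowest keeps a value of exactly 0
# in the first interval.
buckets = pd.cut(
    df["days_since_visit"],
    bins=[0, 10, 20, np.inf],
    labels=["0-10 days", "11-20 days", "21+ days"],
    include_lowest=True,
)

# Give the never-visited group its own explicit level instead of NaN.
df["visit_bucket"] = buckets.cat.add_categories("No Visit").fillna("No Visit")

# One-hot encode for the logistic model (drop_first avoids the dummy trap).
dummies = pd.get_dummies(df["visit_bucket"], prefix="visit", drop_first=True)
```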

Thanks in advance for your thoughts!

EDIT 8/28: I wanted to expand on the second point and show what I had in mind. Say I have a table of the following values:

| Last Visit | Flag | Interaction |
|------------|------|-------------|
| NULL       | 0    | 0           |
| 5          | 1    | 5           |
| NULL       | 0    | 0           |
| 7          | 1    | 7           |

I would include the Flag and Interaction variables in my regression, with the hope that, since Flag = 0 and Interaction = 0 are perfectly collinear, the interaction term's coefficient estimate would be based solely on the non-null data.
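To make that concrete, here's a minimal sketch in Python with pandas and statsmodels; the column names and the binary outcome `y` are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data mirroring the table above; y is a hypothetical binary outcome.
df = pd.DataFrame({
    "last_visit": [np.nan, 5, np.nan, 7, 3, 30, np.nan, 14],
    "y":          [0,      1, 0,      0, 1, 0,  1,      1],
})

df["flag"] = df["last_visit"].notna().astype(int)  # 1 = has visited
df["interaction"] = df["last_visit"].fillna(0)     # 0 for never-visited;
                                                   # equals flag * days otherwise

X = sm.add_constant(df[["flag", "interaction"]])
fit = sm.Logit(df["y"], X).fit(disp=0)
print(fit.params)
```

Since the interaction column is identically 0 whenever the flag is 0, the never-visited rows contribute nothing to the interaction coefficient's estimate; the flag's coefficient absorbs their baseline, and the slope on days is estimated only from actual visitors.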

Best Answer

By the way, what you have right now is the "missing data" problem. It can be such an issue that it's one of the reasons they run redundant checkups on you for like 5 things every time you go to the doctor, regardless of whether you feel it's related. No one wants to deal with the NULL monster.

It can be frustrating to deal with, but there are many options available: some more complex than others, some very, very complex, and there are tons of books written on the subject. It's not too bad, though.

First, consider what your goal is. Do you really need all the people who have never made a visit in the model? They can really cause problems if they aren't relevant to the information you're actually seeking (probably because they likely have other data that is NULL as well; just a guess), and especially if they have multiple NULL values. In that case I'd actually just delete them and perform a different, separate regression/analysis on them. I'm going to assume you feel they are important, though, as otherwise you would not have asked the question here.

You've already recognized this by asking the question, but I'm going to reiterate:

Null values, or the lack of data, should never be interpreted as a 0. A null is a literal lack of information; it's a blank. If you were to measure something and it was "0", then that's data, that's information. If you weren't able to measure anything, there's no data or information.

1st) I have never heard of setting the values to +/- infinity. Maybe it's a technique I'm not aware of, but it seems like it would mess up a lot of stuff.

2nd) I would DEFINITELY have a binary variable that indicates whether or not they have ever even visited. This is super important. The regression algorithm will likely assign some weight via a coefficient * (1) or (0) to balance the people who never visited against the ones who did. In fact, you may find that it is one of your most significant variables.

Although you could separate everything into buckets like you mention (it'd seemingly be easier), this would probably confuse the regression algorithm with the bucket containing the never-shows. You're also unnecessarily eliminating information by summarizing unique values into buckets; you're just throwing it away. Anyway, here's what I'd do, pending a few things:

You feel you have enough other data, say in other columns, about these no-shows that you'll be using in your regression. To be honest, though, if they are no-shows, then you likely have many more columns with corresponding Null cells for them. If you do, then at this point it's over my head and I would just move them into a separate regression to study.

A Null and a recorded measurement are as if from different worlds. I would even study the Nulls on their own, if it mattered; it depends on the context. Although NULL means a lack of data for a particular parameter, it may (likely) also be related to the values of many other variables you have for them.

They are special cases.

Anyway, here's the start of the fancier and likely much better approach. Depending on the statistics software you're using, it may already have a function that does something like this, or does it more robustly, or uses a different method.

Regression Imputation

By doing this, you're just allowing your regression algorithm to perform better. You're not adding any value or additional information, which is something to keep in mind. You do incur a penalty in terms of your margin of error, because you're not using real data; it's simulated.

So, you take the column that contains the NULLs and set it as your Y. You then run a regression on all the complete cases and get a formula, which you can then use to predict what each NULL value "would have been" if "it was there."

This only allows your regression algorithm to function correctly and accurately, though. If you do this, you definitely still need that additional binary column representing whether or not they are a no-show when you perform the "final regression."
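As a rough illustration of what's described above, here's a sketch of single regression imputation with scikit-learn; the auxiliary columns `age` and `n_orders` are made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: impute last_visit from other, fully observed columns.
df = pd.DataFrame({
    "last_visit": [np.nan, 5, np.nan, 7, 3, 30],
    "age":        [34, 51, 29, 62, 45, 58],
    "n_orders":   [0, 4, 1, 6, 3, 9],
})

observed = df["last_visit"].notna()
features = ["age", "n_orders"]

# Keep the no-show indicator *before* filling in imputed values,
# since the final regression still needs it.
df["never_visited"] = (~observed).astype(int)

# Fit on the complete cases, then predict the missing values.
reg = LinearRegression().fit(df.loc[observed, features],
                             df.loc[observed, "last_visit"])
df.loc[~observed, "last_visit"] = reg.predict(df.loc[~observed, features])
```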

For something fancier... Multiple Imputation

You can also get fancier: 1st) predict the NULL values by setting them as your Y, as explained above, so that you have a complete data set.

You take that complete data set and run a bunch of tests over it, such as a correspondence analysis and other checks of how "complete" your data is. I'll link the site I'm referencing below; it's been a while since I did one of these.

I believe you actually make multiple data sets with your new simulated values generated in different ways (i.e. using different variables each time you perform it, yielding a different predicted value each time).

Then, after measuring all your new data sets, you run tests that give you an analysis of your data. You then combine the results, testing your non-null data against the simulated data, and hopefully figure out which other x variables/situations you can use to most accurately predict what the value would actually be.

This gets you closer than a single regression imputation, as you're trying to adjust for the error penalty you're incurring by studying relationships to the NULLs.
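If it helps, here's one hedged sketch of the multiple-imputation idea using scikit-learn's IterativeImputer with posterior sampling, so each pass yields a different completed data set. The pooling here just averages the point estimates (full Rubin's rules would also combine the variances), and the columns and data are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: last_visit has NULLs, y is the binary outcome.
df = pd.DataFrame({
    "last_visit": [np.nan, 5, np.nan, 7, 3, 30, np.nan, 14, 2, 21],
    "age":        [34, 51, 29, 62, 45, 58, 40, 33, 27, 49],
    "y":          [0, 1, 0, 0, 1, 0, 1, 1, 1, 0],
})

features = ["last_visit", "age"]
fits = []
for seed in range(5):  # m = 5 imputed data sets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df[features]),
                             columns=features)
    X = sm.add_constant(completed)
    fits.append(sm.Logit(df["y"], X).fit(disp=0).params)

# Pool by averaging coefficients across the m fits.
pooled = pd.concat(fits, axis=1).mean(axis=1)
print(pooled)
```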

Yeah, so it can get messy.

Here is a nice summary of the techniques for treating NULL values, as well as how to recognize the TYPE of null value you have and what to do about it: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/
