Missing Data – How to Handle Non-Existent (Not Missing) Data

missing data

I've never found a good text or example on how to handle 'non-existent' data in the inputs to a classifier. I've read a lot about missing data, but what can be done about data that cannot or does not exist, in the context of multivariate inputs? I understand this is a complex question and that the answer will vary depending on the training method used.

For example, suppose I am trying to predict lap times for several runners from good, accurate data. Among many inputs, possible variables are:

  1. Input Variable – First time runner (Y/N)
  2. Input Variable – Previous laptime ( 0 – 500 seconds)
  3. Input Variable – Age
  4. Input Variable – Height
    .
    .
    . many more Input variables etc

And the output variable: Predicted Laptime (0 – 500 seconds)

A genuinely missing value for '2. Previous laptime' could be imputed in several ways, and in that case '1. First time runner' would always equal N. But when the data is non-existent because the runner is a first time runner ('1. First time runner' = Y), what value or treatment should I give '2. Previous laptime'?

For example, assigning '2. Previous laptime' a value of -99 or 0 can skew the distribution dramatically and make it look as though a new runner has performed well.
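To make the concern concrete, here is a tiny sketch with made-up numbers (the lap times and the choice of -99 as sentinel are purely illustrative, not real data). It compares the mean and standard deviation of '2. Previous laptime' with and without sentinel-filled first time runners:

```python
import numpy as np

# Hypothetical previous lap times (seconds) for four returning runners.
returning = np.array([300.0, 320.0, 310.0, 290.0])

# Two first time runners filled in with a sentinel value of -99.
with_sentinel = np.concatenate([returning, [-99.0, -99.0]])

mean_real = returning.mean()
mean_sentinel = with_sentinel.mean()
std_real = returning.std()
std_sentinel = with_sentinel.std()

print(f"mean without sentinel: {mean_real:.1f}, with: {mean_sentinel:.1f}")
print(f"std  without sentinel: {std_real:.1f}, with: {std_sentinel:.1f}")
```

The sentinel rows pull the mean far below any real lap time and inflate the spread, which is the distortion I am worried about.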

My current training methods are logistic regression, SVM, neural networks and decision trees.

Best Answer

Instead of assigning a special value to the non-existent previous lap time of a first-time runner, simply use an interaction term between the previous lap time and the inverse of the first-time-runner dummy:

$$Y_i=\beta_0+\beta_1 FTR_i+\beta_2 (NFTR_i)\times PLT_i+...$$

where

  • $Y_i$ is your output (response) variable, the predicted lap time,
  • $...$ stands for your other variables,
  • $FTR_i$ is the dummy for a first-time runner,
  • $PLT_i$ is the previous lap time and
  • $NFTR_i$ is the dummy for a non-first-time runner, equal to 1 when $FTR_i=0$ and 0 otherwise.

Then the model for first-time runners ($FTR_i=1$, $NFTR_i=0$) will be:

$$Y_i=(\beta_0+\beta_1) + ...$$

and for non-first-time runners ($FTR_i=0$, $NFTR_i=1$):

$$Y_i=\beta_0+ \beta_2 PLT_i + ...$$
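As a rough illustration, the model above can be fitted by ordinary least squares in a few lines of NumPy. This is only a sketch under stated assumptions: the data is synthetic, the "true" coefficients (50, 260, 0.8) are invented, the other variables are left out, and the names `design_matrix` and `prev_lap` are hypothetical. Non-existent previous lap times are stored as NaN and never enter the fit.

```python
import numpy as np

def design_matrix(ftr, prev_lap):
    """Build the columns [1, FTR, NFTR * PLT] of the model above."""
    ftr = np.asarray(ftr, dtype=float)
    prev_lap = np.asarray(prev_lap, dtype=float)
    nftr = 1.0 - ftr  # dummy for non-first-time runners
    # Use np.where rather than nftr * prev_lap: 0 * NaN is NaN, so a
    # plain product would leak the NaN placeholders into the matrix.
    interaction = np.where(nftr == 1.0, prev_lap, 0.0)
    return np.column_stack([np.ones_like(ftr), ftr, interaction])

# Synthetic data (assumption): roughly half the runners are first-timers.
rng = np.random.default_rng(0)
n = 200
ftr = rng.integers(0, 2, size=n)  # 1 = first-time runner
prev_lap = np.where(ftr == 1, np.nan, rng.uniform(200.0, 400.0, size=n))

# Invented coefficients beta_0, beta_1, beta_2 used to simulate lap times.
true_beta = np.array([50.0, 260.0, 0.8])
X = design_matrix(ftr, prev_lap)
y = X @ true_beta + rng.normal(0.0, 5.0, size=n)

# Ordinary least squares estimate of (beta_0, beta_1, beta_2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta_0, beta_1, beta_2:", beta_hat)
```

Note that in this linear setting the interaction column is numerically the same as filling the previous lap time with 0 while also keeping the $FTR_i$ dummy in the model: the dummy's coefficient $\beta_1$ absorbs the offset for first-time runners, so the 0 is never read as a fast lap. The same two columns could presumably be passed as features to other learners, but whether that works as cleanly outside a linear model is not something this sketch shows.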
