Solved – Is it better to replace missing values by mean or mean by class

data-imputation, missing-data, scikit-learn

I am preparing a dataset for a binary classification. Some of the columns contain missing values.

I am considering either handling the missing values manually with pandas, replacing each missing value with the per-class mean of its column, or using scikit's Imputer method.

Two questions: is it better to replace the missing values with the mean of the class rather than the mean of the whole column, and if so, why isn't there such an option in scikit's Imputer method?
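For concreteness, here is a minimal sketch of the two options in pandas (the column names `x` and `y` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: feature "x" with missing values, class label "y"
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 10.0, np.nan, 14.0],
    "y": [0, 0, 0, 1, 1, 1],
})

# Option 1: column-wide mean imputation (what scikit's imputer does)
df["x_col_mean"] = df["x"].fillna(df["x"].mean())

# Option 2: class-level mean imputation — fill each missing value
# with the mean of its own class
df["x_class_mean"] = df["x"].fillna(df.groupby("y")["x"].transform("mean"))
```

Note that class-conditional imputation needs the label, so it can only be applied to training data; that may be one reason a transformer like scikit's Imputer, which must also run on unlabeled data at prediction time, does not offer it.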

Some data-related info:

For some features (sentiment analysis features) there are a lot of missing values, which comes from the fact that some users (my training samples are pairs of users) don't share common hashtags (it's a Twitter analysis task). Most of the features are topological, though, so it doesn't make sense to discard all those rows.

On the other hand, one could argue that since the users don't share any hashtags, they should have a zero value for those features. But this isn't necessarily true: my crawl includes only the latest 500 tweets for each user, not every tweet they have ever produced.

Also, from visualizing the data I can see clear correlations between those features and the class values, judging from the values that are not clustered around zero due to the lack of common hashtags.

Here are some visualizations

Here are two sentiment features. They are not exactly linearly separable, but the problem can be observed (not seen exactly, because of density): the stripe at the crossing of the axes holds close to half the values.

Here is another one with the same problem: the small stripe at the crossing of the axes holds close to half the values.

Here is the same visualization, but from Weka this time.


Best Answer

Some questions I would ask if I could explore the data:

  • Are the missing values somehow correlated with a feature? Does it make sense instead to create a new binary feature tracking whether a given feature is missing?
  • How many observations are missing feature values? Do I have enough data to simply discard all observations that are not complete? If I regularize a model with the "missing data" binary feature I mentioned earlier, does regularization push this feature out? (If so, you can probably drop the observations without issue).
  • Try clustering the various features with missing values according to output label. Is there clear separation between the classes? If so, you may want to impute using a class-level mean instead of a mean across all observations.
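The missing-indicator idea from the first bullet can be sketched like this (the column name `sentiment` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sentiment feature with missing values
df = pd.DataFrame({"sentiment": [0.4, np.nan, -0.2, np.nan]})

# Binary feature flagging whether the original value was missing;
# a regularized model can then tell you whether missingness is informative
df["sentiment_missing"] = df["sentiment"].isna().astype(int)

# Afterwards, impute the original column (here with the overall mean)
df["sentiment"] = df["sentiment"].fillna(df["sentiment"].mean())
```

If regularization drives the coefficient of `sentiment_missing` to zero, that is evidence the missingness itself carries no signal, and dropping the incomplete observations is a safer option.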

Your intuition is right that one often imputes missing data by using class-level means instead of sample-level means, but that might not be necessary here.
