Solved – Logistic Regression with empty cells


I have a data set that I need to use to train a model and then make predictions.
Let's say I want to predict what people say about food items produced by a cake shop. Assume people have rated the taste of the food items as good, average or bad in different years, by different methods. The columns will then be:

  • Food item
  • No. of people saying Good
  • Year it was said Good
  • Method of stating Good (By tasting or by rumours)
  • No. of people saying Bad
  • Year it was said Bad
  • Method of stating Bad (By tasting or by rumours)
  • No. of people saying OK
  • Year it was said OK
  • Method of stating OK (By tasting or by rumours)

In this case, some food items will have been rated Good with no mention of whether they are Bad or OK. The corresponding columns then have to be left empty. Zero can be entered in the "No. of people …" columns, but not in the "Year …" and "Method …" columns.

If I use Logistic Regression to train on this data set, will it be valid, given that the training data will contain a number of empty cells? And if Logistic Regression is not suitable in this scenario, what supervised machine learning method can be used instead?

Best Answer

If I understand correctly what you are aiming for, you might want to rearrange your data a little. In your example, each year has a set of explanatory variables:

  1. Number of people who said it was good based on rumours
  2. Number of people who said it was ok based on rumours
  3. Number of people who said it was bad based on rumours

And the same three for those based on tasting. That makes six columns for each year, and if nobody said anything for a given combination, the value is zero. So in this case the number of input columns is six times the number of years, and what you are predicting is 1/0 for profit/no-profit.

In other words, "year" as such is not a predictor, but "number of people who did A in year X" is.
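
To make the rearrangement concrete, here is a minimal sketch in plain Python. The records, item names, years and the helper to_wide are all made up for illustration and are not from the question; the sketch only assumes the raw data can be written as one row per (food item, year, method, rating, number of people). It builds six count columns per year and puts a zero wherever nobody said anything, so no cell is left empty.

```python
from collections import defaultdict
from itertools import product

# Hypothetical long-format records:
# (food item, year, method, rating, number of people).
records = [
    ("cupcake", 2015, "tasting", "good", 12),
    ("cupcake", 2016, "rumours", "bad", 3),
    ("brownie", 2015, "rumours", "ok", 5),
    ("brownie", 2016, "tasting", "good", 7),
    ("brownie", 2016, "tasting", "bad", 1),
]

YEARS = sorted({year for _, year, _, _, _ in records})
METHODS = ("rumours", "tasting")
RATINGS = ("good", "ok", "bad")

# One column per (year, method, rating), i.e. six columns per year.
COLUMNS = list(product(YEARS, METHODS, RATINGS))


def to_wide(rows):
    """Return {item: [count per column]}, with 0 where nobody said anything."""
    counts = defaultdict(lambda: defaultdict(int))
    for item, year, method, rating, n in rows:
        counts[item][(year, method, rating)] += n
    return {
        item: [cells.get(col, 0) for col in COLUMNS]
        for item, cells in counts.items()
    }


features = to_wide(records)
for item, row in features.items():
    print(item, row)
```

Each list in features is one row of the design matrix, with columns in the order given by COLUMNS. Together with a 1/0 target per food item, rows like these could be passed to whatever logistic regression implementation you use, since they contain only numbers and no missing values.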
