Solved – Logistic Regression with empty cells

I have a data set that I need to train a model on and then use for prediction.
Say I want to predict what people say about food items produced by a cake shop. Assume people have rated the taste of each food item as good, OK (average) or bad, in different years and by different methods. The columns would be:

Food item
No. of people saying Good
Year it was said Good
Method of stating Good (By tasting or by rumours)
No. of people saying Bad
Year it was said Bad
Method of stating Bad (By tasting or by rumours)
No. of people saying OK
Year it was said OK
Method of stating OK (By tasting or by rumours)

In this case, there will be food items that people have rated as Good, with no mention of whether they are Bad or OK. In such cases, the respective columns have to be left empty. For the "No. of people …" columns, zero can be entered, but not for the "Year …" and "Method …" columns.

If I use logistic regression to train on this data set, will it be valid given that the training data will contain a number of empty cells? Or, if logistic regression is not suitable in this scenario, what supervised machine learning method can be used instead?

Best Answer

If I understand correctly what you are aiming for, you might want to rearrange your data a little. In your example, you have the following set of explanatory variables for each year:

Number of people who said it was good based on rumours
Number of people who said it was ok based on rumours
Number of people who said it was bad based on rumours

And the same three for opinions based on tasting. That makes six columns for each year, and if nobody gave a particular opinion, the value is zero. So in this case the input has six columns times the number of years, and what you are predicting is 1 or 0 for profit/no-profit.

What it means is that "year" as such is not a predictor, but "number of people who did A in year X" is.
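The rearrangement described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the records, the item names, the `opinion_method_year` column naming and the `to_wide` helper are all hypothetical and not taken from the question's actual data. It shows how observed counts in a "long" layout turn into six columns per year, with zero filled in wherever nobody gave that opinion, so no cell stays empty.

```python
from itertools import product

# Hypothetical long-format records: (food item, year, method, opinion, count).
# Only combinations that were actually observed are listed.
records = [
    ("cake", 2019, "tasting", "good", 40),
    ("cake", 2020, "rumours", "bad", 5),
    ("pie", 2019, "rumours", "good", 12),
    ("pie", 2020, "tasting", "ok", 7),
    ("pie", 2020, "tasting", "bad", 3),
]

years = sorted({r[1] for r in records})
methods = ["tasting", "rumours"]
opinions = ["good", "ok", "bad"]

# One column per (year, method, opinion): six columns for each year.
columns = [f"{o}_{m}_{y}" for y, m, o in product(years, methods, opinions)]


def to_wide(records, columns):
    """Return {item: feature list}, filling unobserved combinations with 0."""
    wide = {}
    for item, year, method, opinion, count in records:
        # Start every item with a zero in every column.
        row = wide.setdefault(item, dict.fromkeys(columns, 0))
        row[f"{opinion}_{method}_{year}"] += count
    return {item: [row[c] for c in columns] for item, row in wide.items()}


features = to_wide(records, columns)
for item, values in features.items():
    print(item, values)
```

Each feature list, paired with a 1/0 label per food item, could then be passed to whatever logistic regression implementation you use; the year and method no longer appear as separate columns that might be empty.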
