Solved – Training data has categorical target variable, but I want to get numeric target variable for new samples

algorithmscomputational-statisticsmachine learningneural networksregression

I hope I am asking the question to the right section of StackExchange.

I have data which consists of 5 independent variables, and 1 dependent (target) variable. The independent variables are numeric, and the dependent variable is categorical (eighter dead or alive). A part of data set is in the following image.

enter image description here

I know if I use machine learning algorithms trained with the sample data, then I can predict target variable with the same representation (either dead or alive). However, I don't want to get the target variable in a binary or categorical format, instead, I want to predict the result in percentage (as continues variable). For new samples, would that be possible to get results like the following image (the target variable is the percentage of being alive). I will appreciate if you have any suggestion, either machine learning algorithms, or a statistics method/algorithm etc.
If anything is not clear, please let me know. Thank you.

enter image description here

Best Answer

Interesting question and I would suggest to spend more time on problem formulation / to shape the data in the right way.

At the first glance, it seems impossible task. How can you get more information (a percentile number) from a limited data set (binary label)? However, if you have a right assumption / parametric form of the independent variables, the task is possible.

I will demonstrate the idea by couple examples.

  • Let us start with a slightly different data: Suppose we have 4 binary input variables. Then we can divide data into $2^4=16$ "groups", and if we have many data points to train the model, we achieve the task, i.e., in each group, there are sufficient data to learn the percentile.

  • Now, if we have 4 categorical variables, and each one has 5 possible values. Number of grows would be $5^4=625$ groups, we need much more data to learn the percentile in each group.

  • In your example, if 4 input variables are continuous. Then there are "infinite number of groups", how can we learn?

One way of doing it is put assumptions to the model and have some parametric form of the distribution. For example, we can assume the inputs are 4 dimensional Gaussian. With the assumption, we have limited "degree of freedom", and it is possible to learn the percentile with limited data.

Related Question