Regression – How to Rank and Predict an Outcome with or without a Machine Learning Model

classification, logistic, machine learning, neural networks, regression

I am a novice data scientist and am currently working on a simple ML project.

I have a dataset that looks like below

We derive the outcome label based on the rule: `if booked qty is more than 50% of target final qty, we consider it as positive, else negative`.

The ML model will only be using features from feat_1 to feat_n.

Target final Qty and Booked Qty are only used for deriving labels (and will not be used in model building).

Now I have two business objectives

a) Let's say I would like to select all negative items and follow up with the customers (steer them towards booking larger order quantities). However, instead of reaching out to both req_id = 2 and req_id = 4, I would like to focus on the req_id that has a better chance of becoming positive. I would like to have some likelihood or ranking measure based on input features that can help save our resources. What I have shown is just a sample; in real life, there might be thousands of records. Is there any way to rank these rule-based labels? Is there any simple statistical or ML approach that you can suggest to rank the labels (even before the prediction for an unseen data point)?

b) I know we can get a likelihood for new data points if I use logistic regression. But how can I rank the existing labels themselves (with or without an ML model)? Is it even possible?

c) As you can see, I am generating labels based on a rule. In that case, do I even need ML for this problem? For example: if a new or existing customer shares a requirement (req_id) with us indicating the target final qty of a product and that he will book/purchase on a certain date later, we would like to know whether he will meet his target final qty or not, based on his input characteristics etc. Currently, we see that most customers don't meet their target final qty. So, we would like to know whether they will meet (at least 50% of) the target qty. I think the prediction will be useful in the time interval between the req stage and the order booking stage.

d) Or is there any other ML method or simple statistical method that you would suggest for this problem? Help please

Can you help me with the above, please?

Best Answer

a) The probability values from a well-calibrated$^{\dagger}$ model will be your friend. If you have the budget to contact $500$ people, contact the $500$ people most likely to respond. Many machine learning methods output probability values that one can map to a discrete outcome by using a threshold, but doing so has major drawbacks, such as not being able to identify the most likely customers. This gets at the lift curve mentioned in the comments to your question (which also is mentioned in the "major drawbacks" blog post I linked).
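A minimal sketch of that ranking idea in R, on simulated data. The feature names `feat_1` and `feat_2`, the simulated coefficients, and the budget of 500 are assumptions for illustration, not values from the question. It also scores the same rows the model was fitted on, which keeps the sketch short; in practice you would score held-out or new records.

```r
set.seed(1)
n <- 5000

# Hypothetical features standing in for feat_1 ... feat_n
d <- data.frame(feat_1 = rnorm(n), feat_2 = rnorm(n))

# Simulated outcome (assumed data-generating process): 1 = positive
p_true <- plogis(-1 + 1.2 * d$feat_1 - 0.8 * d$feat_2)
d$y <- rbinom(n, size = 1, prob = p_true)

# Logistic regression on the features only
fit <- glm(y ~ feat_1 + feat_2, family = binomial, data = d)
d$p_hat <- predict(fit, type = "response")

# Rank by predicted probability and keep the top `budget` rows
budget <- 500
ranked <- d[order(d$p_hat, decreasing = TRUE), ]
contact <- head(ranked, budget)

# Lift: positive rate among the contacted group relative to the overall rate
lift <- mean(contact$y) / mean(d$y)
lift
```

A lift well above 1 means the ranked list concentrates positives far better than contacting people at random would.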

b) You can plug in your values to the logistic regression you would fit. Example:

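    set.seed(2022)  # kept from the original; nothing below is random
    # Ratios (presumably booked qty / target final qty) for five example requests
    x <- c(
      60/100,
      10/200,
      40/50,
      20/80,
      750/1000
    )
    # Rule-based labels: 1 if the ratio exceeds 0.5, else 0
    y <- c(1, 0, 1, 0, 1)
    L <- glm(y ~ x, family = binomial) # This is the logistic regression
    # Fitted probabilities: inverse logit of the linear predictor
    round(1/(1 + exp(-predict(L))), 2)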

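The caveat is that you have perfect separation, because you are creating the labels based on a rule: if x > 0.5 (booked quantity above 50% of target), then positive, else negative. With perfect separation the maximum likelihood estimate does not exist, so `glm` will typically warn and the fitted probabilities get pushed to essentially 0 or 1.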
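To see the separation problem directly, here is a small sketch that refits the same toy data and collects whatever warnings `glm` raises. The exact warning text can vary, so treat the comments as what one would typically expect rather than guaranteed output.

```r
# Same toy ratios as in the example above; labels follow the 50% rule
x <- c(60/100, 10/200, 40/50, 20/80, 750/1000)
y <- as.integer(x > 0.5)

# Collect any warnings glm() raises instead of only printing them
warns <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warns                                  # typically mentions fitted probabilities of 0 or 1
p_hat <- predict(fit, type = "response")
round(p_hat, 2)                        # fitted values pushed towards 0 and 1
coef(fit)                              # very large slope, a symptom of separation
```

Because every fitted probability sits at roughly 0 or 1, the model cannot rank items within the negative group: it only reproduces the rule.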

c) Rule-based methods might work fine, but you need to come up with the rules. The appeal of machine learning is that the machine learns the rules from the data, rather than being explicitly told what the rules are.

d) I don't think you have the data to do a machine learning problem, and it does not seem like what you're doing makes sense for machine learning. What you would want for machine learning is some measured value that you want to predict in the future. For instance, based on historical records of target quantity and booked quantity, what were the outcomes? It might turn out that $50\%$ is an interesting point, but probably not everyone above $50\%$ turned out positive and not everyone below turned out negative.

If you just want to assign people to "positive" and "negative" classes based on the booked:target ratio, then you always know the outcome, and you know the outcome with certainty. Even in a situation where you achieve perfect separation in your model, it is conceivable that there is some small chance that a point could wind up on the wrong side of the fence (a positive case in the realm that you thought was exclusively negative, for instance) but simply did not. However, you absolutely know the positive/negative outcome once you know the booked:target ratio.
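Put differently, once both quantities are known the label is a plain function of the ratio, with no model involved. A tiny sketch, where `label_from_ratio` is a hypothetical helper and the quantities are toy values chosen to match the ratios used earlier:

```r
# Hypothetical helper: applies the 50% rule from the question
label_from_ratio <- function(booked_qty, target_qty, threshold = 0.5) {
  ifelse(booked_qty / target_qty > threshold, "positive", "negative")
}

# Toy values (assumed for illustration)
booked <- c(60, 10, 40, 20, 750)
target <- c(100, 200, 50, 80, 1000)

labels <- label_from_ratio(booked, target)
labels
# "positive" "negative" "positive" "negative" "positive"
```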

$^{\dagger}$You actually don't need calibration of the probabilities, but the right order. The top five people are the top five, whether their probabilities are $0.9, 0.85, 0.8, 0.75, 0.7$ with the others below $0.69$ or $0.9, 0.55, 0.5, 0.35, 0.2$ with the others below $0.19$. What calibrated probabilities could give you, however, is the ability to spend less than your allocated budget. Perhaps you have the budget for $500$, but after the first $400$, there is a steep drop in probability. You might be able to save $20\%$ of your funds without taking much of a hit in quality.
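The footnote's point can be checked with a short sketch. The two score vectors below are hypothetical; they share an ordering but differ in spacing, and `expected_hits` is an illustrative helper that is only meaningful if the probabilities are calibrated.

```r
# Two hypothetical score vectors: same ordering, different spacing
p_a <- c(0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.40, 0.10)
p_b <- c(0.90, 0.55, 0.50, 0.35, 0.20, 0.15, 0.10, 0.05)

# The top five are the same people under either set of scores
top5_a <- order(p_a, decreasing = TRUE)[1:5]
top5_b <- order(p_b, decreasing = TRUE)[1:5]
same_top5 <- identical(top5_a, top5_b)
same_top5   # TRUE

# With calibrated probabilities, the expected number of positives among the
# top k contacts is the sum of the k largest probabilities
expected_hits <- function(p, k) sum(sort(p, decreasing = TRUE)[seq_len(k)])
hits_a <- expected_hits(p_a, 5)   # 4.0
hits_b <- expected_hits(p_b, 5)   # 2.5
```

The ranking alone tells you whom to contact first; the calibrated values tell you how much each additional contact is expected to return, which is what lets you stop before the budget runs out.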
