I am planning to conduct an A/B test with data obtained through a deep learning algorithm. Say I have a binary classification dataset of about 100k rows, each labelled yes or no by a machine learning model. However, the model's prediction accuracy was 90%, which means there may still be falsely labelled rows, i.e. false yeses and false nos.

If I run an A/B test with this dataset, say to find out which of two options (old vs. new) is more profitable, how do I deal statistically with the 10% chance of a false yes or no?

To get a more accurate result from the A/B test, there should be a way to mitigate the influence of the 10% of wrong predictions in the dataset, mathematically or statistically.

Please let me know the name of such a method or any relevant info, so that I can search and study it myself. An easy but thorough explanation would also be much appreciated.

ex)

<labelled dataset I got from a sentiment labelling model with 90% accuracy>

review | sentiment | (accurate or not) | date |
---|---|---|---|
review1 | positive | (accurate) | 01-01-2022 |
review2 | positive | (inaccurate) | 01-01-2022 |
review4 | positive | (accurate) | 01-02-2022 |
review5 | positive | (accurate) | 01-02-2022 |

Using the above dataset, I want to split it into positive reviews on 01-01-2022 (control) and positive reviews on 01-02-2022 (experiment), and see if a new campaign changed users' sentiment. Put simply, if I do nothing about the false positives produced by the sentiment model, the result would be "no change after the campaign", because the rate of positive reviews is the same on both dates, even though review2 is in fact a negative one.

When a dataset like this is so large that I cannot manually check whether each individual review is correctly labelled, and I just run an A/B test on it, those wrongly labelled reviews (like review2 in the example) will affect the test result, I guess.

What should I do to mitigate the impact of those wrongly labelled reviews? Should I simply avoid running A/B tests on datasets produced by deep learning classification models such as sentiment analysis (since 100% accuracy is impossible for any deep learning model)? Or is there a mathematical or statistical method to mitigate it?

I hope this makes clear what I am asking.

## Best Answer

## About the general setup of the A/B-Test

First of all, an A/B-Test is by definition a randomized experiment, i.e. users / visitors are assigned to treatments / groups at random. To measure the impact of the treatments, everything is kept the same except the treatment itself.

**Regarding selection of users for the A/B-Test:** When only a subset of users is selected, the impact of the treatment can only be measured for that subset. This is valid from a practical point of view if the treatment is only available to that subset.

Otherwise, a regular A/B-Test on all users is performed, followed by an analysis of how the subgroups of interest were affected. This is called cohort analysis.

**Applied to your case:** Is the new campaign / treatment only available to users with a positive sentiment? Does the campaign have the potential to change the profitability or sentiment of users with a negative sentiment? Are these changes potentially interesting for the business? If you answer yes to any of these questions, or are unsure, I strongly recommend performing the test on all users, accompanied by a cohort analysis.
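A cohort analysis is just the overall A/B comparison repeated within each subgroup. A minimal sketch with pandas, using a made-up results table (the column names `group`, `sentiment`, and `converted` are hypothetical):

```python
import pandas as pd

# Hypothetical per-user results: randomly assigned group plus the
# model-predicted sentiment cohort and a binary outcome metric.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B", "A", "B"],
    "sentiment": ["positive", "negative", "positive", "positive",
                  "negative", "positive", "negative", "positive"],
    "converted": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Overall A/B comparison across all users.
overall = df.groupby("group")["converted"].mean()

# Cohort analysis: the same comparison within each sentiment cohort.
cohorts = df.groupby(["sentiment", "group"])["converted"].mean()

print(overall)
print(cohorts)
```

The statistical test of choice is then applied per cohort as well as overall.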

**Regarding assignment of users to groups:** If the assignment of users to groups does not happen at random, be aware that you introduce potential confounding factors. It is best to avoid this, or to make sure that no bias is introduced this way.

**Applied to your case:** Are you sure that group assignment based on review date (01-01-2022 vs. 01-02-2022) is a random assignment? Differences like "national holiday / weekend vs. working day" or "sunny vs. cloudy weather" can influence the basic sentiment of users and hence affect the base probability of leaving a positive review. Keep in mind that the result of such an A/B-Test is used to make general statements about the usefulness of the new campaign / treatment. Is the validity of this general statement affected if only users with certain review dates are compared?

I strongly recommend assigning treatments at random whenever possible.
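Random assignment itself is trivial to implement; a minimal sketch with made-up user ids:

```python
import random

rng = random.Random(42)  # fixed seed so the assignment is reproducible

user_ids = [f"user_{i}" for i in range(10)]  # hypothetical user ids

# Each user gets control or treatment with equal probability,
# independent of any user attribute (date, sentiment, etc.).
assignment = {u: rng.choice(["control", "treatment"]) for u in user_ids}
```

The key point is that the assignment depends on nothing but the random draw, so no user attribute can act as a confounder.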

## About the uncertainty of the classification model

**If the classification error rate is INDEPENDENT of the group assignment**, then both groups are affected equally. The variance induced by these misclassifications is "captured" by the subsequent statistical test, e.g. the G-Test or t-test. **BUT** if the classification error rate affects the measured metric (e.g. change of sentiment) and the error is too high compared to the effect size of this metric, then the effect may be "drowned in noise", and the statistical test used to evaluate the A/B-Test result will show no significant difference.

You can simulate such scenarios using the Monte Carlo method. Despite its fancy name, it is nothing more than repeated random sampling from defined distributions, calculating the desired function / outcome, and thereby obtaining a distribution for that outcome.
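As a sketch of such a simulation (all rates, sample sizes, and the choice of a simple pooled two-proportion z-test are assumptions for illustration): each true label is flipped with the classifier's error probability, and we count how often the test still detects the group difference.

```python
import math
import random

def two_prop_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

def detection_rate(rate_a, rate_b, n, flip_rate, runs=500, seed=1):
    """Share of simulated experiments in which |z| > 1.96 after each
    label is flipped with probability `flip_rate` (the model's error)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        observed = []
        for true_rate in (rate_a, rate_b):
            # Observed label = XOR of true label and a label-flip event.
            k = sum((rng.random() < true_rate) != (rng.random() < flip_rate)
                    for _ in range(n))
            observed.append(k / n)
        if abs(two_prop_z(observed[0], observed[1], n, n)) > 1.96:
            hits += 1
    return hits / runs

clean = detection_rate(0.50, 0.55, 2000, flip_rate=0.00)
noisy = detection_rate(0.50, 0.55, 2000, flip_rate=0.10)
print(clean, noisy)
```

A 10% flip rate pulls the observed rates towards 0.5 (here 0.55 becomes 0.54), shrinking the measurable effect size, so the detection rate with noisy labels comes out lower than with clean ones.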

**How do I know whether it is independent?** In your particular case, I would assume independence if the classification error is roughly the same for both review dates (01-01-2022 and 01-02-2022). This can be checked by manual analysis of a sample; again, a statistical test can be performed to see whether there is a significant difference. If you get rid of this group-determining factor (see the section about confounders above), you can assume independence.

**Is it possible to just "subtract" the misclassifications from the final result?** No, since we do not know which instances have been misclassified. But one can apply basic probability calculations (or Monte Carlo) to estimate the worst / best / average case effect of the classification error rate on the obtained results.
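One such basic probability calculation is the Rogan–Gladen correction: if the classifier's sensitivity and specificity are known (e.g. measured on a held-out validation set), the observed positive rate can be turned into an estimate of the true positive rate. A minimal sketch, with assumed error rates:

```python
def corrected_rate(p_obs, sensitivity, specificity):
    """Rogan-Gladen correction: estimate the true positive rate from the
    observed (model-labelled) positive rate, given the classifier's
    sensitivity (true positive rate) and specificity (true negative rate)."""
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("classifier must be better than chance")
    # Clamp to [0, 1]: the raw estimate can fall slightly outside due
    # to sampling noise in p_obs.
    return min(1.0, max(0.0, (p_obs + specificity - 1) / denom))

# E.g. 60% of reviews labelled positive by a model assumed to be 90%
# accurate on both classes (sensitivity = specificity = 0.9):
print(corrected_rate(0.60, 0.9, 0.9))  # -> 0.625
```

Applying the same correction to both groups gives corrected rates to compare, though note it adjusts the point estimates only; the extra uncertainty it introduces is exactly what the Monte Carlo approach above can quantify.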