Solved – Estimating bounds on false positives rate

classification, estimation

I would like to estimate bounds on the false positive rate of a binary classifier. In my sample data I have 50% positive data points and 50% negative data points. However, in the real data, which I don't have access to, I estimate that there will be $N$ positive samples and $N^2$ negative samples, where $N$ is large, on the order of millions. Since I am looking for a needle of size $N$ in a haystack of size $N^2$, it is very important that my false positive rate stays as close to zero as possible.

I have 40 thousand positive samples and 40 thousand negative samples. Up to a recall of 0.8, I have a false positive rate of 0. I would like to use this to estimate the real false positive rate. I can model the false positive rate as the probability of labeling a negative sample as positive. Let's call this $P_{np}$ (for negative -> positive). I don't know its true value, but I do know that after labeling $0.8 \times 40000 = 32000$ samples I have 0 false positives. The number of false positives depends on the value of $P_{np}$ and should be binomially distributed. Assuming this is true, I can compute a confidence interval around my empirical estimate of $P_{np}$. Does this make sense? Can you point me to relevant work in the literature?

Best Answer

Call your false positive rate p, the actual proportion of negatives in your sample r, and n the number of trials observed without coming across a false positive. Given that you've observed no false positives so far, the lower bound of your confidence interval (and indeed the maximum likelihood estimate) is clearly zero. For the upper bound, you can ask: "for what value of p is there a probability $\alpha$ of getting 0 false positives out of my n trials?" Values of p higher than that are not in your $100(1-\alpha)\%$ confidence interval.

The probability of a false positive for any one draw in your sample experiment is $p\times{r}$.

So solve $\alpha=(1-pr)^n$

and you get $p=\frac{1-e^{\frac{\log(\alpha)}{n}}}{r}$
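
For readers who want the intermediate steps, the rearrangement can be written out as follows, taking $\log$ to be the natural logarithm (which is what the exponential in the formula implies):

$$
\alpha = (1-pr)^n
\;\Longrightarrow\;
\alpha^{1/n} = 1-pr
\;\Longrightarrow\;
p = \frac{1-\alpha^{1/n}}{r} = \frac{1-e^{\log(\alpha)/n}}{r}
$$

The last equality only uses $\alpha^{1/n} = e^{\log(\alpha)/n}$.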

With your n=32000 and r=.5 (if I understand your question correctly) this suggests the upper bound of a 95% confidence interval for false positives is 0.0001872245.
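
As a quick check of that figure, here is a minimal Python sketch of the formula above. The function name `fpr_upper_bound` and its argument names are made up for illustration, and the code takes the same model as the answer (independent draws, each a false positive with probability p times r).

```python
import math


def fpr_upper_bound(n, r=0.5, alpha=0.05):
    """Upper bound on the false positive rate p after n draws with zero
    false positives, found by solving alpha = (1 - p*r)**n for p.

    n     -- number of draws with no false positive observed
    r     -- proportion of negatives among those draws
    alpha -- 1 minus the confidence level (0.05 for a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < r <= 1:
        raise ValueError("r must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    # -expm1(x) equals 1 - e**x and stays accurate when x is very close to 0.
    return -math.expm1(math.log(alpha) / n) / r


# Values taken from the answer: n = 32000, r = 0.5, 95% confidence.
bound = fpr_upper_bound(32000, r=0.5, alpha=0.05)

# First-order approximation, valid because log(alpha)/n is tiny:
# 1 - e**x is close to -x, so p is roughly -log(alpha) / (n * r).
approx = -math.log(0.05) / (32000 * 0.5)

print(f"exact bound:   {bound:.10f}")
print(f"approximation: {approx:.10f}")
```

The approximation comes out slightly above the exact bound, and since $-\log(0.05)$ is close to 3, it amounts to roughly 3 divided by the number of negatives seen.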
