Solved – Relation between precision, recall and sample size

Tags: descriptive statistics, machine learning, precision-recall, sample-size, sampling

I have a large data set for a binary classification problem. To fit a model to the data, I have been training on samples of various sizes, and each sample size gives different precision and recall values. One option is to select the model with the highest precision or recall among those trained on a sample larger than a certain threshold. How do I determine this sample-size threshold, and more generally, what is the relation between precision, recall and sample size?

Best Answer

You can calculate the sample-size threshold needed for a given precision/recall assurance. Refer to this article: Statistics: An Introduction to sample size calculations

Edit:

As documented in the linked article,

There are two approaches to sample size calculations:

  • Precision-based: With what precision do you want to estimate the proportion, mean difference . . . (or whatever it is you are measuring)?
  • Power-based: How small a deviation from hypothesis is important to detect and with what degree of certainty? The smaller the difference you regard as important to detect, the greater the sample size required.

Suppose you want to be able to estimate your unknown parameter with a certain degree of precision. What you are essentially saying is that you want your confidence interval to be a certain width. In general a 95% confidence interval is given by the formula:

Estimate ± 2 (approx.) × SE

where SE is the standard error of whatever you are estimating. 95% confidence intervals are usually based on the normal distribution or a t-distribution; for a normal distribution the value is 1.96; for t-distributions the value is generally just over 2.
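To connect this back to the question: precision and recall are both proportions, so the interval above can be applied to them directly. The following is a minimal sketch, assuming the simple normal-approximation (Wald) standard error sqrt(p(1 − p)/n) for a proportion; the function name `wald_interval` and the counts used are illustrative only and do not come from the article. For precision, n would be the number of predicted positives; for recall, the number of actual positives.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Assumption: SE = sqrt(p * (1 - p) / n), which is only reasonable
    when n is fairly large and p is not close to 0 or 1.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Hypothetical counts: the same observed precision of 0.80 at three sample sizes.
for true_positives, predicted_positives in [(80, 100), (800, 1000), (8000, 10000)]:
    low, high = wald_interval(true_positives, predicted_positives)
    print(predicted_positives, round(low, 3), round(high, 3))
```

The point estimate is the same in all three cases, but the interval narrows roughly in proportion to 1/sqrt(n), which is why precision and recall measured on small samples vary so much from run to run.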

The formula for any standard error always contains n, the sample size. Therefore, if you specify the width of the 95% confidence interval, you have a formula that you can solve to find n.
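As an illustration of solving for n, here is a hedged sketch for the case where the quantity being estimated is a proportion such as precision or recall. It assumes the normal-approximation standard error sqrt(p(1 − p)/n), so a desired half-width E gives n = z² · p(1 − p) / E². The function name, the planning value `p_guess`, and the example numbers are assumptions for illustration, not part of the linked article.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p_guess, half_width, confidence=0.95):
    """Smallest n whose normal-approximation CI has the requested half-width.

    p_guess is a prior guess at the proportion; 0.5 is the most
    conservative choice because it maximises p * (1 - p).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p_guess * (1 - p_guess) / half_width**2)


# Worst case: nothing known about the proportion, want +/- 0.05 at 95%.
print(sample_size_for_proportion(0.5, 0.05))  # 385

# Expecting precision near 0.9 and wanting +/- 0.02 at 95%.
print(sample_size_for_proportion(0.9, 0.02))
```

Note that n here counts the cases in the denominator of the metric: predicted positives for precision, actual positives for recall. With an imbalanced data set, the total sample needed to reach that many positives can be much larger.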


Power-based sample size calculations relate to hypothesis testing.

As a matter of good scientific practice, a significance level is chosen before data collection and is often set to 0.05 (5%). This significance level, denoted by α, represents the conditional probability of type I error.

Suppose you want to compare the mean in one group to the mean in another (i.e. carry out an unpaired t-test). The number, n, required in each group is given by

n = f(α, β) · 2s^2/δ^2

Where:

  • α is the significance level (using a two-sided test), i.e. your cut-off for regarding the result as statistically significant.
  • 1 − β is the statistical power of your test.
  • f(α, β) is a value calculated from α and β; see the table for f(α, β) in the linked article.
  • δ is the smallest difference in means that you regard as important to detect.
  • s is the standard deviation of whatever is being measured; this will need to be estimated from previous studies.
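The formula n = f(α, β) · 2s^2/δ^2 can be sketched in code as follows. This assumes the usual normal-approximation form f(α, β) = (z_{1−α/2} + z_{1−β})², which gives about 7.85 for α = 0.05 with 80% power and about 10.5 with 90% power; the article's table should be checked against these values before relying on it. The function names and the example values of s and δ are hypothetical.

```python
import math
from statistics import NormalDist


def f_alpha_beta(alpha, power):
    """Assumed form of f: squared sum of two standard normal quantiles."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return (z_alpha + z_beta) ** 2


def n_per_group(alpha, power, s, delta):
    """Per-group sample size for comparing two means, rounded up."""
    return math.ceil(f_alpha_beta(alpha, power) * 2 * s**2 / delta**2)


# Hypothetical inputs: s = 10, smallest important difference delta = 5.
print(round(f_alpha_beta(0.05, 0.80), 2))                # 7.85
print(n_per_group(alpha=0.05, power=0.80, s=10, delta=5))  # 63
```

Halving δ quadruples the required n, which matches the earlier statement that smaller differences need larger samples.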

Similar formulae can be obtained for other types of analysis by reference to appropriate texts.