From a response to comment, we can adopt an urn model. The urn contains 100,000 balls representing all cases. An unknown number of these are black ("invalid"); they are of no interest. We are interested solely in the non-black balls in the urn. Of those, some are of color "A" and others of color "B". The main research question appears to be "what proportion of the balls of interest are A's?"
This urn model says option (2) is the one to use.
A simple random sample (without replacement) of 2,000 balls from this urn yielded 1,000 black balls, 300 A's, and 700 B's, for n = 1,000 A's and B's. The rest is routine. In particular, the number of A's (conditional on a non-black ball being drawn) has a Binomial(1000, p) distribution. A standard estimate of p is #A's / (total A's and B's) = 30%. The estimated variance of a single draw is p(1-p), whence the variance of the estimated proportion of A's equals p(1-p)/n = 0.00021. Its square root, 1.45%, is the standard error of the estimate of p. Because the numbers of A's and B's are large, yet small compared to the expected number of non-black balls (about 50,000), it is appropriate to use normal-theory confidence intervals and to ignore the correction for sampling without replacement. (The correction shrinks the confidence interval to 0.99 times its width.) A 99% two-sided confidence interval therefore extends 2.58 * 1.45% = 3.73% to either side of the estimated proportion. For example, a confidence interval for the proportion of A's (out of all the A's and B's in the urn) extends from 26.27% to 33.73%.
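The arithmetic above can be sketched in a few lines of Python, using the counts from the question:

```python
from math import sqrt

# Counts from the sample described above.
a_count, b_count = 300, 700    # balls of interest in the sample
n = a_count + b_count          # 1000 A's and B's

p_hat = a_count / n                      # estimated proportion of A's: 0.30
var = p_hat * (1 - p_hat) / n            # 0.00021
se = sqrt(var)                           # standard error, about 0.0145
z = 2.576                                # 99% two-sided normal quantile
ci = (p_hat - z * se, p_hat + z * se)    # about (0.2627, 0.3373)
print(p_hat, se, ci)
```

As noted above, the finite-population correction is small enough here to be ignored.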
If you are uncomfortable using conditional probabilities (which are at the root of this analysis), you can instead estimate the full contents of the urn (i.e., the total numbers of black balls, A's, and B's) using the multinomial distribution. You will get exactly the same results, because in the end you care only about the proportion of A's relative to the total number of A's and B's, so the estimates involving the number of black balls never enter the calculation.
Another way to get some intuition is to recognize that (except for the tiny correction term being neglected here) the size of the confidence interval depends only on the observed numbers of A's and B's and not on the number of balls in the urn. That's why there's no concern here about whether the "population" is 50,000 or 100,000.
An auxiliary research question seems to be to estimate the total number of A's and B's in the urn. For this purpose the urn contains only two kinds of balls, black ones and non-black ones, and we want to estimate the number of non-black balls. This is a standard binomial sampling situation. Without more ado, the estimated number of non-black balls equals 100,000 * (1000/2000) = 50,000 and the estimated proportion is 1/2, with standard error $\sqrt{(1/2)(1 - 1/2)/2000}$ = 1.1%. Therefore the estimate of 50,000 has a 99% two-sided confidence interval of 50,000 ± 2.58 * 1.1% * 100,000, i.e., from about 47,120 to 52,880.
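The same calculation, as a short Python sketch:

```python
from math import sqrt

# Counts from the sample: 1,000 of the 2,000 drawn balls were non-black.
population = 100_000
sample = 2_000
nonblack = 1_000

p_hat = nonblack / sample                   # 0.5
se = sqrt(p_hat * (1 - p_hat) / sample)     # about 0.0112 (1.1%)
z = 2.576                                   # 99% two-sided normal quantile
estimate = population * p_hat               # 50,000 non-black balls
half_width = z * se * population            # about 2,880
print(estimate, estimate - half_width, estimate + half_width)
```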
If one wants to perform an A/B test with a small base rate (and not just for fun), one has to ask what effect size, i.e., which absolute improvement, is considered to be worth the effort.
For example, if $p = 10^{-6}$ and there are $10^6$ visitors per month, then even a relative improvement of 500 % means an absolute improvement of only 4 more conversions per month on average. If such differences cannot be justified with monetary arguments (e.g., the website is selling trips to space), an A/B test is not worth the trouble.
However, if such differences are considered worth the effort, I suggest breaking down / decomposing the conversion rate into its participating factors. For example, say one measures conversions as $\frac{boughtSpaceTrips}{siteVisitors}$. This rate can be split into ...
$\frac{boughtSpaceTrips}{siteVisitors} = \frac{boughtSpaceTrips}{spaceTripsInBasket} * \frac{spaceTripsInBasket}{siteVisitors}$
This decomposition may allow one to detect differences in one of the decomposed ratios that do not appear in the composed ratio, either because they are countered by the other ratios (negative correlation) or because they have such a small contribution weight that detecting them in the composed ratio requires the aforementioned ton of data. Whether there is some sort of negative correlation between the decomposed factors can be decided by applying domain knowledge, for example, by asking how much it "costs" the user to perform a certain action.
In the given constructed example, the reasoning
Improve $\frac{boughtSpaceTrips}{spaceTripsInBasket}$ => Improve $\frac{boughtSpaceTrips}{siteVisitors}$
is valid, but the other way around
Improve $\frac{spaceTripsInBasket}{siteVisitors}$ => Improve $\frac{boughtSpaceTrips}{siteVisitors}$
is not.
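A small Python sketch, with made-up funnel counts, illustrates why the second implication fails: the basket rate can improve while the checkout rate worsens, leaving the composed rate unchanged.

```python
# Hypothetical funnel counts for a control and a variant (made-up numbers).
control = dict(siteVisitors=1_000_000, spaceTripsInBasket=2_000, boughtSpaceTrips=10)
variant = dict(siteVisitors=1_000_000, spaceTripsInBasket=4_000, boughtSpaceTrips=10)

def rates(c):
    basket_rate = c["spaceTripsInBasket"] / c["siteVisitors"]
    checkout_rate = c["boughtSpaceTrips"] / c["spaceTripsInBasket"]
    overall_rate = c["boughtSpaceTrips"] / c["siteVisitors"]
    return basket_rate, checkout_rate, overall_rate

# The variant doubles the basket rate, but the checkout rate halves,
# so the composed conversion rate is unchanged: the effect is visible
# only in the decomposed factors.
print(rates(control))
print(rates(variant))
```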
If the decomposition does not lead to more feasible base rates, then take a look at the statistical literature for this kind of problem (keyword: "rare events"). But in that case you go beyond the scope of normal A/B tests, so I would again ask whether this is worth the effort. As an aside, my intuition tells me that one cannot avoid the pillars of the universe: rare events still require a lot of data (though maybe not a ton), no matter which fancy method is applied (domain knowledge may help a lot, though).
OK, so this answer might not be exactly what you were after given the detail of your question, but I stumbled across your question based on just the title, so this might help other people who come across it in a similar fashion.
The only way I know of determining sample size using a bootstrap is via a power analysis approach. That is, you:

1. Pick a candidate sample size n.
2. Resample (with replacement) n observations per group from your pilot data.
3. Run your hypothesis test on the resampled data and record whether it rejects the null hypothesis.
4. Repeat steps 2-3 many times and compute the fraction of resamples in which the test rejected.

With many possible "variations on a theme of..." the above.
And that gives you the statistical power (for that sample size and that particular test), because the definition of statistical power is "probability that the test will reject the null hypothesis when the alternative hypothesis is true". So you can then vary the sample size until you achieve the desired power.
Here's an approach I implemented in R, based on the paper Sample Size / Power Considerations by Elizabeth Colantuoni.
I had two groups of non-normal data calling for a non-parametric test. A pilot study of each showed them to have differing medians, and a Mann-Whitney-Wilcoxon test rejected the null hypothesis that they were the same, but I wanted to determine the sample size required so I could say this for "sure". Since the test already rejected the null hypothesis on the pilot data, I did not see any need to shift or manipulate the data to ensure the alternative hypothesis was true.
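The R code isn't reproduced here, but the same bootstrap power procedure can be sketched in Python. The pilot data below are made up (log-normal, standing in for two non-normal groups); `bootstrap_power` and its parameters are illustrative names, not from the original answer.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Made-up pilot data standing in for the two non-normal groups.
pilot_a = rng.lognormal(mean=0.0, sigma=1.0, size=30)
pilot_b = rng.lognormal(mean=0.8, sigma=1.0, size=30)

def bootstrap_power(a, b, n, n_boot=500, alpha=0.05, rng=rng):
    """Fraction of bootstrap resamples of size n (per group) in which
    the Mann-Whitney test rejects the null at level alpha."""
    rejections = 0
    for _ in range(n_boot):
        resample_a = rng.choice(a, size=n, replace=True)
        resample_b = rng.choice(b, size=n, replace=True)
        if mannwhitneyu(resample_a, resample_b, alternative="two-sided").pvalue < alpha:
            rejections += 1
    return rejections / n_boot

# Vary the candidate sample size until the estimated power reaches the target.
for n in (10, 20, 40, 80):
    print(n, bootstrap_power(pilot_a, pilot_b, n))
```

The estimated power should climb toward 1 as n grows; you stop at the smallest n that meets your desired power.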
Necessary disclaimer: I'm not a statistician and I'm still learning about bootstrapping so feedback, corrections and pointing and laughing are welcome.