Based on a response to a comment, we can adopt an urn model. The urn contains 100,000 balls representing all cases. An unknown number of these are black ("invalid"); they are of no interest. We are interested solely in the non-black balls in the urn. Of those, some are of color "A" and others of color "B". The main research question appears to be "what proportion of the balls of interest are A's?"
This urn model says option (2) is the one to use.
A simple random sample (without replacement) of 2,000 balls from this urn yielded 1,000 black balls, 300 A's, and 700 B's, for n = 1,000 A's & B's. The rest is routine. In particular, the distribution of the number of A's (conditional on the 1,000 non-black balls drawn) is Binomial(1000, p). A standard estimate of p is #A's / (Total A's & B's) = 30%. The estimated variance of a single draw is p(1-p), whence the variance of the estimated proportion of A's equals p(1-p)/n = 0.00021. Its square root, 1.45%, is the standard error of the estimate of p. Because the numbers of A's and B's are large, yet are small compared to the expected number of non-black balls (about 50,000), it is appropriate to use normal-theory confidence intervals and to ignore the correction for sampling without replacement. (The correction shrinks the confidence interval to 0.99 times its width.) A 99% two-sided confidence interval therefore extends 2.58 * 1.45% = 3.73% to either side of the estimated proportion. E.g., a confidence interval for the proportion of A's (out of all the A's and B's in the urn) extends from 26.27% to 33.73%.
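The arithmetic above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are mine, and the figure of 50,000 non-black balls in the urn is the rough expectation quoted above, not a known quantity.

```python
from math import sqrt
from statistics import NormalDist

a_count, b_count = 300, 700            # A's and B's observed in the sample
n = a_count + b_count                  # non-black balls drawn
p_hat = a_count / n                    # estimated proportion of A's
se = sqrt(p_hat * (1 - p_hat) / n)     # standard error of p_hat

z = NormalDist().inv_cdf(1 - 0.01 / 2) # two-sided 99% critical value
half_width = z * se
lower, upper = p_hat - half_width, p_hat + half_width

# Finite-population correction factor, ASSUMING roughly 50,000
# non-black balls in the urn (an estimate, not a known value).
n_interest = 50_000
fpc = sqrt((n_interest - n) / (n_interest - 1))

print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}")
print(f"99% CI: {lower:.4f} to {upper:.4f}")
print(f"FPC factor: {fpc:.3f}")
```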
If you are uncomfortable using conditional probabilities (which are at the root of this analysis), you can estimate the contents of the urn (i.e., total numbers of black balls, A's, and B's) using the multinomial distribution. You will get exactly the same results, because in the end you care only about the proportion of A's relative to the numbers of A's and B's, so no estimate involving the number of black balls ever enters the calculation.
Another way to get some intuition is to recognize that (except for the tiny correction term being neglected here) the size of the confidence interval depends only on the observed numbers of A's and B's and not on the number of balls in the urn. That's why there's no concern here about whether the "population" is 50,000 or 100,000.
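To see how little the population size matters, here is a small sketch that recomputes the 99% half-width with and without a finite-population correction. The helper `half_width` and its `population` argument are hypothetical names introduced for illustration; `population` stands for an assumed number of balls of interest.

```python
from math import sqrt
from statistics import NormalDist

def half_width(a, b, population=None, level=0.99):
    """Normal-theory CI half-width for the proportion a / (a + b).

    If `population` (an assumed count of balls of interest) is given,
    the finite-population correction is applied.
    """
    n = a + b
    p = a / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    hw = z * sqrt(p * (1 - p) / n)
    if population is not None:
        hw *= sqrt((population - n) / (population - 1))
    return hw

# No correction, then two assumed population sizes.
for size in (None, 100_000, 50_000):
    print(size, f"{100 * half_width(300, 700, size):.2f}%")
```

The three half-widths differ only in the second decimal place of a percentage, which is the sense in which the interval depends on the observed counts and not on the urn.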
An auxiliary research question seems to be to estimate the total number of A's and B's in the urn. For this purpose the urn contains only two kinds of balls, black ones and non-black ones, and we want to estimate the number of non-black balls. This is a standard binomial sampling situation. Without more ado, the estimated number of non-black balls equals 100,000 * (1000/2000) = 50,000 and the estimated proportion is 1/2, with standard error $\sqrt{(1/2)(1 - 1/2)/2000}$ = 1.1%. A 99% two-sided confidence interval for the proportion therefore extends about 2.58 * 1.1% = 2.88% to either side of 1/2 (using the unrounded standard error); multiplying by the 100,000 balls in the urn, the estimate of 50,000 has a 99% two-sided confidence interval from about 47,120 to 52,880.
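The same calculation for the number of non-black balls can be sketched as follows. The variable names are mine; the interval is the normal-theory one, with no finite-population correction, scaled up from the proportion to the urn of 100,000.

```python
from math import sqrt
from statistics import NormalDist

urn_size = 100_000      # total balls in the urn
sample_size = 2_000     # balls drawn
non_black = 1_000       # non-black balls (A's and B's) in the sample

q_hat = non_black / sample_size                  # estimated proportion non-black
se_q = sqrt(q_hat * (1 - q_hat) / sample_size)   # standard error of q_hat
z = NormalDist().inv_cdf(1 - 0.01 / 2)           # two-sided 99% critical value

est_total = urn_size * q_hat                     # estimated non-black count
half_width_total = urn_size * z * se_q           # half-width on the count scale
lower_total = est_total - half_width_total
upper_total = est_total + half_width_total

print(f"estimate = {est_total:,.0f}, SE of proportion = {se_q:.4f}")
print(f"99% CI for the count: {lower_total:,.0f} to {upper_total:,.0f}")
```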
Best Answer
If there is just one category of interest, e.g. lemons, there's no problem with "binarizing" it and extracting a CI for the population proportion who like lemons. However, if you do this for each category, you might end up with something misleading: the numbers of responses for the categories are dependent and you are making multiple probability statements, so the probability that all the proportions lie in their intervals at once could be very different from what you might expect. You could report a "confidence region," or "simultaneous confidence intervals," that capture this dependency better. See Sison and Glaz.
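As a rough illustration of what "simultaneous" means, here is a sketch that uses a Bonferroni adjustment. To be clear, this is not the Sison and Glaz method; it is a simpler and more conservative stand-in. The function name and the response counts are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def bonferroni_intervals(counts, level=0.95):
    """Bonferroni-adjusted normal-theory intervals for each category.

    Each of the k intervals is built at level 1 - alpha/k so that all k
    hold jointly with probability of at least about `level`.
    """
    n = sum(counts.values())
    k = len(counts)
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    intervals = {}
    for name, count in counts.items():
        p = count / n
        hw = z * sqrt(p * (1 - p) / n)
        intervals[name] = (max(0.0, p - hw), min(1.0, p + hw))
    return intervals

# Hypothetical survey counts, invented purely for illustration.
example_counts = {"lemons": 120, "limes": 80, "oranges": 200}
for name, (lo, hi) in bonferroni_intervals(example_counts).items():
    print(f"{name}: {lo:.3f} to {hi:.3f}")
```

Each interval is wider than the one you would get by treating that category alone, which is the price of making a joint statement about all of them.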