Solved – Computing confidence intervals for count data


I have a representative sample taken on an area of 60 m² (square meters) out of a territory of 54077 m². The sample records the number of small plants found in each square meter. The sample is defined in R as:

s=c(13,7,10,4,28,0,10,0,0,0,0,0,0,0,0,0,6,
    0,0,0,0,0,0,0,4,0,0,0,4,0,0,0,1,2,2,0,
    2,3,3,3,1,3,12,33,1,31,0,1,21,0,3,1,8,
    0,1,1,6,0,2,0)

The sum of s is 227.

To compute the CI for the mean of s I used t.test (the data are not normally distributed, but that is not my problem here).

t.test(s)
t = 3.9606, df = 59, p-value = 0.0002039
mean of s =  3.783333 - 95 % confidence interval:  1.871905 - 5.694762

My question is:

Can I assume that the CI for the number of plants in the entire territory is between
$1.871905\times 227\times (54077/60)$ and $5.694762\times 227\times (54077/60)$?

I think NO, because the calculation is very simplified, but I hope YES.

Best Answer

As Hans Engler suggested, a bootstrap should work well for these data. You can use the boot or bootstrap packages directly, but it’s much easier to use the simpleboot package:

s = c(13,7,10,4,28,0,10,0,0,0,0,0,0,0,0,0,6,
      0,0,0,0,0,0,0,4,0,0,0,4,0,0,0,1,2,2,0,
      2,3,3,3,1,3,12,33,1,31,0,1,21,0,3,1,8,
      0,1,1,6,0,2,0)

library(simpleboot)
b = one.boot(s, mean, R=10^4)
boot.ci(b, type="perc")

The output is:

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = b, type = "perc")

Intervals : 
Level     Percentile     
95%   ( 2.10,  5.85 )  
Calculations and Intervals on Original Scale

As you can see (and as one might have expected), the confidence interval is not very different from the confidence interval from the t-test (1.87–5.69).
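For readers without R, the same percentile bootstrap can be sketched in plain Python. This is only an illustration, not the simpleboot implementation: the helper name `percentile_bootstrap_ci`, the fixed seed, and the simple index-based percentile rule are assumptions made here. The last step also sketches one way to carry the per-square-meter interval over to the whole territory, assuming the 60 sampled square meters are representative and ignoring any finite-population correction: multiply both bounds by the total area, 54077. The extra factor 227/60 in the question's formula is itself the sample mean, so it would be counted twice.

```python
import random
import statistics

# Plant counts per square meter, copied from the question (60 values).
s = [13, 7, 10, 4, 28, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6,
     0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 1, 2, 2, 0,
     2, 3, 3, 3, 1, 3, 12, 33, 1, 31, 0, 1, 21, 0, 3, 1, 8,
     0, 1, 1, 6, 0, 2, 0]

TOTAL_AREA = 54077  # square meters in the whole territory (from the question)


def percentile_bootstrap_ci(data, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    alpha = 1 - level
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


mean_per_m2 = statistics.fmean(s)
lo, hi = percentile_bootstrap_ci(s)

# Scale the per-square-meter bounds to the whole territory (assumption:
# the sampled square meters are representative of the territory).
total_lo, total_hi = lo * TOTAL_AREA, hi * TOTAL_AREA

print(f"mean per m2: {mean_per_m2:.3f}, 95% CI: ({lo:.2f}, {hi:.2f})")
print(f"estimated total plants: between {total_lo:.0f} and {total_hi:.0f}")
```

The exact bounds depend on the random seed and the number of replicates, so they will differ slightly from the R output above.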

Related Solutions

Confidence Interval – How to Interpret Confidence Intervals?

I think the fundamental problem is that frequentist statistics can only assign a probability to something that can have a long run frequency. Whether the true value of a parameter lies in a particular interval or not doesn't have a long run frequency, because we can only perform the experiment once, so you can't assign a frequentist probability to it. The problem arises from the definition of a probability. If you change the definition of a probability to a Bayesian one, then the problem instantly disappears, as you are no longer tied to discussion of long run frequencies.

See my (rather tongue-in-cheek) answer to a related question here:

"A Frequentist is someone that believes probabilities represent long run frequencies with which events occur; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population."

In the case of a confidence interval, the question we normally would like to ask (unless we have a problem in quality control, for example) is "given this sample of data, return the smallest interval that contains the true value of the parameter with probability X". However, a frequentist can't do this, as the experiment is only performed once and so there are no long run frequencies that can be used to assign a probability. So instead the frequentist has to invent a population of experiments (that you didn't perform) from which the experiment you did perform can be considered a random sample. The frequentist then gives you an indirect answer about that fictitious population of experiments, rather than a direct answer to the question you really wanted to ask about a particular experiment.

Essentially it is a problem of language: the frequentist definition of a probability simply doesn't allow discussion of the probability of the true value of a parameter lying in a particular interval. That doesn't mean frequentist statistics are bad, or not useful, but it is important to know the limitations.
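The "population of experiments" can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings (normal data, known standard deviation, a hypothetical `coverage` helper, arbitrary true mean and sample size): it repeats the experiment many times and counts how often the 95% interval covers the true mean. The resulting long run frequency is a property of the procedure, not of any single interval.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=10.0, sigma=2.0, n=25, n_experiments=5000,
             level=0.95, seed=1):
    """Fraction of repeated experiments whose interval covers true_mean."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    # Two-sided normal critical value (about 1.96 for a 95% interval).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5  # sigma is assumed known here
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # Each individual interval either covers the true mean or it does not.
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / n_experiments


rate = coverage()
print(f"long run coverage over repeated experiments: {rate:.3f}")
```

The printed rate should be close to 0.95, while each individual interval in the loop simply covers the true mean or misses it.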

Regarding the major update

I am not sure we can say that "Before we calculate a 95% confidence interval, there is a 95% probability that the interval we calculate will cover the true parameter." within a frequentist framework. There is an implicit inference here that the long run frequency with which the true value of the parameter lies in confidence intervals constructed by some particular method is also the probability that the true value of the parameter will lie in the confidence interval for the particular sample of data we are going to use. This is a perfectly reasonable inference, but it is a Bayesian inference, not a frequentist one, as the probability that the true value of the parameter lies in the confidence interval that we construct for a particular sample of data has no long run frequency, as we only have one sample of data. This is exactly the danger of frequentist statistics: common sense reasoning about probability is generally Bayesian, in that it is about the degree of plausibility of a proposition.

We can, however, "make some sort of non-frequentist argument that we're 95% sure the true parameter will lie in [a,b]"; that is exactly what a Bayesian credible interval is, and for many problems the Bayesian credible interval coincides exactly with the frequentist confidence interval.
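One textbook case where the two intervals coincide is a normal mean with known standard deviation and a flat (improper) prior: the posterior for the mean is then normal, centered on the sample mean, with standard deviation equal to the standard error. The sketch below checks this numerically; the data values and the known `sigma` are made up purely for illustration.

```python
from statistics import NormalDist, fmean

sigma = 2.0  # standard deviation, assumed known for this illustration
data = [9.1, 10.4, 11.2, 8.7, 10.9, 9.8, 10.3, 11.5]  # made-up measurements

n = len(data)
xbar = fmean(data)
se = sigma / n ** 0.5  # standard error of the mean

# Frequentist 95% confidence interval: xbar +/- z * se.
z = NormalDist().inv_cdf(0.975)
conf_int = (xbar - z * se, xbar + z * se)

# Bayesian 95% credible interval under a flat prior on the mean:
# the posterior is Normal(xbar, se), so take its 2.5% and 97.5% quantiles.
posterior = NormalDist(mu=xbar, sigma=se)
cred_int = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print("confidence interval:", conf_int)
print("credible interval:  ", cred_int)
```

The two printed intervals agree up to floating-point error, although the statements they support differ: one is about the long run behavior of the procedure, the other about the plausibility of the parameter given this sample.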

"I don't want to make this a debate about the philosophy of probability", sadly this is unavoidable, the reason you can't assign a frequentist probability to whether the true value of the statistic lies in the confidence interval is a direct consequence of the frequentist philosophy of probability. Frequentists can only assign probabilities to things that can have long run frequencies, as that is how frequentists define probability in their philosophy. That doesn't make frequentist philosophy wrong, but it is important to understand the bounds imposed by the definition of a probability.

"Before I've entered the password and seen the interval (but after the computer has already calculated it), what's the probability that the interval will contain the true parameter? It's 95%, and this part is not up for debate:" This is incorrect, or at least in making such a statement, you have departed from the framework of frequentist statistics and have made a Bayesian inference involving a degree of plausibility in the truth of a statement, rather than a long run frequency. However, as I have said earlier, it is a perfectly reasonable and natural inference.

Nothing has changed before or after entering the password, because neither event can be assigned a frequentist probability. Frequentist statistics can be rather counter-intuitive, as we often want to ask questions about degrees of plausibility of statements regarding particular events, but this lies outside the remit of frequentist statistics, and this is the origin of most misinterpretations of frequentist procedures.

Solved – Confidence Intervals for AUC using cross-validation

Here is a sample of how you would do it in Python.

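import math

# cross_val_score now lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed).
from sklearn.model_selection import cross_val_score

# your_model, your_data and y are placeholders for your estimator, features and labels.
# scoring="roc_auc" makes each fold score an AUC; the default is the estimator's own score.
scores = cross_val_score(your_model, your_data, y, cv=10, scoring="roc_auc")
mean_score = scores.mean()
std_dev = scores.std(ddof=1)  # sample standard deviation across the folds
std_error = std_dev / math.sqrt(scores.shape[0])
ci = 2.262 * std_error  # 2.262 = two-sided 95% t critical value for 9 df (10 folds)
lower_bound = mean_score - ci
upper_bound = mean_score + ci

print("Score is %f +/- %f" % (mean_score, ci))
print("95 percent probability that if this experiment were repeated over and "
      "over the average score would be between %f and %f" % (lower_bound, upper_bound))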
