First question. You are right about being able to use software instead of tables
of the chi-squared distribution. For example, if df = 9 and the
chi-squared statistic is 20.16, you could look at a chi-squared
table to see that $20.16 > 19.02,$ where 19.02 cuts area 0.025
from the upper tail of $Chisq(df = 9)$.
You would reject at the 2.5% level.
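For reference, R can reproduce that tabled critical value via qchisq, the inverse CDF of the chi-squared distribution:

qchisq(0.975, 9)   # cuts area 0.025 from the upper tail; about 19.02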
If you wanted a P-value, you could use software
to find the probability of the chi-squared statistic being
greater than 20.16. In R this is computed as follows, where pchisq
is the CDF of a chi-squared distribution:
1 - pchisq(20.16, 9)   # P(X > 20.16) for X ~ Chisq(df = 9)
## 0.01695026
Thus the P-value (probability of a value more extreme than 20.16)
is about 0.017. Some software will give you the P-value automatically.
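Equivalently, pchisq accepts a lower.tail argument that returns the
upper-tail probability directly, avoiding the subtraction:

pchisq(20.16, 9, lower.tail=FALSE)   # same value, about 0.017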
Second question. As far as binning is concerned, you are right that in some
instances there are alternative ways of binning. You do not
want so many bins that the expected count in any bin falls below
about 5; otherwise the chi-squared statistic is not well approximated
by the chi-squared distribution. Given that restriction,
it is usually better to use more bins rather than fewer.
Also notice
that the df of the chi-squared distribution depends directly on
the number of $bins$ used, not on the overall number of $events$ counted.
(I do not understand what you say about 'approximately Gaussian'
in this context.)
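For instance, here is a minimal sketch (with simulated data, not your
problem's data) that bins 100 standard normal observations into 6
equal-probability bins, so each expected count is 100/6, comfortably
above 5; the test then has df = 6 - 1 = 5 regardless of the sample size:

x = rnorm(100)                    # simulated standard normal sample
brks = qnorm(seq(0, 1, by=1/6))   # bin edges; each bin has probability 1/6
chisq.test(table(cut(x, brks)))   # expected counts 100/6 each; df = 5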
Examples: Here is an example in which we simulate 60 rolls of a fair die, so that we expect 10 instances of each face. The observed numbers
of each face are tabulated. Finally, a chi-squared test of the hypothesis
that the die is fair gives a goodness-of-fit statistic of 3.0
and a P-value of about 70% (consistent with a fair die).
face = sample(1:6, 60, replace=TRUE)  # simulate 60 rolls of a fair die
table(face)
## face
## 1 2 3 4 5 6
## 9 6 12 10 10 13
chisq.test(table(face))
## Chi-squared test for given probabilities # default is equal probabilities
## data: table(face)
## X-squared = 3, df = 5, p-value = 0.7
In the test, the default is that faces have equal probabilities
unless some other probability vector is specified. The test procedure
chisq.test
finds the P-value as follows (and rounds):
1 - pchisq(3, 5)
## 0.6999858
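You can also reproduce the statistic itself from the tabulated counts,
since each expected count is 10:

obs = c(9, 6, 12, 10, 10, 13)   # observed counts from the table above
sum((obs - 10)^2 / 10)          # X^2 = sum of (O - E)^2 / E
## 3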
In our second example, we simulate 600 rolls of a die that
is heavily biased in favor of faces 4, 5, and 6 (see prob
vector). Here
the null hypothesis that the die is fair is soundly rejected
with an extremely small P-value.
face = sample(1:6, 600, replace=TRUE, prob=c(1,1,1,2,2,2)/9)  # biased toward faces 4, 5, 6
table(face)
## face
## 1 2 3 4 5 6
## 59 67 80 123 135 136
chisq.test(table(face))
## Chi-squared test for given probabilities # default is test for 'fair' die
## data: table(face)
## X-squared = 62.2, df = 5, p-value = 4.263e-12
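The same hand computation confirms the statistic, now with expected
counts of 100 per face under the null hypothesis:

obs = c(59, 67, 80, 123, 135, 136)   # observed counts from the table above
sum((obs - 100)^2 / 100)             # (O - E)^2 / E summed over the six faces
## 62.2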
Neither of the offered answers improves on the main bottleneck, which is the $O(n\log n)$ sorting running time (for an array of size $n$).
One can use a clever selection algorithm, which only partially sorts the data, to avoid the full sort and end up with a linear-time computation.
Most of the time this is used to compute the median (the 50th percentile), but it works for arbitrary order statistics. Because it makes heavy use of the main partitioning routine behind QuickSort, it is called QuickSelect. See the Wikipedia article on selection algorithms for a full discussion of the field.
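In R, one way to sketch this idea is sort with its partial argument,
which only guarantees that the requested order statistic lands in the
right position rather than sorting everything (the rank convention
ceiling(0.9 n) below is just one of several percentile definitions):

x = runif(1e6)                  # simulated data
k = ceiling(0.90 * length(x))   # rank of the 90th percentile
sort(x, partial=k)[k]           # k-th smallest value, without a full sort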
Best Answer
Once you have an approximation to the $90$th percentile, for example, you can make one pass through the data and find the exact value. Set a range that you are sure the actual $90$th percentile lies in. Read the data, counting the values that fall below your range, retaining the values in your range, and counting the values above your range. The better your approximation, the narrower the range can be and the fewer values you need to save. Then using the counts, you can find the rank in your range that is the exact $90$th percentile.
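Here is a minimal sketch of that counting pass, assuming a bracket [lo, hi] believed to contain the target (the function name exact_p90 and the bracket endpoints are hypothetical, and ceiling(0.9 n) is again just one rank convention):

exact_p90 = function(x, lo, hi) {
  below = sum(x < lo)                 # count of values below the bracket
  kept  = x[x >= lo & x <= hi]        # retain only the in-bracket values
  k     = ceiling(0.90 * length(x))   # overall rank of the 90th percentile
  stopifnot(k > below, k <= below + length(kept))  # bracket must cover the target
  sort(kept)[k - below]               # its rank among the retained values
}

The narrower the bracket, the smaller kept is, so storage shrinks as the initial approximation improves.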