The devil is in the details. Had you looked at the help page of po.test, you would have found this:
If lshort is TRUE, then the truncation lag parameter is set to
trunc(n/100), otherwise trunc(n/30) is used.
And in the help page of ca.po:
Usage
ca.po(z, demean = c("none", "constant", "trend"),
lag = c("short", "long"), type = c("Pu", "Pz"), tol = NULL)
...
lag Either a short or long lag number used for variance/covariance
correction.
So you can guess that the number of lags is chosen differently. The code of the functions confirms this hypothesis. From po.test:
if (lshort)
l <- trunc(n/100)
else l <- trunc(n/30)
And from ca.po:
if (lag == "short") {
lmax <- trunc(4 * (nobs/100)^0.25)
}
else if (lag == "long") {
lmax <- trunc(12 * (nobs/100)^0.25)
}
Hence the statistics are actually different and so are the results.
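To see how far apart the two rules actually are, here is a small Python re-implementation of the lag formulas quoted above; the sample size n = 250 is an arbitrary choice for illustration:

```python
from math import trunc

n = 250  # hypothetical sample size

# po.test: lag is a fixed fraction of the sample size
po_short = trunc(n / 100)   # lshort = TRUE
po_long = trunc(n / 30)     # lshort = FALSE

# ca.po: lag grows with the fourth root of the sample size
capo_short = trunc(4 * (n / 100) ** 0.25)   # lag = "short"
capo_long = trunc(12 * (n / 100) ** 0.25)   # lag = "long"

print(po_short, po_long, capo_short, capo_long)  # -> 2 8 5 15
```

Even with both set to "short", po.test would use 2 lags where ca.po uses 5, so the variance/covariance corrections, and hence the test statistics, differ.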
This is not an uncommon situation in testing for unit roots and cointegration. If different statistics give different results, this usually means that something is missing. Also note that in general these statistics do not deal well with structural breaks, so if there are events which might have introduced structural breaks, it would be prudent to take them into account.
Your problem is not an algorithms question. It will not be a well-posed algorithms question until you can precisely define what you mean by "most frequent". At this point, you seem to be asking about what criteria you should use to divide the words into "especially frequent" vs "not especially frequent". That's not an algorithms question. If you knew what criteria you had in mind, then we could talk about algorithms to efficiently categorize the data according to those criteria.
Your choice of criteria for what counts as "especially frequent" probably should be based upon how you are going to use that categorization. Once you've found a list of words that are especially common, what are you going to do with that list? If you can answer that, we may be able to give you some guidance on criteria.
Some example criteria might be:
A word is especially common if it is among the 0.1% most common words. (Algorithm: sort the words by their frequency, keep the top 0.1% words from that list.)
The especially common words are those whose frequency is at least 2$\times$ higher than that of all other words. (Algorithm: sort the words by their frequency and scan the sorted list from most-frequent to least-frequent. For each word, test if its frequency is at least 2$\times$ higher than the frequency of the next word on the list; if so, it is a potential dividing line. Use the last dividing line you saw.)
A word is especially common if its frequency is at least 3 standard deviations above the mean. (Algorithm: scan the words once to compute the mean and standard deviation of the frequencies, then scan a second time to find all that meet the criteria.) (Note: this criterion is probably only relevant if you have some reason to expect the frequencies to be normally distributed. In practice, I would not expect them to be normally distributed, so I would be skeptical of this criterion.)
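The three example criteria above can be sketched in a few lines of Python; the word list and counts here are entirely made up for illustration:

```python
from statistics import mean, stdev

# Toy frequency table (invented data)
freq = {"the": 500, "of": 300, "cat": 40, "dog": 35, "axolotl": 2, "zyzzyva": 1}
by_freq = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

# Criterion 1: the top 0.1% most common words.
# max(1, ...) keeps at least one word, since this toy list is tiny.
k = max(1, len(by_freq) // 1000)
top_slice = [w for w, _ in by_freq[:k]]

# Criterion 2: the last point where a word is at least 2x as frequent
# as the next word in the sorted list.
cut = 0
for i in range(len(by_freq) - 1):
    if by_freq[i][1] >= 2 * by_freq[i + 1][1]:
        cut = i + 1
two_x = [w for w, _ in by_freq[:cut]]

# Criterion 3: frequency at least 3 standard deviations above the mean.
m, s = mean(freq.values()), stdev(freq.values())
three_sigma = [w for w, c in freq.items() if c >= m + 3 * s]
```

On this toy data the criteria disagree sharply: the top-0.1% rule keeps only "the", the 2x rule keeps nearly everything (there are several large gaps, and we take the last one), and the 3-sigma rule keeps nothing, which illustrates the skepticism above about assuming normally distributed frequencies.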
In general, once you know the criteria you have in mind, it is relatively easy to process all the data and identify those words that match the criteria. If you are given all the data in advance, this is pretty easy: you probably won't need any fancy algorithmics.
The place where fancy algorithmic techniques become necessary is in a streaming context (also known as online processing), where you are given each word one at a time, and you don't have enough memory/storage to keep a copy of all the words you've seen. There are sophisticated algorithms for identifying especially common words, in that setting; see, e.g., the heavy hitters algorithm and other streaming algorithms. In each case, though, you will first need to decide for yourself what criteria you want to use to classify words as especially common or not; once you know that, then we can start to consider the algorithm question.
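As an illustration of the streaming setting, here is a minimal sketch of the Misra-Gries heavy-hitters algorithm, one of the standard streaming methods: it keeps at most k - 1 counters regardless of stream length, and any word occurring more than n/k times in a stream of n words is guaranteed to survive (a second pass over the data would remove any false positives):

```python
def misra_gries(stream, k):
    """One-pass heavy-hitters sketch using at most k - 1 counters."""
    counters = {}
    for word in stream:
        if word in counters:
            counters[word] += 1
        elif len(counters) < k - 1:
            counters[word] = 1
        else:
            # Stream element with no counter: decrement everything,
            # dropping counters that reach zero.
            for w in list(counters):
                counters[w] -= 1
                if counters[w] == 0:
                    del counters[w]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 3))  # "a" occurs 5/9 > 1/3 of the time, so it survives
```

Note that the surviving counts are only lower-bound estimates of the true frequencies; the guarantee is about which words survive, not about their exact counts.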
Please see this link: Top 100 R Packages for 2013 (Jan-May), http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/