Solved – Algorithms to find most frequent data

algorithms, r

I want to find the most frequent words in cases where some words stand out. For each case I have two pieces of data: the word and the number of times it appears.

I want to know which algorithms and techniques can be used to find the most frequent data.

At the moment, I'm discarding values for data normalization (anything beyond mean + standard deviation $\times$ 3), but I think there are other techniques.
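For reference, a minimal sketch of that cutoff in R, assuming a data frame `df` with columns `word` and `count` (hypothetical names, since the question doesn't show its data):

```r
# Hypothetical data: one row per word, with its raw count.
# df <- data.frame(word = c("the", "and", "foo"), count = c(9000, 7000, 3))

cutoff <- mean(df$count) + 3 * sd(df$count)  # mean + 3 standard deviations
frequent <- df[df$count > cutoff, ]          # words beyond the cutoff
```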

What can I use?

Edit: I know what I want, but I don't know what is possible, because I don't know much about statistics. I have a data frame in R that contains words and the number of times each word appears. The data comes from a large database (millions of rows).

What I want to find are the words that are much more frequent than others. For example, I have a list of common words that come after commas. I want to distinguish the genuinely common words from the words that appear often only because there is a lot of data.

Pointers to frequency algorithms/formulas are what I want (I'm a software developer, not a statistician, and I don't know the right names).

Best Answer

Your problem is not an algorithms question. It will not be a well-posed algorithms question until you can precisely define what you mean by "most frequent". At this point, you seem to be asking about what criterion you should use to divide the words into "especially frequent" vs. "not especially frequent". That's not an algorithms question. If you knew what criterion you had in mind, then we could talk about algorithms to efficiently categorize the data according to it.

Your choice of criteria for what counts as "especially frequent" probably should be based upon how you are going to use that categorization. Once you've found a list of words that are especially common, what are you going to do with that list? If you can answer that, we may be able to give you some guidance on criteria.

Some example criteria might be (each is sketched in R code after the list):

  1. A word is especially common if it is among the 0.1% most common words. (Algorithm: sort the words by their frequency and keep the top 0.1% of that list.)

  2. The especially common words are those whose frequency is at least 2$\times$ that of all other words. (Algorithm: sort the words by their frequency and scan the sorted list from most-frequent to least-frequent. For each word, test whether its frequency is at least 2$\times$ the frequency of the next word on the list; if so, it is a potential dividing line. Use the last dividing line you saw.)

  3. A word is especially common if its frequency is at least 3 standard deviations above the mean. (Algorithm: scan the words once to compute the mean and standard deviation of the frequencies, then scan a second time to find all that meet the criteria.) (Note: this criterion is probably only relevant if you have some reason to expect the frequencies to be normally distributed. In practice, I would not expect them to be normally distributed, so I would be skeptical of this criterion.)
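To make these concrete, here is a hedged sketch of criteria 1 and 2 in R, again assuming a data frame `df` with columns `word` and `count` (hypothetical names); criterion 3 is the same cutoff computation shown in the question:

```r
df_sorted <- df[order(-df$count), ]  # most frequent first

# Criterion 1: keep the top 0.1% of words by frequency.
top <- head(df_sorted, ceiling(0.001 * nrow(df_sorted)))

# Criterion 2: mark every index where a word's count is at least
# 2x the next word's count, and cut at the last such dividing line.
counts <- df_sorted$count
gaps <- which(counts[-length(counts)] >= 2 * counts[-1])
if (length(gaps) > 0) {
  especially_common <- df_sorted[seq_len(max(gaps)), ]
} else {
  especially_common <- df_sorted[0, ]  # no dividing line found
}
```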

In general, once you know the criteria you have in mind, it is relatively easy to process all the data and identify those words that match the criteria. If you are given all the data in advance, this is pretty easy: you probably won't need any fancy algorithmics.

The place where fancy algorithmic techniques become necessary is in a streaming context (also known as online processing), where you are given each word one at a time and you don't have enough memory/storage to keep a copy of all the words you've seen. There are sophisticated algorithms for identifying especially common words in that setting; see, e.g., the heavy hitters algorithm and other streaming algorithms. In each case, though, you will first need to decide for yourself what criterion you want to use to classify words as especially common or not; once you know that, then we can start to consider the algorithm question.
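As an illustration (my choice of algorithm, not one the answer names), here is a minimal sketch of the Misra-Gries heavy-hitters algorithm in R: with k counters, any word occurring more than n/k times in a stream of n words is guaranteed to survive, and the reported counts are underestimates, so a second pass over the data is needed if you want exact frequencies.

```r
misra_gries <- function(stream, k) {
  counters <- numeric(0)  # named vector: word -> approximate count
  for (w in stream) {
    if (w %in% names(counters)) {
      counters[w] <- counters[w] + 1       # known candidate: bump it
    } else if (length(counters) < k - 1) {
      counters[w] <- 1                     # room for a new candidate
    } else {
      counters <- counters - 1             # decrement every counter
      counters <- counters[counters > 0]   # drop candidates that hit zero
    }
  }
  counters
}

# Example: "a" occurs 4 > 7/3 times, so it is guaranteed to survive with k = 3.
misra_gries(c("a", "b", "a", "c", "a", "d", "a"), k = 3)
```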
