[TeX/LaTeX] How to count the number of word occurrences

word count

How do you count the number of word occurrences in your .tex files? I do this to spot more easily the words I overuse in a text. At the moment I use the following one-liner in bash:

cat *.tex | sed 's/[[:space:][:punct:]]\+/\n/g' | sort | uniq -c | sort -n

It outputs all .tex files with cat, replaces each run of whitespace and punctuation with a line break using sed, sorts the result, counts each distinct word with uniq -c, and finally sorts numerically by that count.
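The `\+` and the `\n` in the replacement are GNU sed extensions, so the one-liner may not work with other sed implementations. As a rough sketch, the splitting step can be done with tr instead. The function name `word_freq` is made up for illustration, it reads from standard input, and unlike the one-liner it also treats digits and anything outside ASCII letters as separators:

```shell
# word_freq: hypothetical helper; reads text on stdin, prints "count word" lines.
word_freq() {
    # -c complements the set and -s squeezes runs, so every run of
    # non-letters becomes a single newline: one word per line.
    tr -cs '[:alpha:]' '\n' |
        sed '/^$/d' |   # drop the empty line left by a leading separator
        sort |
        uniq -c |
        sort -n         # least frequent first, most frequent last
}
```

It would be used as `cat *.tex | word_freq`.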

One problem with this approach is that words that belong together but are separated by whitespace are counted separately. For "New York", for example, you get k occurrences of New and n occurrences of York, mixed in with all the other occurrences of New and York.
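One possible way to make such pairs visible (only a sketch, not a full solution) is to count adjacent word pairs instead of single words, so that a frequent pair like "New York" shows up as one entry. The name `bigram_freq` is hypothetical, and because punctuation is discarded, pairs that span a sentence boundary are counted as well:

```shell
# bigram_freq: hypothetical helper; reads text on stdin and prints
# "count word1 word2" lines, most frequent pair first.
bigram_freq() {
    tr -cs '[:alpha:]' '\n' |   # one word per line
        # For each non-empty line, print the previous word together
        # with the current one, then remember the current word.
        awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |
        sort |
        uniq -c |
        sort -nr
}
```

For example, `cat *.tex | bigram_freq | head` would list the ten most frequent pairs.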

EDIT: Another problem is, of course, how to recognize word inflection such as declension and conjugation. That is probably well beyond the scope of a one-liner, but does anyone have an idea how to cope with it?

EDIT2: As Hendrik and Joseph pointed out, that's not really TeX-related, but perhaps somebody finds it useful 🙂

Best Answer

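I cannot say much about the problems you mention, but I would suggest that you run latex and then use dvi2tty on the output .dvi file. This takes better care of macro expansion, because the words are counted in the typeset text rather than in the source. I therefore suggest: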

dvi2tty 00.dvi | sed 's/[[:space:][:punct:][:digit:]]\+/\n/g' | sed '/^$/d' | tr "A-Z" "a-z" | sort | uniq -c | sort -nr | sed "/ 1 /d"

which is similar to your pipe, except that

  • It treats digits as separators.
  • It ignores case
  • It eliminates empty lines
  • It eliminates words that occur only once.
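
To try these filter stages without a .dvi file at hand, they can be wrapped in a small function that reads plain text from standard input. This is only a sketch: the name `frequent_words` and its optional threshold argument are assumptions for illustration, and it splits with tr on anything that is not an ASCII letter instead of using the sed expression above:

```shell
# frequent_words: hypothetical helper, not part of dvi2tty.
# Reads plain text on stdin and prints "count word" for every word seen
# at least MIN times (first argument, default 2), most frequent first.
frequent_words() {
    min="${1:-2}"
    tr '[:upper:]' '[:lower:]' |        # fold case so "The" and "the" merge
        tr -cs '[:lower:]' '\n' |       # any run of non-letters separates words
        sed '/^$/d' |                   # eliminate empty lines
        sort |
        uniq -c |
        sort -nr |                      # most frequent first
        awk -v min="$min" '$1 >= min'   # drop words below the threshold
}
```

With dvi2tty installed, it would be used as `dvi2tty 00.dvi | frequent_words`, or `dvi2tty 00.dvi | frequent_words 5` to keep only words that occur at least five times.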