[Tex/LaTex] How to count number of word occurrences

How do you count the number of word occurrences in your TeX files? I want this so that I can more easily spot words I overuse in a text. At the moment I use the following one-liner in bash:

cat *.tex | sed 's/[[:space:][:punct:]]\+/\n/g' | sort | uniq -c | sort -n

It outputs all .tex files with cat, uses sed to replace each run of whitespace and punctuation with a line break, sorts the output, counts identical words with uniq -c, and finally sorts again by that count.
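For readability, the same stages can be spelled out one per line. This is only a sketch: the function name count_words is made up for illustration, and the \+ and \n in the sed expression assume GNU sed.

```shell
# Hypothetical helper: the one-liner above, one stage per line.
# Assumes GNU sed (\+ and \n in the replacement are GNU extensions).
# The "|" inside the bracket expression is dropped here because
# [:punct:] already contains it.
count_words() {
    cat "$@" |                               # concatenate the given files (stdin if none)
    sed 's/[[:space:][:punct:]]\+/\n/g' |    # put each word on its own line
    sort |                                   # bring identical words together
    uniq -c |                                # prefix each distinct word with its count
    sort -n                                  # order by count, most frequent last
}
```

A call such as `count_words *.tex | tail -n 20` would then show the twenty most frequent entries.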

One problem with this approach is that words that belong together but are separated by whitespace are counted separately. For example, "New York" yields k occurrences of New and n occurrences of York, mixed in with other occurrences of New and York.
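One possible workaround, sketched here under some assumptions, is to glue known phrases together before splitting. The function name count_with_phrases and the phrase list are hypothetical, the splitting is done with tr instead of sed so that the underscore survives, and the text is assumed to be plain ASCII. A phrase that is broken across two lines would not be caught.

```shell
# Hypothetical helper: protect a hand-maintained list of phrases,
# then count words as before. Reads text on stdin.
count_with_phrases() {
    # Join each known phrase with an underscore (extend the list as needed).
    sed -e 's/New York/New_York/g' \
        -e 's/Los Angeles/Los_Angeles/g' |
    # Split on every run of characters that is neither alphanumeric nor "_".
    tr -cs '[:alnum:]_' '\n' |
    sort | uniq -c | sort -n
}
```

With this, `cat *.tex | count_with_phrases` reports New_York as a single entry, separate from any other New or York.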

EDIT: Another problem, of course, is how to recognize word inflection such as declension and conjugation. That is probably well beyond the scope of a one-liner, but does anyone have an idea how to cope with it?

EDIT2: As Hendrik and Joseph pointed out, that's not really TeX-related, but perhaps somebody finds it useful 🙂

Best Answer

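I cannot say much about the problems you mention, but you could run latex and then use dvi2tty on the output .dvi file. This takes better care of macro expansion, since the words are extracted from the typeset output rather than from the source. I therefore suggest: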

dvi2tty 00.dvi | sed 's/[[:space:][:punct:][:digit:]]\+/\n/g' | sed '/^$/d' | tr "A-Z" "a-z" | sort | uniq -c | sort -nr | sed "/ 1 /d"

which is similar to your pipe, except that

It treats digits as separators.
It ignores case (tr converts everything to lowercase).
It eliminates empty lines.
It eliminates words that occur only once.
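To try the filter part without a .dvi file at hand, the stages after dvi2tty can be wrapped in a function that reads plain text from stdin. This is a sketch only: the name frequent_words is made up, and the sed expression again assumes GNU sed.

```shell
# Hypothetical helper: the filter stages of the pipe above, reading stdin.
# Assumes GNU sed (\+ and \n in the replacement are GNU extensions).
frequent_words() {
    sed 's/[[:space:][:punct:][:digit:]]\+/\n/g' |  # split on whitespace, punctuation, digits
    sed '/^$/d' |                                   # drop empty lines
    tr 'A-Z' 'a-z' |                                # fold everything to lowercase
    sort | uniq -c |                                # count identical words
    sort -nr |                                      # most frequent first
    sed '/ 1 /d'                                    # drop words that occur only once
}
```

It can then be used as `dvi2tty 00.dvi | frequent_words`, or fed any other plain-text source.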
