Make a list of all words in LaTeX

lists

Is there a way to make a list of all the words used in a LaTeX document? If someone knows another way to do it, e.g. with Python, a website, or something else, that would also be helpful.

Here is an example of what I would like:

\documentclass{article}
\begin{document}
I have a dog and a cat.
The dog and the cat are named Bob and John.
\end{document} % Should maybe be after the list


list:
I 
have
a 
dog 
and 
cat
the 
are 
named 
bob 
john

The order of the words in the list does not matter.
And thank you if you can help.

Best Answer

For some definition of "word" and "being used", you can extract the text from the PDF and process it into a list.

pdflatex file1
pdftotext file1.pdf

will produce file1.txt, containing:

I have a dog and a cat. The dog and the cat are named Bob and John.

1

You can then process this with standard Linux utilities (these are also available on Windows if needed; I am actually using the Cygwin versions on Windows):

cat file1.txt | tr '[:space:][,.]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq

This produces the list:

1
a
and
are
bob
cat
dog
have
i
john
named
the

Each step of the pipeline does the following:

  • replace white space and punctuation by newline
  • lowercase the resulting words
  • sort alphabetically
  • remove duplicates.
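
Since the question also asks about Python, the same four steps can be sketched there. This is only a minimal sketch, not part of the original answer: the function name unique_words and the sample variable are made up for illustration, and the sample string is assumed to hold the text that pdftotext wrote to file1.txt. Like the tr command, it treats only whitespace, commas, and periods as separators.

```python
import re


def unique_words(text):
    """Return the sorted, de-duplicated, lowercased words found in text."""
    # Split on runs of whitespace, commas and periods (same separators as the tr call).
    tokens = re.split(r"[\s,.]+", text)
    # Lowercase each token, drop empty strings left at the edges by the split,
    # use a set to remove duplicates, then sort alphabetically.
    return sorted({token.lower() for token in tokens if token})


# Hypothetical stand-in for the contents of file1.txt (including the page number).
sample = "I have a dog and a cat. The dog and the cat are named Bob and John.\n\n1\n"

for word in unique_words(sample):
    print(word)
```

To run it on real output, replace sample with the contents of the text file, for example open("file1.txt", encoding="utf-8").read(). Other punctuation (quotes, parentheses, hyphens) would need to be added to the character class, just as it would in the tr command.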