[TeX/LaTeX] Does the wordcount package do a proper word count?

word count

I need to do what is asked in "Dynamically count and return number of words in a section", but with LaTeX rather than LuaTeX or ConTeXt. The LaTeX answer to that question relies on the texcount package, which, from my understanding of the documentation, parses a LaTeX document with a complicated Perl script. Since LaTeX cannot easily be parsed with Perl, texcount misses all sorts of things (e.g., inline citations) that I need to count.

One of the answers to "Is there any way to do a correct word count of a LaTeX document?" suggests that the wordcount package can be used to do a "correct" word count. From my understanding, wordcount changes how LaTeX works so that the log file can be parsed for the number of characters, words, and spaces. The approach of wordcount seems much more robust than that of texcount. The package documentation mentions issues with tables and math. As the documents I need to count will not have tables or math, it seems wordcount might be the better choice.

The existing questions and answers suggest that texcount is the preferred way to count words. Further, texcount is much newer than wordcount. This makes me worried that I am missing something and that wordcount does not accurately count the words.

Best Answer

Since this is too long to fit in a comment, I'll make it into an answer.

Also, beware that as the maker of TeXcount I'm not entirely unbiased, although I'll try to give a fair answer.

First, all word counters will have technical limitations as well as choices as to what to count (or not count) as words. This is particularly true of documents that, like many LaTeX documents, are more complex than pure prose.

Technical limitations of LaTeX-based word counts will often be related to the ability to interpret macros and environments.

In addition there will be choices, some of which may be guided by technical issues, as to what to count as a word: formulas, footnotes, titles, captions, citations, etc. Technical limitations aside, there is no unique answer to which is the correct word count: different people and journals will have different opinions, and may even differ from case to case based on the type of manuscript.

The only way to check if any word counter counts what you want it to count is to check it.

My main claim as regards TeXcount is not primarily that it is more accurate than others: I haven't made a proper, fair comparison, and the result might in any case depend on the type and writing style of the document and on the user's preferences. Rather, my claim is that it enables you to check what it counts as words by providing you with an annotated (colour-coded) version of the TeX files.

You can check out TeXcount without any installation using the web service to see how it works on your document. (Note that the connection is by HTTP which means your document will be sent in cleartext over the Internet, so don't use this for confidential material.)

There are ways in TeXcount to tweak the counting by adding additional macro/environment handling rules or mark parts of the text that should be ignored, but the core strength in terms of assessing the accuracy is the annotated output which enables you to check in detail how your document was processed.
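As a minimal sketch of marking text to be ignored, assuming the `%TC:` comment instructions described in the TeXcount documentation (the document content below is made up for illustration, and the syntax should be checked against the installed version):

```latex
\documentclass{article}
\begin{document}

\section{Results}
These five words are counted.

% Everything between the two instructions below is meant to be
% skipped by TeXcount, although LaTeX still typesets it.
%TC:ignore
This paragraph should not contribute to the word count.
%TC:endignore

\end{document}
```

Running texcount with the `-v` option on such a file should print the annotated text, which shows whether the marked region was in fact skipped.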


Let me add a bit to my original rant to better answer the more technical side of your question.

When I first made TeXcount, it was because I couldn't find a LaTeX word counter that worked on my own LaTeX documents. This was probably due to how my documents were structured. In turn, early versions of TeXcount failed similarly on some other people's documents, since their LaTeX style differed from what I had prepared TeXcount to handle. Over time TeXcount has hopefully become more robust, in part due to feedback from users over several years.

TeXcount, as you note, uses a Perl script to parse the file(s), and does not actually run any of the TeX macros. As such, text produced by macros will normally not be counted. You can specify that a macro, e.g. \LaTeX, represents a word, but that is basically it. Any word counter that does not actually run TeX will face similar limitations.
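To illustrate the limitation, here is a small sketch with a hypothetical macro, \toolname, that expands to three words. A counter that only parses the source sees one control sequence, not three words. The `%TC:macroword` line is an assumption based on the TeXcount documentation for declaring that a macro represents a given number of words; verify it against your version before relying on it.

```latex
\documentclass{article}

% Hypothetical macro: typesets three words that never appear
% literally in the body of the document.
\newcommand{\toolname}{Example Analysis Toolkit}

% Assumed TeXcount instruction: treat \toolname as three words.
%TC:macroword \toolname 3

\begin{document}
We evaluated \toolname{} on two corpora.
\end{document}
```

Without such a rule, the words produced by \toolname would normally go uncounted, which is the kind of discrepancy the annotated output makes visible.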

Word counters that actually run TeX may be able to count text generated by macros, but run into the limitations of being implemented in TeX. They may be fragile, as I experienced when they failed to run properly on my own documents, or less flexible in controlling which parts or contexts of the document should be counted.

One of the main risks for any TeX file word counter is that large parts of the document get ignored or misinterpreted due to some technical issue which is not immediately obvious to the user. The next most common problem is systematic errors (or biases) in how it counts special cases (hyphenated terms, numbers, formulas, special characters, etc.).

The final answer is that the best way to find out is to check what actually gets counted.
