I spent three days dabbling with tm after reading a draft paper by a friend in which he explored a text corpus with UCINET, showing text clouds, two-mode network graphs and Singular Value Decomposition (with graphics, using Stata). I ran into a large number of issues: on Mac OS X, there are problems with the Java behind libraries like Snowball (stemming) or Rgraphviz (graphs).
Could someone point me not to packages (I have looked at tm, wordfish and wordscores, and know about NLTK) but to research on textual data, if possible with code, that successfully uses tm or something else to analyse data like parliamentary debates or legislative documents? I cannot seem to find much on the issue, and even less code to learn from.
My own project covers a two-month parliamentary debate, with these variables recorded in a CSV file: parliamentary session, speaker, parliamentary group, text of oral intervention. I am looking for divergence between speakers, and especially between parliamentary groups, in the use of rare and less rare terms, e.g. "security" talk against "civil liberties" talk.
Best Answer
The PhD dissertation of Ingo Feinerer from Austria, the author of tm, is written in English. Chapters 7-10 contain applications of the tm package, in order of increasing complexity.
http://epub.wu.ac.at/1923/
Read the whole document cover to cover. Note, however, that it was written in 2008 and there have been a few API changes since then. For instance, the thesis mentions a function tmMap() that has been renamed to tm_map(). So the code examples won't work as-is, and you cannot simply cut and paste them to try them out.
You can also go to
http://tm.r-forge.r-project.org/users.html
and search on that page for the phrase "wrote a paper" and you'll find many links. I've read only one of the papers, "automatic topic detection in song lyrics". Quite interesting, and funny.
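For the project described in the question, here is a rough starting point that avoids the Java dependencies entirely. It is a minimal sketch in plain Python, not taken from the thesis or from tm: it counts terms per parliamentary group and ranks them by a smoothed log-odds ratio between two groups. The column names (session, speaker, group, text), the sample rows, the group labels and the smoothing constant alpha are all assumptions made for illustration. There is no stemming or stop-word removal, and this is a much cruder measure than what wordfish or wordscores estimate.

```python
import csv
import io
import math
import re
from collections import Counter

# Hypothetical sample standing in for the real CSV file.
# The column names and the rows are assumptions for illustration only.
SAMPLE_CSV = """session,speaker,group,text
1,A,Left,"Civil liberties matter. Liberties and privacy come first."
1,B,Right,"Security first. Security threats demand security measures."
2,C,Left,"Privacy and liberties are under threat."
2,D,Right,"Public security requires stronger measures."
"""


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def group_counts(rows):
    """Return one Counter of term frequencies per parliamentary group."""
    counts = {}
    for row in rows:
        counts.setdefault(row["group"], Counter()).update(tokenize(row["text"]))
    return counts


def log_odds(counts_a, counts_b, alpha=0.5):
    """Smoothed log-odds ratio per term: positive means typical of group A."""
    vocab = set(counts_a) | set(counts_b)
    # Additive smoothing keeps terms unseen in one group from dividing by zero.
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + alpha) / total_a
        p_b = (counts_b[term] + alpha) / total_b
        scores[term] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# For a real file, replace io.StringIO(SAMPLE_CSV) with open(path, newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = group_counts(rows)
scores = log_odds(counts["Left"], counts["Right"])
ranked = sorted(scores, key=scores.get)

print("Most typical of Right:", ranked[:3])
print("Most typical of Left:", ranked[-3:])
```

The same per-group counts could be computed per speaker by keying on the speaker column instead. With only two months of debate, rare terms will have noisy scores, so it is probably worth inspecting the raw counts next to the ranking.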