Solved – Examples of text mining with R (tm package)

r, text mining

I spent three days dabbling with tm after reading a draft paper by a friend in which he explored a text corpus with UCINET, showing text clouds, two-mode network graphs and Singular Value Decomposition (with graphics, using Stata). I ran into a large number of issues: on Mac OS X, for example, there are problems with the Java behind libraries like Snowball (stemming) or Rgraphviz (graphs).

Could someone point me not to packages (I have looked at tm, wordfish and wordscores, and know about NLTK) but to research on textual data, if possible with code, that successfully uses tm or something else to analyse data like parliamentary debates or legislative documents? I cannot seem to find much on the issue, and even less code to learn from.

My own project is a two-month parliamentary debate, with these variables recorded in a CSV file: parliamentary session, speaker, parliamentary group, text of oral intervention. I am looking for divergence between speakers, and especially between parliamentary groups, in the use of rare and less rare terms, e.g. "security" talk against "civil liberties" talk.

Best Answer

The PhD dissertation by the author of tm, Ingo Feinerer from Austria, is written in English. Chapters 7-10 of this document contain applications of the tm package, in order of increasing complexity.

http://epub.wu.ac.at/1923/

Chapter 7 presents an application of tm by analyzing the R-devel 2006 mailing list. Chapter 8 shows an application of text mining for business to consumer electronic commerce. Chapter 9 is an application of tm to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. [...]. Chapter 10 shows an application for stylometry and authorship attribution on the Wizard of Oz data set.

Read the whole document cover to cover. Note, however, that the document was written in 2008, and there have been a few API changes since then. For instance, the thesis mentions a function tmMap() that has since been renamed to tm_map(). So the code examples won't work as-is; you cannot simply copy and paste them.
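To give an idea of what the renamed call looks like today, here is a minimal sketch, assuming a recent tm release (0.6 or later, where content_transformer() is available). The two toy sentences and the variable names are invented for illustration; they do not come from the thesis.

```r
library(tm)

# Toy documents, invented for illustration only
texts <- c("Security measures must be strengthened.",
           "Civil liberties must be protected.")

# Build an in-memory corpus from a character vector
corpus <- VCorpus(VectorSource(texts))

# tm_map() is the current name of what the 2008 thesis calls tmMap().
# Plain R functions such as tolower() are wrapped in content_transformer()
# so that the result is still a tm document.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(corpus)
term_counts <- as.matrix(dtm)
```

Printing term_counts shows how often each remaining term occurs in each document, which is the usual starting point for comparing vocabulary across texts.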

You can also go to

http://tm.r-forge.r-project.org/users.html

"In an attempt to inform new users about existing tm applications this site aims to provide (an incomplete alphabetical) list of tm users and their comments. Known users range from research institutes over companies to individuals. "

and search on that page for the phrase "wrote a paper" and you'll find many links. I've read only one of the papers, "automatic topic detection in song lyrics". Quite interesting, and funny.