How to analyze incoming email

text mining

I would like to analyze the email I receive in my Gmail inbox in order to systematically come up with effective Gmail filters for the most common types of email.

I am prepared to manually curate and classify a large volume of emails (perhaps several thousand) to supply training data, if a machine learning strategy requires it.

What I am hoping to get from the analysis: You may be familiar with email management techniques such as Inbox Zero. Generally, they rely on triage of emails. I get a very large amount of irrelevant email (mailing lists which are mostly uninteresting, coworkers who unnecessarily CC unimportant email to everyone, emails which the sender mistakenly thought I would want to read), so triage takes too much time. Therefore, I want to see whether statistical analysis of the text, attachment type, sender and recipients can be used to automate triage of the "worst offender" classes of email.

Some examples of actual data I would like to get:

  • Word clouds for emails I consider important and/or urgent vs. email that I will ignore.
  • Bayesian rules such as "emails containing a hyperlink or attachment are more important than usual if coming from person X, but not more important if from person Y, and less important from person Z" (perhaps X is my boss who sends me important documents, while Z is a coworker who only ever uses attachments to share cat pictures)
  • A rule that could filter emails such as announcement lists, by checking if it contains an announcement I would like to see (based on keywords and/or scheduled time of announcement)
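To make the second bullet concrete, here is a minimal sketch of how such rules could be estimated from hand-labelled mail. It simply counts how often a message was marked important for each (sender, has attachment) combination. The addresses, the tuple layout and the function name `importance_rates` are made up for illustration, and with only a handful of messages per sender the rates would be too noisy to trust without some smoothing.

```python
from collections import defaultdict

def importance_rates(emails):
    """Return {(sender, has_attachment): fraction of messages marked important}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [important, total]
    for sender, has_attachment, important in emails:
        entry = counts[(sender, has_attachment)]
        entry[0] += int(important)
        entry[1] += 1
    return {key: imp / total for key, (imp, total) in counts.items()}

# Hypothetical hand-labelled sample: (sender, has_attachment, important)
labelled = [
    ("boss@example.com", True, True),
    ("boss@example.com", True, True),
    ("boss@example.com", False, False),
    ("coworker@example.com", True, False),
    ("coworker@example.com", True, False),
    ("coworker@example.com", False, True),
]

rates = importance_rates(labelled)
for (sender, has_attachment), rate in sorted(rates.items()):
    print(f"{sender:22} attachment={has_attachment!s:5} important={rate:.0%}")
```

In this toy data an attachment raises importance for one sender and lowers it for the other, which is the kind of sender-dependent rule described above.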

I know there is a wealth of know-how in the tech industry for statistical analysis of email. Google, for instance, already does almost what I want with their new inbox tabs feature. Unfortunately their implementation is too opaque and gives the user too little control, so I would prefer to build my own.

How can I do this analysis? Can I use any existing tools, or must I program my own?

UPDATE: I have since realized that I can simply use an email client like Thunderbird to download all of my Gmail messages from the POP3 server. I can then export these emails from Thunderbird to some appropriate format and programmatically read the resulting files to access the key data, such as the sender and body of each email.
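As a rough sketch of the "programmatically read" step, and assuming the mail ends up in an mbox file (as far as I know this is how Thunderbird stores local folders by default, but check your own profile), Python's standard `mailbox` module can iterate over the messages. The function names `read_mbox` and `body_text` and the file name in the usage comment are my own placeholders, and headers are returned as-is without decoding encoded words.

```python
import mailbox

def body_text(message):
    """Return the first text/plain part of a message, or '' if there is none."""
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # charset name Python does not know
            return payload.decode("utf-8", errors="replace")
    return ""

def read_mbox(path):
    """Yield (sender, subject, body) for every message in an mbox file."""
    for message in mailbox.mbox(path):
        sender = str(message.get("From", ""))
        subject = str(message.get("Subject", ""))
        yield sender, subject, body_text(message)

# Usage (hypothetical file name):
# for sender, subject, body in read_mbox("Inbox.mbox"):
#     print(sender, "|", subject, "|", len(body), "characters")
```

Recipients and attachment types can be pulled out the same way, from the `To`/`Cc` headers and from the content type of each part.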

The problem then reduces to the practically simpler one of doing statistical analysis of text files (perhaps with additional attributes, if parameters besides the body text are to be considered).

Best Answer

Naive Bayes classifiers are very simple and surprisingly good at the kind of email classification you're describing. SpamBayes is an open-source project you might be interested in, but Naive Bayes implementations exist in most languages.
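To give an idea of how little code is involved, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written from scratch in Python. This is not how SpamBayes is implemented. The class name, the tokenizer and the four training messages are invented for illustration, and a real filter would be trained on your own labelled mail.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log posterior score."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocabulary)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data; real labels would come from manual triage.
clf = NaiveBayes()
clf.train("quarterly report attached please review before the meeting", "important")
clf.train("budget review meeting moved to monday", "important")
clf.train("funny cat pictures attached enjoy", "ignore")
clf.train("weekly newsletter unsubscribe here", "ignore")

print(clf.classify("please review the budget report"))  # important
print(clf.classify("cat pictures newsletter"))          # ignore
```

Sender, recipients and attachment type can be fed in as extra pseudo-words (for example a token like `from:boss@example.com`), so the same classifier uses them alongside the body text.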