Machine Learning – Application of Machine Learning Methods on StackExchange Websites

machine learning

I have a Machine Learning course this semester, and the professor asked us to find a real-world problem and solve it with one of the machine learning methods introduced in class.

I am a fan of Stack Overflow and Stack Exchange, and I know that database dumps of these websites are provided to the public because they are awesome! I hope to find a good machine learning challenge in these databases and solve it.

My idea

One idea that came to mind is predicting tags for a question based on the words entered in the question body. I think a Bayesian network is the right tool for learning tags for a question, but I need to do more research.
Anyway, after the learning phase, when a user finishes entering a question, some tags should be suggested to them.
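
A minimal sketch of this idea, under stated assumptions: it uses multinomial naive Bayes, which is the simplest Bayesian network structure (each word depends only on the tag), not a general Bayesian network. Each tag is treated as its own class, and a question with several tags counts as a training example for each of them. The class name `NaiveBayesTagger`, the smoothing constant, and the toy questions and tags are all hypothetical and are not taken from the real data dump.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Multinomial naive Bayes over words, one class per tag (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.tag_counts = Counter()              # number of questions carrying each tag
        self.word_counts = defaultdict(Counter)  # word frequencies per tag
        self.vocab = set()
        self.n_questions = 0

    def fit(self, questions):
        """questions: iterable of (body_text, list_of_tags) pairs."""
        for body, tags in questions:
            words = tokenize(body)
            self.vocab.update(words)
            self.n_questions += 1
            for tag in tags:
                self.tag_counts[tag] += 1
                self.word_counts[tag].update(words)
        return self

    def score(self, body, tag):
        """Log prior of the tag plus smoothed log likelihood of the known words."""
        words = [w for w in tokenize(body) if w in self.vocab]
        total = sum(self.word_counts[tag].values())
        denom = total + self.alpha * len(self.vocab)
        logp = math.log(self.tag_counts[tag] / self.n_questions)
        for w in words:
            logp += math.log((self.word_counts[tag][w] + self.alpha) / denom)
        return logp

    def suggest(self, body, k=3):
        """Return the k highest-scoring tags for a new question body."""
        ranked = sorted(self.tag_counts, key=lambda t: self.score(body, t), reverse=True)
        return ranked[:k]


# Toy training data (made up for illustration only).
train = [
    ("How do I fit a linear regression model in R", ["regression", "r"]),
    ("Interpreting regression coefficients and residuals", ["regression"]),
    ("Choosing the number of clusters for k-means clustering", ["clustering"]),
    ("Hierarchical clustering versus k-means", ["clustering"]),
]
tagger = NaiveBayesTagger().fit(train)
print(tagger.suggest("Residuals of my regression model look odd", k=1))  # ['regression']
```

The scores are only useful for ranking tags against each other; they are not calibrated probabilities. On the real dump, stop-word removal and a minimum tag frequency would probably be needed before the suggestions become useful.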

Please tell me:

I want to ask the stats community, as people experienced in ML, two questions:

  1. Do you think tag suggestion is a problem that has at least some chance of being solved? Do you have any advice about it? I am a little worried because Stack Exchange has not implemented such a feature yet.

  2. Do you have any other or better ideas for an ML project based on the Stack Exchange databases? I find it really hard to find something to learn from them.


Consideration about database errors:
I would like to point out that although the databases are huge and contain many instances, they are not perfect and are prone to error. The most obvious example is user age, which is unreliable. Even the tags selected for a question are not 100% correct. In any case, we should take the correctness of the data into account when selecting a problem.

Consideration about the problem itself: My project should not be about data mining or something similar. It should simply be an application of ML methods to a real-world problem.

Best Answer

Yes, I think tag prediction is an interesting one and one for which you have a good shot at "success".

Below are some thoughts intended to potentially aid in brainstorming and further exploration of this topic. I think there are many potentially interesting directions that such a project could take. I would guess that a serious attempt at just one or two of the below would make for a more than adequate project and you're likely to come up with more interesting questions than those I've posed.

I'm going to take a very wide view as to what is considered machine learning. Undoubtedly some of my suggestions would be better classified as exploratory data analysis and more traditional statistical analysis. But, perhaps, it will help in some small way as you formulate your own interesting questions. You'll note, I try to address questions that I think would be interesting in terms of enhancing the functionality of the site. Of course, there are many other interesting questions as well that may not be that related to site friendliness.

  1. Basic descriptive analysis of user behavior: I'm guessing there is a very clear cyclic weekly pattern to user participation on this site. When does the site get the most traffic? What does the graph of user participation on the site look like, say, stratified by hour over the week? You'd want to adjust for potential changes in overall popularity of the site over time. This leads to the question, how has the site's popularity changed since inception? How does the participation of a "typical" user vary with time since joining? I'm guessing it ramps up pretty quickly at the start, then plateaus, and probably heads south after a few weeks or so of joining.
  2. Optimal submission of questions and answers: Getting insight on the first question seems to naturally lead to some more interesting (in an ML sense) questions. Say I have a question I need an answer to. If I want to maximize my probability of getting a response, when should I submit it? If I am responding to a question and I want to maximize my vote count, when should I submit my answer? Maybe the answers to these two are very different. How does this vary by the topic of the question (say, e.g., as defined by the associated tags)?
  3. Biclustering of users and topics: Which users are most alike in terms of their interests, again, perhaps as measured by tags? What topics are most similar according to which users participate? Can you come up with a nice visualization of these relationships? Offshoots of this would be to try to predict which user(s) is most likely to submit an answer to a particular question. (Imagine providing such technology to SE so that users could be notified of potentially interesting questions, not simply based on tags.)
  4. Clustering of answerers by behavior: It seems that there are a few different basic behavioral patterns regarding how answerers use this site. Can you come up with features and a clustering algorithm to cluster answerers according to their behavior? Are the clusters interpretable?
  5. Suggesting new tags: Can you come up with suggestions for new tags based on inferring topics from the questions and answers currently in the database? For example, I believe the tag [mixture-model] was recently added because someone noticed we were getting a bunch of related questions. But it seems an information-retrieval approach should be able to extract such topics directly and potentially suggest them to moderators.
  6. Semisupervised learning of geographic locations: (This one may be a bit touchy from a privacy perspective.) Some users list where they are located. Others do not. Using usage patterns and potentially vocabulary, etc., can you put a geographic confidence region on the location of each user? Intuitively, it would seem that this would be (much) more accurate in terms of longitude than latitude.
  7. Automated flagging of possible duplicates and highly related questions: The site already has a similar sort of feature with the Related bar in the right margin. Finding nearly exact duplicates and suggesting them could be useful to the moderators. Doing this across sites in the SE community would seem to be new.
  8. Churn prediction and user retention: Using features from each user's history, can you predict the next time you expect to see them? Can you predict the probability they will return to the site conditional on how long they've been absent and features of their past behavior? This could be used, e.g., to try to notice when users are at risk of "churn" and engage them (say, via email) in an effort to retain them. A typical approach would shoot out an email after some fixed period of inactivity. But, each user is very different and there is lots of information about lots of users, so a more tailored approach could be developed.
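
As one concrete starting point for item 7 above (flagging possible duplicates), here is a rough sketch that compares question titles by TF-IDF cosine similarity. It is only one of many possible approaches, not a description of how the site's Related bar works. The function names, the similarity threshold, and the toy titles are hypothetical, and the all-pairs loop would be too slow for a full dump without some form of indexing.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of word -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF, so words present in every document keep a positive weight.
        vectors.append({
            w: count * (math.log((1 + n) / (1 + doc_freq[w])) + 1.0)
            for w, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_duplicates(titles, threshold=0.3):
    """Return (i, j, similarity) for pairs at or above the threshold, most similar first."""
    vectors = tfidf_vectors(titles)
    pairs = []
    for i, j in combinations(range(len(titles)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy titles (made up for illustration only).
titles = [
    "How to interpret coefficients in logistic regression",
    "Interpreting logistic regression coefficients",
    "Choosing the number of clusters in k-means",
]
pairs = candidate_duplicates(titles)
```

With these toy titles, only the first two are flagged as a candidate pair. The threshold of 0.3 is an arbitrary assumption; in practice it could be tuned against questions that moderators have already closed as duplicates.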