Solved – Logistic regression for ranking: how do you represent inter-human variation

logistic

This machine learning contest in Kaggle has a benchmark solution that uses logistic regression: http://www.kaggle.com/c/predict-who-is-more-influential-in-a-social-network

The context provides data where two twitter users are compared based on the values of same features. So for each data row in the learning set we have

Twitter_User_A_More_influential | Twitter User_B_moreInfluential | {User_As_feature_vals} | {User_Bs_feature_vals}

If user A is more influential, the value is 1 (and B is 0) and the reverse if B is more influential.

What is interesting is that the benchmark solution ( https://gist.github.com/fhuszar/5372873 ) is using logistic regression. I am not sure if logistic regression really encapsulates the model we have here, and I'd like to know how it may be improved (if it may be improved at all), especially to include inter-human variation. See the snippet from the python code used in the benchmark solution:

X_train = transform_features(X_train_A) - transform_features(X_train_B)
model = linear_model.LogisticRegression(fit_intercept=False)
model.fit(X_train,y_train)

This code transforms feature values of twitter user A and B (using new_ val = log(1+x)) and builds an array that has the difference between the transformed/regularized values. y_train contains ranking values for A: 1 or 0. The code fits logistic regression using training data and ranking values of A.

The simplifications I see in this implementation are:

  1. It turns ranking into classification, expressing more influential as influential or not.
  2. It expresses the relation between the difference (change) in feature values and preferences of human agents.
  3. It does not include anything that represents variation of judgement from N agents.

Number 3 feels like the biggest simplification to me. As it is, this model seems to represent a single individual's ranking of N comparisons, but the definition of data mentions more than 1 user (presumably close to N).

I've been thinking about how to include that aspect in logistic regression, but so far I could not find a way of doing it. How would you represent the N human agents in terms of a random variable here? I'd appreciate different ways of modelling this data, from a statistics perspective.

Best Answer

So far, my findings point towards hierarchical modelling and mixture models. A statistician from my department also confirmed this.