Solved – Logistic regression is slow

classification, logistic, python

I am relatively new to machine learning and I have a data classification problem where each sample has ~1500 features (continuous) and the category is binary.

I want to apply logistic regression, for learning purposes. I have written a vectorized implementation in Python, which works well and runs fast on other datasets with fewer features (on the order of tens to hundreds).
On my dataset, however, I can't even finish training (on ~3000 samples), as it simply takes ages.

Could this be a problem of: 1) my vectorization, 2) the number of features in a sample (time for some dimensionality reduction?), 3) the fact that the problem is not suited to logistic regression, 4) using Python, or something else altogether?
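For reference, a vectorized implementation of the kind described above might look like this (a minimal NumPy sketch with illustrative names, not the asker's actual code). Note that one batch gradient step on a 3000 × 1500 matrix is a few million floating-point operations, which should take milliseconds, so "takes ages" points at something other than raw matrix size:

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow in exp for very large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logreg(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the logistic loss.

    X: (n_samples, n_features), y: (n_samples,) with values in {0, 1}.
    Each iteration costs O(n_samples * n_features), so a
    3000 x 1500 dataset is cheap per step.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)       # predicted probabilities
        grad_w = X.T @ (p - y) / n   # vectorized gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

If a loop like this still takes hours, the culprit is usually an accidental Python-level loop over samples or features, or a stopping criterion that never triggers.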

Best Answer

Well, considering you use a custom implementation and that overfitting may or may not be involved, it's really hard to say. So what are your possibilities?

1. Your algorithm might be badly written.
2. The algorithm you chose might be a bad fit for the problem.
3. The algorithm does not converge (overfitting or collinearity) and has no alternative stopping condition.
4. An unknown issue with Python.

The third option seems most likely, but all are possible.

A recipe to get you a bit further is to print (or log to a file) all the intermediate steps of your algorithm: the updated solution (I assume an iterative algorithm), the loss, and so on. This might help pinpoint what is going wrong. For example, one indicator of overfitting is that the coefficients grow extremely large.
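That instrumentation could look like the following (a sketch, assuming a NumPy batch-gradient-descent loop; the logging interval and blow-up threshold are arbitrary choices):

```python
import numpy as np

def fit_with_diagnostics(X, y, lr=0.1, n_iter=1000, log_every=100):
    """Gradient descent on the logistic loss, logging the loss and the
    largest coefficient so divergence or overfitting becomes visible."""
    n, d = X.shape
    w = np.zeros(d)
    history = []
    for t in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))
        eps = 1e-12  # guard against log(0)
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        w -= lr * X.T @ (p - y) / n
        if t % log_every == 0:
            max_w = np.max(np.abs(w))
            history.append((t, loss, max_w))
            print(f"iter {t:5d}  loss {loss:.4f}  max|w| {max_w:.2e}")
        # exploding coefficients suggest non-convergence (e.g. separation),
        # not necessarily a bug in the update itself
        if np.max(np.abs(w)) > 1e10:
            print("coefficients exploding -> likely no convergence")
            break
    return w, history
```

Watching the loss and `max|w|` side by side distinguishes the cases: a loss that stalls suggests a learning-rate or implementation problem, while a loss that keeps falling as the coefficients keep growing suggests the optimum is at infinity.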

I just tried the same in R, with a binomially distributed response (p = 0.5) and 1500 independent, normally distributed features (mean = 0, sd = 1) that had no connection to the response. Fitting finished after 1-2 minutes, but R warned that it did not converge, and the coefficients got very large (on the order of 1e13), which points to overfitting.
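A scaled-down version of that experiment can be reproduced in Python with pure-noise features and random labels. With more features than samples the points are almost surely linearly separable, so an unregularized fit drives the loss toward zero while the coefficients grow without bound (the dimensions below are illustrative, smaller than the 3000 × 1500 case so it runs in seconds):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 200, 300                    # more features than samples
X = rng.normal(size=(n, d))        # independent N(0, 1) features
y = rng.binomial(1, 0.5, size=n)   # labels with no relation to X

w = np.zeros(d)
lr = 0.5
norms, losses = [], []
for t in range(2000):
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))
    if t % 500 == 0 or t == 1999:
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        losses.append(loss)
        norms.append(np.linalg.norm(w))
        print(f"iter {t:4d}  loss {loss:.4f}  ||w|| {np.linalg.norm(w):.2f}")
    w -= lr * X.T @ (p - y) / n

# the loss keeps falling and ||w|| keeps growing: the model is memorizing
# noise, matching the non-convergence warning seen in R
```

The standard fixes are a regularization penalty (L2 keeps the coefficients finite even under separation) or an explicit stopping condition such as a maximum iteration count or a tolerance on the change in loss.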
