Solved – Logistic regression: how to choose negative examples for training set

Tags: logistic, machine learning, regression

I want to predict the probability of rain based on the measured weather parameters like temperature, humidity, etc. Let's not get into why I want to do that despite the fact that weather websites already publish the probability of rain.

I want to implement this using logistic regression. I have weather data for 2012, recorded every 15 minutes. I also know the dates and times at which it rained (a Yes/No label). The feature vector is a concatenation of the weather parameters (temperature, atmospheric pressure, wind speed, humidity) at t0, t0-15 min and t0-30 min; it is a positive example if it rained at t0. Thus, I can create my supervised learning dataset with positive examples.

However, I am confused about how many negative examples I should choose. A negative feature vector would be derived in the same way, but for a t0 at which it did not rain.
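
The windowing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_examples`, the toy readings, and the assumption that the series has no gaps (so that consecutive rows are exactly 15 minutes apart) are not part of the original question.

```python
def build_examples(readings, rained, lags=2):
    """Build one feature vector per time step t0 by concatenating the
    readings at t0, t0-15 min, ..., t0-(15*lags) min.

    readings: list of [temperature, pressure, wind_speed, humidity] rows,
              ordered in time and assumed to be exactly 15 minutes apart.
    rained:   list of 0/1 labels, one per row (1 = it rained at that time).
    """
    X, y = [], []
    for t in range(lags, len(readings)):
        features = []
        for k in range(lags + 1):          # k = 0 is t0, k = 1 is t0-15 min, ...
            features.extend(readings[t - k])
        X.append(features)
        y.append(rained[t])                # label comes from t0 only
    return X, y


# Toy data (made up): four consecutive 15-minute readings.
readings = [
    [10.0, 1012.0, 3.0, 60.0],
    [10.5, 1011.0, 3.5, 65.0],
    [11.0, 1009.0, 4.0, 80.0],
    [11.2, 1008.0, 4.5, 90.0],
]
rained = [0, 0, 0, 1]

X, y = build_examples(readings, rained)
```

Positive and negative examples come out of the same loop; the only difference is the label taken at t0. The first `lags` time steps are skipped because they lack a full history.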

Here are my questions:

  1. Should I choose an equal number of positive and negative examples? Does what the model learns depend on how many examples of each category I include in my training set?

  2. If learning does depend upon the number of positive and negative examples, how many negative examples should I choose?

I know there are many other ways of doing this prediction, but please try to answer the questions regarding logistic regression only. I am not looking for other approaches.

Best Answer

The number of negative training cases does matter, in just the same way that re-weighting your existing training set (as in boosting) produces a different model: one that does better on the training cases whose weights have increased, at the expense of accuracy on the training cases whose weights have decreased.
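
One way to see this is to write out the weighted logistic regression objective. The notation here is an assumption introduced for illustration and does not appear in the original answer: $w_i$ is the weight of example $i$, $y_i \in \{0, 1\}$ its rain label, $x_i$ its feature vector, and $\beta$ the coefficient vector.

```latex
\mathcal{L}(\beta) = -\sum_{i=1}^{n} w_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],
\qquad
p_i = \frac{1}{1 + e^{-\beta^\top x_i}}
```

With every $w_i = 1$ this is the ordinary log-loss. Including an example $k$ times has the same effect as giving it weight $k$, so adding or dropping negative examples changes how strongly the fit is pulled toward them.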

For your example, I would suggest using all the negative examples you can find (it's rarely a good idea to throw away data), but re-weight your examples so that positive and negative examples each have 50% total weight (or whatever relative weights make sense in your particular use case).
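
A minimal sketch of the re-weighting suggested above, assuming labels coded as 1 (rain) and 0 (no rain). The helper name `balanced_weights` and the `positive_share` parameter are hypothetical and chosen only for illustration.

```python
def balanced_weights(labels, positive_share=0.5):
    """Return one weight per example such that the positives together carry
    `positive_share` of the total weight and the negatives carry the rest.
    The weights are scaled to sum to len(labels)."""
    n = len(labels)
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    w_pos = positive_share * n / n_pos          # weight of each positive
    w_neg = (1.0 - positive_share) * n / n_neg  # weight of each negative
    return [w_pos if label == 1 else w_neg for label in labels]


# Toy labels (made up): 1 rainy interval and 9 dry ones, all kept.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_weights(labels)

pos_total = sum(w for w, label in zip(weights, labels) if label == 1)
neg_total = sum(w for w, label in zip(weights, labels) if label == 0)
```

Here `pos_total` and `neg_total` are equal, so each class contributes half of the total weight even though no negative example was discarded. Weights like these can then be handed to any logistic regression fitter that accepts per-example weights; in scikit-learn, for instance, `LogisticRegression.fit` takes a `sample_weight` argument. Changing `positive_share` gives the "whatever relative weights make sense" variant.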
