Solved – Linear Regression and big data

Tags: large-data, random-forest, regression

I have a very large data set (79 features: 74 categorical and 5 numerical) stored as a very sparse matrix (1,500,000 rows × 79 columns).

The data set is structured as follows:

phone model, carrier, day of the week, time range, time to landing page, time to external redirect, screen size, screen megapixel, user subscribed.

The categorical features are:

  • phone model (iPhone 5, Samsung A6, …)
  • carrier (TIM, Vodafone, …)
  • day of week (Monday … Sunday)
  • time range (00:00–08:00, 08:01–12:00, …)
  • screen size (several values)
  • screen megapixels (several values)
  • user subscribed (yes or no)

All of these features have been split into one-hot columns, each containing 0 or 1.

Example:

iPhone 5, Samsung A6, ..., Vodafone, ..., Monday, ..., Time 00-08, ..., Screen size 3", ..., Megapixel 1M, ...

1, 0, ..., 1, ..., 1, ..., 1, ..., 1, ..., 1

This row means: an iPhone 5 on Vodafone, recorded on Monday between 00:00 and 08:00, with a 3" screen and a 1M-pixel screen.

There are 74 columns that can be 0 or 1; most are set to 0.
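For reference, here is a minimal sketch of this kind of one-hot encoding with pandas; the column names and values are hypothetical placeholders matching the example above:

```python
import pandas as pd

# Hypothetical raw rows with a few of the categorical columns described above
df = pd.DataFrame({
    "phone_model": ["iPhone 5", "Samsung A6", "iPhone 5"],
    "carrier": ["Vodafone", "TIM", "Vodafone"],
    "day_of_week": ["Monday", "Sunday", "Monday"],
    "time_range": ["00:00-08:00", "08:01-12:00", "00:00-08:00"],
    "subscribed": [1, 0, 1],
})

# Expand each categorical feature into 0/1 indicator columns,
# producing the mostly-zero layout described in the question
encoded = pd.get_dummies(df, columns=["phone_model", "carrier",
                                      "day_of_week", "time_range"])
print(encoded.head())
```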

The value to optimize is the subscription: we know when a user subscribed to the service, and we want to know which features maximize subscriptions.

Linear regression completes successfully, but the coefficients are very small (on the order of $10^{-9}$ or less) and the intercept is very close to 1. All the features have been normalized with min-max scaling. A chi-square test says the model does not fit.

It's a mess, and I'm starting to think this is not the right path to the result: the data set is huge and has many categorical features.

Does it make sense to apply linear regression to this data set?

I'm thinking the best choice is a random forest, which can genuinely tell which features help reach the target value.

Is it possible that linear regression is failing due to numerical approximation?

What is the best way to maximize subscriptions?

Best Answer

You have a large-$N$, small-$P$ problem: many data points but not too many features. Linear regression can work well on such a data set and finish in a reasonable amount of time.
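As a rough sketch (not the asker's exact pipeline), a linear model fits quickly even at this scale when the design matrix is kept sparse; the data below are random stand-ins:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Stand-in for the real one-hot design matrix (the actual data set is
# 1,500,000 rows x 79 columns; a smaller matrix keeps the demo fast)
X = sparse_random(200_000, 79, density=0.05, format="csr", random_state=rng)
y = rng.rand(200_000)  # stand-in for the subscription target

# With sparse input, scikit-learn uses an iterative sparse least-squares
# solver, so large N stays tractable in both time and memory
model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_[:5])
```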

To understand whether a linear model fits your data well, I would suggest fitting the model on a sample of the data and examining its performance. Then use more samples and observe what happens to the training and testing performance. This is essentially investigating the "learning curve". Check my detailed answer in the post below to tell whether you are over-fitting or under-fitting:

How to know if a learning curve from SVM model suffers from bias or variance?
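A minimal sketch of that learning-curve check with scikit-learn; the synthetic data and sample sizes are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Placeholder data; substitute the real sparse design matrix and target
X, y = make_regression(n_samples=10_000, n_features=79, random_state=0)

# Fit on increasing fractions of the data and track train/test scores
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# Train and test scores converging at a low level suggest high bias
# (under-fitting); a persistent gap suggests high variance (over-fitting)
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```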

With the information you provided, here are some things that may be happening in your data:

  • The linear model may have high bias. With over 1 million rows, it is very possible you have an under-fitting model: one that is "stable" but does not perform very well.

  • As discussed in the comments, how you encode the categorical features also matters a lot. If you have 74 categorical features, each with, say, 20 unique values, then the design matrix can become quite large and you may run into a $P > N$ problem or over-fitting. But if all of these features are binary and most values are 0, then it is fine to use a linear model (see the sketch below).
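If the per-feature cardinalities do blow up the column count, a sparse one-hot encoding keeps memory proportional to the nonzero entries; a minimal sketch with scikit-learn's OneHotEncoder (scikit-learn >= 1.2; the data are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Placeholder categorical data standing in for the 74 real features
df = pd.DataFrame({
    "phone_model": ["iPhone 5", "Samsung A6", "Samsung A6"],
    "carrier": ["Vodafone", "TIM", "Vodafone"],
})

# sparse_output=True returns a CSR matrix, so a design matrix that is
# mostly zeros costs memory only for its nonzero entries
encoder = OneHotEncoder(sparse_output=True)
X = encoder.fit_transform(df)
print(X.shape, X.nnz)  # columns = total unique values across all features
```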