Solved – Bad Linear regression results

lasso, machine learning, python, regression

I have a dataset and need to predict the flow of users to a given city from information such as the day of the week, the month, the distance from the city of origin, etc.

First I plotted a correlation heatmap to see whether the features are correlated with each other, and this is the result:

[image: correlation heatmap of the features]

As we can see, there is not much correlation between the features.

I fitted a linear regression and obtained very bad results (R^2 = 0.1).

I then tried Lasso regression in order to drop the uninformative features, but the best Lasso result is obtained with lambda = 0, i.e. using all the features.

My question: is it possible that the dataset itself is very poor and that the problem is not the linear regression tool? Are there other techniques for finding out whether a better model exists? I'm trying to understand why linear regression performs so badly.

OK, I plotted each feature against the label, and I think the problem is the dataset.

[image: plots of each feature against the label]

The plots marked with a green X are the features I decided to drop, which gives an average training error of 4200 (compared with 22000 before). Honestly, I don't know what to do now.

Best Answer

There are many reasons why linear regression may perform "so bad". A linear regression model may in fact be appropriate, but there may be a lot of noise in the data; in other words, the explanatory variables that you have simply don't explain enough of the variation in the response. There may be non-linear associations, which could be modelled with a linear model (by including non-linear terms in the model or by using an additive model); alternatively, a non-linear model may be more appropriate. There may also be interactions among the explanatory variables.
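As a rough illustration of the point about non-linear terms and interactions, here is a minimal sketch on synthetic data (the data, the variable names `x1` and `x2`, and the helper `r_squared` are all made up for this example and have nothing to do with the asker's dataset). It fits ordinary least squares twice with NumPy: once on the raw features only, and once with a squared term and an interaction term added.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return the in-sample R^2."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic data (an assumption for illustration only): the response depends
# on x1 squared and on the product x1 * x2, not on x1 or x2 linearly.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = x1**2 + x1 * x2 + rng.normal(0, 0.5, n)

X_linear = np.column_stack([x1, x2])                      # main effects only
X_extended = np.column_stack([x1, x2, x1**2, x1 * x2])    # plus non-linear terms

r2_linear = r_squared(X_linear, y)
r2_extended = r_squared(X_extended, y)

print(f"main effects only:   R^2 = {r2_linear:.2f}")
print(f"with x1^2 and x1*x2: R^2 = {r2_extended:.2f}")
```

The second model is still linear in its coefficients, so the usual linear regression machinery applies; only the design matrix changes. Whether such terms help on real data depends entirely on the data.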

To investigate further, you could plot the response variable against each of the explanatory variables in turn - this may indicate non-linearities, or indeed confirm that the linear model might be appropriate.
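One possible way to produce those plots, sketched with Matplotlib under the assumption that the features sit in a 2-D NumPy array `X` and the response in a 1-D array `y` (the function name `plot_response_vs_features` and the synthetic data are hypothetical, for illustration only):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open plot windows
import matplotlib.pyplot as plt


def plot_response_vs_features(X, y, names):
    """Draw one scatter panel per feature, with the response on the y-axis."""
    n_features = X.shape[1]
    fig, axes = plt.subplots(
        1, n_features, figsize=(4 * n_features, 3.5), squeeze=False
    )
    for j, ax in enumerate(axes[0]):
        ax.scatter(X[:, j], y, s=8, alpha=0.4)
        ax.set_xlabel(names[j])
        ax.set_ylabel("response")
    fig.tight_layout()
    return fig


# Synthetic stand-in for a real dataset (assumption): the response is
# curved in x0, linear in x1, and unrelated to x2.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.3, 300)

fig = plot_response_vs_features(X, y, ["x0", "x1", "x2"])
```

Calling `fig.savefig("response_vs_features.png")` afterwards would write the panels to a file. In a plot like this, a curved band suggests a non-linear term, a sloped band a linear effect, and a shapeless cloud a feature that carries little information on its own.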

Also, before throwing away the model on the basis of $R^2$ (which is generally not a good thing to do) you should perform the usual regression diagnostics such as inspecting residual plots.
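A residual plot is normally inspected by eye, but the same idea can be sketched numerically. The example below is only an illustration on synthetic data (the helper `binned_residuals` is hypothetical): it fits a straight line to a response that is actually quadratic, then averages the residuals within equal-count bins of the fitted values. If the linear model were adequate, the bin means should all hover around zero.

```python
import numpy as np


def binned_residuals(fitted, resid, n_bins=5):
    """Mean residual within equal-count bins, ordered by fitted value."""
    order = np.argsort(fitted)
    return np.array([chunk.mean() for chunk in np.array_split(resid[order], n_bins)])


# Synthetic data (assumption): the true relationship is quadratic,
# but a straight line is fitted anyway.
rng = np.random.default_rng(2)
x = rng.uniform(0, 4, 400)
y = x**2 + rng.normal(0, 1.0, 400)

A = np.column_stack([np.ones_like(x), x])       # intercept + slope
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

means = binned_residuals(fitted, resid)
print(np.round(means, 2))
```

Here the bin means are positive at both ends and negative in the middle, the systematic U-shape that a residual-versus-fitted plot would show when a curved relationship has been forced into a straight line.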