Fitting a linear model through noisy data

data-preprocessing, dataset, predictive-models, regression

I'm currently working on a predictive modeling project in which I have to predict $Y$ from variables $X_1, X_2, X_3$, and $X_4$, which are not necessarily independent. Our first idea was to propose a linear regression model defined as
$$Y = \beta_0+\beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4.$$

In my dataset ($10^5$ observations), I have noticed that a lot of the data is 'grouped'. To clarify what I mean by 'grouped': I have data points $(x_{1i}, x_{2i}, x_{3i}, x_{4i}, y_i)$ and $(x_{1j}, x_{2j}, x_{3j}, x_{4j}, y_j)$ with
$$x_{1i} = x_{1j}, \quad x_{2i} = x_{2j}, \quad x_{3i} = x_{3j}, \quad x_{4i} \neq x_{4j}, \quad y_i \neq y_j,$$

where $1 \leq i, j \leq 10^5$ and $x_{kl}$ denotes the $l$th observation of variable $X_k$ for $k \in \{1,2,3,4\}$.

In other words, for many observations the values of $X_1$, $X_2$, and $X_3$ coincide while the corresponding values of $X_4$ and $Y$ differ substantially. After fitting the model, the performance was really bad. I believe this 'grouped' data has a large impact on the goodness of fit, since the model tries to fit as many data points as possible, which leads to overfitting.
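
For concreteness, this is the kind of structure I mean (a minimal pandas sketch with made-up numbers; the column names are placeholders for my actual variables):

```python
import pandas as pd

# Made-up stand-in for the real 10^5-row dataset.
df = pd.DataFrame({
    "x1": [1.0, 1.0, 2.0, 2.0],
    "x2": [0.5, 0.5, 0.3, 0.3],
    "x3": [7.0, 7.0, 4.0, 4.0],
    "x4": [0.1, 0.9, 0.2, 0.8],
    "y":  [3.0, 9.0, 1.0, 5.0],
})

# Size of each (x1, x2, x3) group: groups larger than 1 are the
# 'grouped' observations described above.
print(df.groupby(["x1", "x2", "x3"]).size())

# Spread of y within each group: large values mean x4 alone would
# have to explain a lot of the variation in y.
print(df.groupby(["x1", "x2", "x3"])["y"].std())
```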

Is there a good way to deal with this?

Thanks in advance!

Best Answer

If I understand your question correctly, the issue is that $X_1$, $X_2$, and $X_3$ are all highly correlated. That is a problem of multicollinearity among your predictors rather than of non-independence (grouping) in your data.
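
You can check this diagnosis directly. Here is a minimal sketch using pandas and statsmodels; the DataFrame below is a synthetic stand-in, so substitute your own columns:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic stand-in for the real data: x2 and x3 are near-copies of x1.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=n),
    "x3": x1 + rng.normal(scale=0.01, size=n),
    "x4": rng.normal(size=n),
})

# Pairwise correlations: values near +/-1 flag redundant predictors.
print(df.corr().round(3))

# Variance inflation factors: VIF well above ~10 is a common rule of
# thumb for problematic multicollinearity.
X = add_constant(df)  # VIF should be computed with an intercept included
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```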

There are a number of solutions for this. The simplest is to drop redundant variables, if you're okay with that: if $X_1$, $X_2$, and $X_3$ are all highly correlated, then a model that includes just $X_1$ and $X_4$ might be fine. If for some reason you don't want to drop any variables, you can use principal component analysis to transform the predictors into orthogonal components, or use another type of model that handles multicollinearity well, such as ridge regression. Here's a relevant answer with some useful links: https://stats.stackexchange.com/a/124232/131407
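
For illustration, here is a minimal scikit-learn sketch of both options. The data is synthetic, and the choice of two principal components is an assumption to be tuned on your own dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data mimicking the question: x2 and x3 are near-copies of x1.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.01, size=n),  # x2, highly correlated with x1
    x1 + rng.normal(scale=0.01, size=n),  # x3, highly correlated with x1
    rng.normal(size=n),                   # x4, independent
])
y = 2.0 * x1 + 0.5 * X[:, 3] + rng.normal(size=n)

# Option 1: ridge regression, with the penalty chosen by cross-validation.
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13)))
ridge.fit(X, y)

# Option 2: principal components regression -- project the predictors
# onto orthogonal components, then run ordinary least squares on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)

print("ridge R^2:", ridge.score(X, y))
print("PCR   R^2:", pcr.score(X, y))
```

Standardizing first matters for both options: ridge penalizes coefficients on a common scale, and without scaling the principal components simply chase whichever variable has the largest variance.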
