Solved – Data cleaning for large sample data set in multiple linear regression

large data, multiple regression, regression

I have 70,000 observations of my dependent variable and 12 independent variables. After removing zero values, errors, and missing values from my data set, only 4,000 observations remain. Can I still run a multiple linear regression on this data set? I think 4,000 observations are more than enough for 12 independent variables, but I am not sure whether removing roughly 94% of the observations will harm my regression.

Best Answer

We'd probably need to know more about the nature of the missing data and the design of the study.

Generally, if values are missing purely at random (unrelated to the variables in your model), then your regression on n = 4,000 would still be representative. However, if the missingness is associated with both the outcome and the exposure, it acts like a confounder that is unaccounted for. In that case, even if you have 4,000 observations and only 12 independent variables, your regression results will very likely be off, or even plain misleading.
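To make the contrast concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption for illustration: a single predictor, a true slope of 2, standard normal noise, and a deliberately extreme selection rule that keeps only the 4,000 rows with the largest outcome. It simplifies the case above by letting retention depend on the outcome alone, and it is not a model of your data.

```python
import random

def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the sketch is reproducible
TRUE_SLOPE = 2.0         # assumed true effect, chosen for illustration

# Simulated "full" data set of 70,000 rows: y = 2x + noise.
full = []
for _ in range(70_000):
    x = rng.gauss(0, 1)
    y = TRUE_SLOPE * x + rng.gauss(0, 1)
    full.append((x, y))

# Case 1: rows are lost purely at random, leaving roughly 4,000.
keep_rate = 4_000 / 70_000
random_subset = [p for p in full if rng.random() < keep_rate]

# Case 2: rows are lost depending on the outcome; only the 4,000
# rows with the largest y survive.
cutoff = sorted(y for _, y in full)[-4_000]
selected_subset = [p for p in full if p[1] >= cutoff]

slope_random = ols_slope(random_subset)
slope_selected = ols_slope(selected_subset)

print(f"random loss:            n={len(random_subset)}, slope={slope_random:.2f}")
print(f"outcome-dependent loss: n={len(selected_subset)}, slope={slope_selected:.2f}")
```

Under these assumptions, the randomly thinned sample recovers a slope close to 2, while the outcome-dependent sample gives a clearly attenuated slope even though both have about 4,000 rows. Sample size alone does not protect against this kind of loss.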

Having said that, you really need to explain why the cut is so drastic. Some research designs invite a lot of missing data. For instance, online questionnaires with a prize draw usually have missingness of this magnitude. Many online respondents just enter the survey, click through all the questions without answering, and leave their e-mail address to enter the lucky draw. Other designs, like face-to-face interviews, should never have missing data this prominent.

If it's secondary data, then I'd recommend consulting the study design documentation. Some studies take only a subset of participants for further investigation, which can create the illusion that everyone else has missing data. For instance, a health study may collect the height and weight of all participants, but randomly select only 10% of them for a blood test due to cost.

Studying the original questionnaire may also help. Some data sets record N/A as missing. If you have accidentally chosen a question that sits behind a skip pattern, you may lose a lot of your sample. For instance, there could be a question asking whether the respondent has tried crack cocaine, and only if yes, a few follow-up questions. If you have picked one of those follow-up questions, a huge amount of missing data can result.
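One quick way to spot a variable like this is to tabulate, for each column, how much of it is missing and how many complete cases you would have without it. The sketch below uses only the Python standard library. The column names (`ever_used`, `age_first_use`, `height_cm`) and the toy rows are hypothetical, and it assumes missing values are stored as `None`.

```python
def missing_report(rows, columns):
    """For each column, report the share of rows where it is missing and
    the number of complete cases left if that column were dropped."""
    n = len(rows)
    report = {}
    for col in columns:
        n_missing = sum(1 for r in rows if r.get(col) is None)
        others = [c for c in columns if c != col]
        complete_without = sum(
            1 for r in rows if all(r.get(c) is not None for c in others)
        )
        report[col] = {
            "missing_share": n_missing / n,
            "complete_cases_if_dropped": complete_without,
        }
    return report

# Hypothetical toy data: age_first_use is a follow-up question that is
# only asked when ever_used == "yes", so it is None for everyone else.
columns = ["ever_used", "age_first_use", "height_cm"]
rows = []
for i in range(10):
    rows.append({
        "ever_used": "yes" if i < 2 else "no",
        "age_first_use": 20 if i < 2 else None,
        "height_cm": None if i == 9 else 170,
    })

complete_cases = sum(
    1 for r in rows if all(r[c] is not None for c in columns)
)
report = missing_report(rows, columns)

print("complete cases with all columns:", complete_cases)
for col, stats in report.items():
    print(col, stats)
```

In this toy example only 2 of 10 rows are complete, and dropping the follow-up question alone brings that back to 9. A column that stands out this way is worth checking against the questionnaire before you treat its gaps as ordinary missing data.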

Depending on the nature of the missing data, you can address it differently in your report. But how to discuss this problematic missing rate, and what to say about it, will depend on your study and its questions.
