Solved – Is exploratory data analysis (EDA) actually needed / useful

exploratory-data-analysis, large data

There are many guides on the internet about EDA, how everyone should do it, and how useful it is. However, I rarely see it in practice, and in those tutorials it often sticks to very basic things:

  1. Dimensions of data
  2. Plotting distributions of features
  3. Linear correlation among features
  4. Missing data (interpolating, dropping etc.)
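For reference, the four checks listed above can be computed as tables rather than plots, which scales better when there are hundreds of features. The following is only a minimal sketch, assuming pandas and NumPy are available; `basic_eda_summary`, `top_k`, and the toy columns `a` to `d` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd


def basic_eda_summary(df: pd.DataFrame, top_k: int = 5) -> dict:
    """Tabular versions of the four basic checks (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")

    # Linear (Pearson) correlation; keep each pair once via the upper triangle.
    corr = numeric.corr()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

    return {
        "shape": df.shape,                                # 1. dimensions
        "distributions": numeric.describe().T,            # 2. one row per feature
        "top_correlations": pairs.head(top_k),            # 3. strongest linear pairs
        "missing_fraction": df.isna().mean().sort_values(ascending=False),  # 4.
    }


# Hypothetical toy data: b is almost a linear function of a, c has missing values.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "d": rng.integers(0, 5, size=200),
})
toy.loc[toy.index[:20], "c"] = np.nan

summary = basic_eda_summary(toy)
print(summary["shape"])
print(summary["missing_fraction"])
print(summary["top_correlations"])
```

Sorting the summaries (by missing fraction, by absolute correlation) means only the top few rows need a closer look, instead of one plot per feature.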

In my admittedly limited experience, I haven't often seen people do this in practice, especially on larger datasets where the number of features runs into the hundreds or thousands. There, some of the above EDA techniques seem more of a hindrance than a help. Am I really expected to look at hundreds of plots of feature distributions, for example?

I am not a formally trained data scientist and I am still learning. I would like to add this tool to my toolkit, but aside from contrived examples on the internet, I have rarely found such techniques useful on real datasets. I normally find myself in a cycle where I look at my data a bit, make some assumptions about what is useful, and move on to modelling it. If or when something doesn't work, I normally have a better idea of which parts of the data to look at, which saves me time when dealing with big datasets with hundreds of features.

If anyone can recommend a resource where I could improve my working / applied knowledge in this area, it would also be appreciated. I realise this is more of a soft question, but I do feel it is important to clarify. I hope that in its current form it can be seen as a question with a definitive answer.

Best Answer

I come from a traditional biostatistics/epidemiology background, and EDA is definitely useful, although that doesn't mean producing histograms and correlation plots just for the sake of it. With the prominence of machine learning and prediction, I do feel that it is practiced less and less often these days, though.

If you are in medical statistics/epidemiology, then you are usually presented with "rectangular" datasets, i.e. datasets where your rows correspond to individual participants, and columns are variables (features in machine learning terms). You typically only focus on the variables that are relevant to your questions, and that generally won't be more than a dozen or so. It is of course possible that you have more. For example, you may have data collected over time, or biomarkers, or even genetic data. In these cases, you will need to find out the best practices for dealing with these data first. Often this will involve some kind of dimension reduction or summarization. What we emphatically don't do is to just throw everything into a machine learning model and see what predictions it generates. In other words, there's a strong emphasis on understanding your model.
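As one possible illustration of the dimension reduction mentioned above, here is a minimal sketch of principal component analysis done with a plain SVD. PCA is an assumption chosen for the example, not a statement of what is best practice for any particular data type; `pca_scores` and the simulated data are hypothetical, and features on different scales would usually be standardized first.

```python
import numpy as np


def pca_scores(X: np.ndarray, n_components: int = 2):
    """Project rows of X onto the leading principal components (sketch)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # share of variance per component
    scores = Xc @ Vt[:n_components].T            # one row of scores per participant
    return scores, explained[:n_components]


# Hypothetical data: 100 participants, 300 features driven by 2 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 300))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 300))

scores, explained = pca_scores(X, n_components=2)
print(scores.shape)
print(explained)
```

The two score columns can then be plotted or summarized like any other pair of variables, which is far more manageable than 300 separate histograms.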

Given the emphasis on understanding the model, EDA is indispensable because it helps us identify the reasons for unexpected behaviour or bias in our model fitting. For example, there may be one variable you expected to be very important, and it turns out that it isn't. You look at the histogram and see that the vast majority of its values are 0. Likewise, there may be patterns in the missing data, and you need to understand them and how they may bias your results.
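Both of these checks can be run across many variables at once. The sketch below assumes pandas and NumPy; the helpers `zero_fraction` and `missingness_by_group`, and the columns `exposure`, `biomarker`, and `site`, are hypothetical and exist only to illustrate the idea.

```python
import numpy as np
import pandas as pd


def zero_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of rows equal to 0 in each numeric column (NaN counts as non-zero)."""
    numeric = df.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)


def missingness_by_group(df: pd.DataFrame, column: str, group: str) -> pd.Series:
    """Fraction of missing values in `column` within each level of `group`."""
    return df[column].isna().groupby(df[group]).mean()


# Hypothetical data: exposure is almost always 0, biomarker is mostly missing at site B.
exposure = np.zeros(100)
exposure[:5] = 1.0
biomarker = np.arange(1, 101, dtype=float)
biomarker[50:90] = np.nan
df = pd.DataFrame({
    "exposure": exposure,
    "biomarker": biomarker,
    "site": ["A"] * 50 + ["B"] * 50,
})

zeros = zero_fraction(df)
missing = missingness_by_group(df, "biomarker", "site")
print(zeros)
print(missing)
```

A variable that is 0 in nearly every row carries little information for the model, and missingness concentrated in one group suggests the data are not missing at random, which is the kind of pattern that can bias results.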

In summary, EDA is not something you do before your main analysis and forget about. It's something you keep doing together with your main analysis, to try and understand the picture better.