Data Visualization – Statistical Methods to Plot Millions of Data Points Efficiently

data-visualization, r

I find R can take a long time to generate plots when millions of points are present – unsurprising given that points are plotted individually. Furthermore, such plots are often too cluttered and dense to be useful. Many of the points overlap and form a black mass, and a lot of time is spent plotting more points into that mass.

Are there any statistical alternatives to representing large $n$ data in a standard scatterplot? I have considered a density plot, but what other alternatives are there?

Best Answer

This is a hard task with no ready-made solutions (no doubt because a density plot is such a tempting fallback that no one really bothers to look for one). So, what can you do?

If the points really overlap (i.e. have exactly the same X and Y coordinates) and you are not using alpha, the best idea is simply to remove the duplicates with unique(). With alpha, the opacity can instead be accumulated over each group of identical points.
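A minimal sketch of that idea in base R, on simulated data. The data frame d, its columns x and y, and the per-point alpha a are illustrative assumptions, not part of the original question. The opacity formula relies on the fact that stacking k points of alpha a gives a combined opacity of 1 - (1 - a)^k.

```r
# Simulated data on an integer grid, so many points coincide exactly
# (d, x, y are illustrative names)
set.seed(1)
n <- 1e5
d <- data.frame(x = sample(1:100, n, replace = TRUE),
                y = sample(1:100, n, replace = TRUE))

# Without alpha: coincident points look the same as one point, so keep one each
u <- unique(d)

# With alpha: count the points at each location ...
d$count <- 1L
counts <- aggregate(count ~ x + y, data = d, FUN = sum)

# ... and convert the count into the opacity that k stacked points would have
a <- 0.05                                # assumed per-point alpha
alpha_k <- 1 - (1 - a)^counts$count

# Draw only when a graphics device is wanted (interactive session)
if (interactive()) {
  plot(u$x, u$y, pch = 16, cex = 0.5)
  plot(counts$x, counts$y, pch = 16, cex = 0.5,
       col = rgb(0, 0, 0, alpha = alpha_k))
}
```

Either way, the number of points actually drawn is the number of distinct locations rather than n.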

If not, you can manually round the coordinates to the nearest pixel and then apply the previous method (though this is a dirty solution).
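The rounding step might look roughly like this. The plotting-region size (w by h pixels) is a hypothetical value that would have to match the actual device, and the simulated x and y are placeholders for real data.

```r
# Simulated continuous data (placeholder for the real x and y)
set.seed(2)
n <- 2e5
x <- rnorm(n)
y <- x + rnorm(n)

# Assumed size of the plotting region in pixels (hypothetical values)
w <- 800
h <- 600

# Index of the nearest pixel column (0..w-1) and row (0..h-1) for each point
px <- round((x - min(x)) / diff(range(x)) * (w - 1))
py <- round((y - min(y)) / diff(range(y)) * (h - 1))

# One integer key per pixel; keep only the first point that lands on each pixel
key <- px * h + py
keep <- !duplicated(key)

# Draw only when a graphics device is wanted (interactive session)
if (interactive()) {
  plot(x[keep], y[keep], pch = ".")
}
```

At most w * h points survive, however large n is, but the result is tied to one particular device size, which is part of what makes the approach dirty.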

Finally, you can compute a density estimate solely in order to subsample the points in the densest areas. This, however, will not produce exactly the same plot and may introduce artifacts if not carefully tuned.
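One possible sketch of density-guided subsampling follows. It uses binned counts as a crude density estimate rather than a kernel density estimate, and the grid resolution nb and per-bin cap are assumed tuning parameters chosen only for illustration.

```r
# Simulated data (placeholder for the real x and y)
set.seed(3)
n <- 2e5
x <- rnorm(n)
y <- x + rnorm(n)

nb  <- 100   # bins per axis (assumed tuning parameter)
cap <- 20    # roughly how many points to keep per bin (assumed)

# Assign every point to a cell of an nb x nb grid
ix <- cut(x, nb, labels = FALSE)
iy <- cut(y, nb, labels = FALSE)
cell <- (ix - 1) * nb + iy

# Points per cell: a crude, histogram-style density estimate
cnt <- tabulate(cell, nbins = nb * nb)

# Keep every point in sparse cells; thin dense cells down to about cap points
p <- pmin(1, cap / cnt[cell])
keep <- runif(n) < p

# Draw only when a graphics device is wanted (interactive session)
if (interactive()) {
  plot(x[keep], y[keep], pch = ".")
}
```

Outliers and sparse regions are kept in full, while the dense core is flattened to a roughly uniform fill. That flattening is exactly the kind of artifact mentioned above, so nb and cap need tuning against the real data.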