Solved – Best way to visualize scatterplot with thousands of points in a grayscale-friendly way

I have 10,000 data points like those shown in this plot:
It's comparing the running time of some piece of code with the size of the problem it's running on.
(There are 2 important steps in the code; step 1's running time is in blue and step 2's is in green.)

I'd like to keep this grayscale-friendly, because I'm hoping to publish it and it may end up being printed in grayscale.

I'm trying to figure out how to best visualize this data. Currently I'm thinking it may be best to perform kernel density estimation in log-scale and just plot a smooth surface, but I'm not sure… is there a better way to visualize it clearly?
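For reference, the kernel density idea described above might look roughly like the following sketch. The data are simulated stand-ins (the names `size`, `step1` and `step2` and all their distributions are assumptions, not the real timings), and drawing step 1 as grey shading with step 2 as dashed contours is just one possible grayscale-friendly choice.

```r
library(MASS)  # for kde2d(); MASS ships with standard R installations

# Simulated stand-in data (hypothetical; the real timings are not available)
set.seed(1)
n     <- 10000
size  <- 10^runif(n, 1, 5)                          # problem sizes
step1 <- 1e-4 * size     * exp(rnorm(n, 0, 0.5))    # step 1 running times
step2 <- 3e-4 * size^0.9 * exp(rnorm(n, 0, 0.5))    # step 2 running times

# Work on the log scale, as proposed in the question
lx <- log10(size)
l1 <- log10(step1)
l2 <- log10(step2)

# Estimate both densities on the same 100 x 100 grid so they can be overlaid
lims <- c(range(lx), range(l1, l2))
d1 <- kde2d(lx, l1, n = 100, lims = lims)
d2 <- kde2d(lx, l2, n = 100, lims = lims)

# Step 1 as grey shading (darker = denser), step 2 as dashed contour lines
image(d1, col = grey(seq(1, 0.2, length.out = 64)),
      xlab = "log10(problem size)", ylab = "log10(running time)")
contour(d2, add = TRUE, lty = 2, drawlabels = FALSE)
legend("topleft", legend = c("step 1 (shading)", "step 2 (contours)"),
       fill = c("grey50", NA), border = NA, lty = c(NA, 2), bty = "n")
```

`kde2d()` picks its bandwidths with a normal reference rule by default; with heavily overlapping clouds the result can be sensitive to that choice, so the `h` argument may need tuning.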

Best Answer

A log-log plot will spread the points out quite a bit.

If your thesis is correct, the data should tend to lie close to, or parallel to, a 45-degree line through a typical point, say (x-median, y-median).

Having seen your log-log scale plot in the comments, greyscale would be a problem because the overlap of the point clouds is so substantial even on the log scale. With color you can use transparency, but that's difficult in greyscale.

So for that issue, consider a pair of graphs, each with a LOESS curve (as well as the suggested reference line), and each also with the LOESS curve from the other plot as a dashed curve for ready comparison.
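A minimal sketch of that paired layout in base R follows. The data are simulated stand-ins (the names `size`, `step1`, `step2` and their distributions are assumptions), the helper `draw_panel()` is hypothetical, and the LOESS settings are simply R's defaults rather than anything recommended above.

```r
# Simulated stand-in data (hypothetical; the real timings are not available)
set.seed(1)
n     <- 10000
size  <- 10^runif(n, 1, 5)                          # problem sizes
step1 <- 1e-4 * size     * exp(rnorm(n, 0, 0.5))    # step 1 running times
step2 <- 3e-4 * size^0.9 * exp(rnorm(n, 0, 0.5))    # step 2 running times

# Log-log scale
lx <- log10(size)
l1 <- log10(step1)
l2 <- log10(step2)

# LOESS fits, evaluated on a common grid covering the central 99% of sizes
q    <- unname(quantile(lx, c(0.005, 0.995)))
grid <- data.frame(lx = seq(q[1], q[2], length.out = 200))
fit1 <- predict(loess(l1 ~ lx), newdata = grid)
fit2 <- predict(loess(l2 ~ lx), newdata = grid)

# One panel: grey points, slope-1 reference line through the median point,
# this step's LOESS curve (solid) and the other step's curve (dashed)
draw_panel <- function(y, own, other, title) {
  plot(lx, y, pch = ".", col = "grey60", ylim = range(l1, l2),
       xlab = "log10(problem size)", ylab = "log10(running time)", main = title)
  abline(a = median(y) - median(lx), b = 1, lty = 3)
  lines(grid$lx, own,   lwd = 2)
  lines(grid$lx, other, lwd = 2, lty = 2)
}

op <- par(mfrow = c(1, 2))
draw_panel(l1, fit1, fit2, "Step 1")
draw_panel(l2, fit2, fit1, "Step 2")
par(op)
```

Both panels share the same y-axis limits so the dashed curve can be compared directly with the solid one. The reference line has slope 1, which only appears at exactly 45 degrees if the two axes are drawn to the same scale (for example with `asp = 1`).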

Related Solutions

Solved – Using Cleveland dot plots to visualize time-series data

I don't have a copy handy, but I believe that in The Visual Display of Quantitative Information, Edward Tufte suggests that for time series charts the X axis should be reserved for the temporal dimension (simply for familiarity). He also has an example where, by connecting the observations with lines, one is able to discern periodicity that would be difficult to detect from the dots alone.

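So I would suggest a simple line plot, which with your data above could be drawn in R as:

# dat is the data frame from the question, with columns conv and date
plot(x = 1:11, y = dat$conv, type = "l", xaxt = "n")  # suppress the default x axis
axis(1, at = 1:11, labels = as.character(dat$date))   # label the ticks with the dates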

Considering the nature of the data, another question, which suggests graphing confidence intervals for the estimates, may be of interest as well.

Solved – What’s a good visualization for Poisson regressions

After you've fit the model, why not use predicted defects as a variable to compare to the others using whatever standard techniques are meaningful to them? It has the advantage of being a continuous variable so you can see even small differences. For example, people will understand the difference between an expected number of defects of 1.4 and of 0.6 even though they both round to one.

To show how the predicted value depends on two variables at once, you could draw a contour plot with time and complexity as the two axes, use colour and contours to show the predicted defects, and superimpose the actual data points on top.

The plot below needs some polishing and a legend but might be a starting point.

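[Figure: predicted average number of defects shown as colour and contours over time and complexity, with the actual data shown as circles]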

An alternative is the added variable plot, or partial regression plot, more familiar from a traditional Gaussian response regression. These are implemented in the car package. Effectively they show the relationship between what is left of the response and what is left of one of the explanatory variables, after the rest of the explanatory variables have had their contribution to both removed. In my experience most non-statistical audiences find these a bit difficult to appreciate (could be my poor explanations, of course).

[Figure: added variable plots for the fitted model]
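To make the "what is left" description concrete, here is a hand-built added variable plot for the simpler Gaussian case. This is only a sketch on simulated data: the variables `x1`, `x2` and `y` are hypothetical and unrelated to the defects example, and it covers the ordinary linear model only, not the Poisson model.

```r
# Simulated Gaussian-response data (hypothetical; not the defects data)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)               # deliberately correlated with x1
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)

# "What is left" of the response and of x1 once x2's contribution is removed
y_res  <- resid(lm(y  ~ x2))
x1_res <- resid(lm(x1 ~ x2))

# The added variable plot for x1 is residuals against residuals
plot(x1_res, y_res, col = "grey40",
     xlab = "x1 | x2", ylab = "y | x2", main = "Added variable plot for x1")
av <- lm(y_res ~ x1_res)
abline(av)

# The slope of this line equals the x1 coefficient in the full model
c(added_variable = unname(coef(av)[2]), full_model = unname(coef(full)["x1"]))
```

The agreement of the two slopes is what makes the plot useful: the scatter around the line shows how well supported that single coefficient is, with the other variable already accounted for.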

#--------------------------------------------------------------------
# Simulate some data
set.seed(123)  # arbitrary seed, added so the simulation is reproducible
n <- 200
time <- rexp(n, .01)
complexity <- sample(1:5, n, prob=c(.1,.25,.35,.2,.1), replace=TRUE)
trueMod <- exp(-1 + time*.005 + complexity*.1 + complexity^2*.05)
defects <- rpois(n, trueMod)
head(cbind(trueMod, defects))  # inspect the first few rows


#----------------------------------------------------------------------
# Fit model
model <- glm(defects~time + poly(complexity,2), family=poisson)
# all sorts of diagnostic checks should be done here - not shown


#---------------------------------------------------------------------
# Two variables at once in a contour plot

# axis values for the grid (expand.grid() below forms all 100 x 100 combinations)
gridded <- data.frame(
    time=seq(from=0, to=max(time)*1.1, length.out=100),
    complexity=seq(from=0, to=max(complexity)*1.1, length.out=100))

# create predicted values (on the original scale); time varies fastest
yhat <- predict(model, newdata=expand.grid(gridded), type="response")
zmat <- matrix(yhat, nrow=100, byrow=FALSE)  # rows index time, columns index complexity

# draw plot
image(gridded$time, gridded$complexity, zmat,
    xlab="Time", ylab="Complexity",
    main="Predicted average number of defects shown as colour and contours\n(actual data shown as circles)")
contour(gridded$time, gridded$complexity, zmat, add=TRUE,
    levels=c(1,2,4,8,15,20,30,40,50,60,70,80,100))

# Add the original data
symbols(time, complexity, circles=sqrt(defects), add=TRUE, inches=.5)

#--------------------------------------------------------------------
# added variable plots

library(car)
avPlots(model, layout=c(1,3))
