Solved – How to determine which variable goes on the X & Y axes in a scatterplot

data visualizationscatterplot

I am trying to do a scatterplot to see the relationship between literacy and baby mortality. How do I know if literacy is my X axis and baby mortality is my Y axis, or the reverse? How do I determine what goes in the X axis and the Y axis?

Best Answer

If you have a variable you see as "explanatory" and the other one as the thing being explained, then one (very common) convention is to put the explanatory variable on the x-axis and the thing being explained by it on the y-axis.

So, for example, you may be viewing the relationship between literacy and mortality as potentially causal (and thus, clearly explanatory) in that greater literacy might lead to lower mortality.

In that case it would be common to put mortality on the y-axis and literacy on the x-axis.

But it's also possible to conceive of them the other way around (high infant mortality might well affect literacy rates), or with neither being explanatory of the other.

In some cases, if one variable is 'fixed' and the other is 'random', the more common convention is that random one tends to go on the y-axis of the plot.

In some areas the conventions may tend to be flipped around; this is simply the most widespread.

Related Solutions

Solved – What’s a good way to use R to make a scatterplot that separates the data by treatment

large clusters: if overprinting is a problem, you could either use a lower alpha, so single points are dim, but overprining makes more intense colour. Or you switch to 2d histograms or density estimates.

require ("ggplot2")

ggplot (iris, aes (x = Sepal.Length, y = Sepal.Width, colour = Species)) + stat_density2d ()

You'd probably want to facet this...
ggplot (iris, aes (x = Sepal.Length, y = Sepal.Width, fill = Species)) + stat_binhex (bins=5, aes (alpha = ..count..)) + facet_grid (. ~ Species)

While you can procude this plot also without facets, the prining order of the Species influnces the final picture.
You can avoid this if you're willing to get your hands a bit dirty (= link to explanation & code) and calculate mixed colours for the hexagons:

Another useful thing is to use (hex)bins for high density areas, and plot single points for other parts:

ggplot (df, aes (x = date, y = t5)) + 
  stat_binhex (data = df [df$t5 <= 0.5,], bins = nrow (df) / 250) +
      geom_point (data = df [df$t5 > 0.5,], aes (col = type), shape = 3) +
  scale_fill_gradient (low = "#AAAAFF", high = "#000080") +
  scale_colour_manual ("response type", 
    values = c (normal = "black", timeout = "red")) + 
  ylab ("t / s")

enter image description here

For the sake of completeness of the plotting packages, let me also mention lattice:

require ("lattice")

xyplot(Sepal.Width ~ Sepal.Length | Species, iris, pch= 20)
xyplot(Sepal.Width ~ Sepal.Length, iris, groups = iris$Species, pch= 20)
xyplot(Sepal.Width ~ Sepal.Length | Species, iris, groups = iris$Species, pch= 20)

R Data Visualization – How to Extract Information from a Scatterplot Matrix with Large N, Discrete Data, & Many Variables?

I'm not sure if this is of any help for you, but for primary EDA I really like the tabplot package. Gives you a good sense of what possible correlations there may be within your data.

install.packages("tabplot")
tableplot(breast) # gives you the unsorted image below
tableplot(breast, sortCol="class") # gives you a sorted image according to class

unordered plot

Best Answer

Related Solutions

Solved – What’s a good way to use R to make a scatterplot that separates the data by treatment

R Data Visualization – How to Extract Information from a Scatterplot Matrix with Large N, Discrete Data, & Many Variables?

Related Question