Solved – Getting different results when plotting 95% CI ellipses with ggplot or the ellipse package

confidence intervalggplot2rscatterplot

I want to visualize the results of a clustering (produced with protoclust{protoclust}) by creating scater plots for each pair of variables used for classifying my data, colouring by classes and overlapping the ellipses for the 95% confidence interval for each of the classes (to inspect which elipses-classes overlap under each pair of variables).

I have implemented the drawing of the ellipses in two different ways and the resulting ellipses are different! (bigger ellipses for first implementation!)
A priori they differ only in size (some diferent scaling?), as the centers and angle of axes, seem to be the similar in both.
I guess I must be doing something wrong by using one of them (hope not with both!), or with the arguments.

Can anyone tell me what I am doing wrong?

Here the code for the two implementations; both are based on the answers to How can a data ellipse be superimposed on a ggplot2 scatterplot?

### 1st implementation 
### using ellipse{ellipse}
library(ellipse)
library(ggplot2) 
library(RColorBrewer)
colorpal <- brewer.pal(10, "Paired")

x <- data$x
y <- data$y
group <- data$group
df <- data.frame(x=x, y=y, group=factor(group))

df_ell <- data.frame() 
for(g in levels(df$group)){df_ell <- rbind(df_ell, cbind(as.data.frame(with(df[df$group==g,], ellipse(cor(x, y),scale=c(sd(x),sd(y)),centre=c(mean(x),mean(y))))),group=g))} 

p1 <- ggplot(data=df, aes(x=x, y=y,colour=group)) + geom_point() + 
  geom_path(data=df_ell, aes(x=x, y=y,colour=group))+scale_colour_manual(values=colorpal)

### 2nd implementation 
###using function ellipse_stat() 
###code by Josef Fruehwald available in: https://github.com/JoFrhwld/FAAV/blob/master/r/stat-ellipse.R

p2 <-qplot(data=df, x=x,y=y,colour=group)+stat_ellipse(level=0.95)+scale_colour_manual(values=colorpal)

Here is the two plots together (left graph is p1 implementation (ellipse()):

enter image description here

The data are available here: https://www.dropbox.com/sh/xa8xrisa4sfxyj0/l5zaGQmXJt

Best Answer

You're not doing anything wrong, the two functions are making different underlying assumptions about the distribution of the data. Your first implementation is assuming multivariate normal, and the 2nd a multivariate t-distribution (see ?cov.trob in package MASS). The effect is easier to see if you pull out one group:

#pull out group 1
pick = group ==1
p3 <- qplot(data=df[pick,], x=x, y=y)
tl = with(df[pick,], 
     ellipse(cor(x, y),scale=c(sd(x),sd(y)),
             centre=c(mean(x),mean(y))))
p3 <- p3 + geom_path(data=as.data.frame(tl), aes(x=x, y=y))
p3 <- p3 + stat_ellipse(level=0.95)
p3 # looks off center
p3 <- p3 + geom_point(aes(x=mean(x),y=mean(y),size=2,color="red"))
p3

So although it is close to the same center and orientation they are not the same. You can come close to the same size ellipse by using cov.trob() to get the correlation and scale for passing to ellipse(), and using the t argument to set the scaling equal to an f-distribution as stat_ellipse() does.

tcv = cov.trob(data[pick,2:3],cor=TRUE)
tl = with(df[pick,], 
          ellipse(tcv$cor[2,1],scale=sqrt(diag(tcv$cov)),
                  t=qf(0.95,2,length(x)-1),
                  centre=tcv$center))
p3 <- p3 + geom_path(data=as.data.frame(tl), aes(x=x, y=y,color="red"))
p3

but the correspondence still isn't exact. The difference must be arising between using the cholesky decomposition of the covariance matrix and creating the scaling from the correlation and the standard deviations. I'm not enough of a mathematician to see exactly where the difference is.

Which one is correct? That's up to you to decide! The stat_ellipse() implementation will be less sensitive to outlying points, while the first will be more conservative.

Related Solutions

Solved – Plotting multiple binhex with the same z levels

Have you thought about using the faceting capabilities within ggplot2 directly? You can allow for free scales as a parameter in the call to facet_wrap(). Here's an example for you:

library(ggplot2)
library(hexbin)
#Sample data.frame.
dat <- data.frame(
  rating = sample(500:2500, 1000, TRUE)
  , Solar.Radiation = sample(0:1200, 1000, TRUE)
  , Ambient.Temperature = sample(-10:25, 1000, TRUE)
  )

#Melt data.frame for plotting
dat.m <- melt(dat, id.vars = "rating")

#Plotting
ggplot(dat.m, aes(value, rating, fill = ..density..)) +
  stat_binhex(na.rm = TRUE) +  opts(aspect.ratio = 1) +
  facet_wrap(facets = ~  variable, scales = "free_x")

Solved – Visualizing PCA in R: data points, eigenvectors, projections, confidence ellipse

There is not much to answer here. You seem to have had some problems with your script that are by now fixed. There is currently nothing wrong with your visualization and in fact I find it a very nice and adequate illustration.

To answer your remaining questions:

The slopes of your principal axes will always be $1$ and $-1$ for a standardized two-dimensional dataset (i.e. if you are working with a correlation matrix), as @whuber said in the comments. See my answer here: Does a correlation matrix of two variables always have the same eigenvectors?
The ellipse that you plotted (according to my understanding of the source code of stat_ellipse()) is a 95% coverage ellipse assuming multivariate normal distribution. This is a reasonable choice. Note that if you want a different coverage, you can change it via level input parameter, but 95% is pretty standard and okay.

Best Answer

Related Solutions

Solved – Plotting multiple binhex with the same z levels

Solved – Visualizing PCA in R: data points, eigenvectors, projections, confidence ellipse

Related Question