Solved – How to draw a probable outcome from a distribution

data visualization, distributions

I have collected positional data. To visualize the data, I'd like to draw a 'typical' outcome of an experiment.

The data comes from a few hundred experiments, where I identify a variable number of objects at different positions relative to the origin in 2D. Thus, I can calculate the average number of objects, as well as estimate the empirical distribution of the objects. A plot of the 'typical' outcome would then have the average (or possibly mode) number of objects, say, 5. What I'm not sure about is where to position these 5 objects.

To simplify the problem, assume that the data follows a 2D normal distribution. If I were just to randomly draw 5 points from the distribution, I might get one point at [3,3], which would be a very rare outcome, and would thus not reflect the 'typical', or 'average' outcome. However, just drawing 5 points at [0,0] would also not make sense – even though [0,0] is the average position of the objects, 5 overlapping points are not an 'average' outcome of the process, either.

In other words, how can I get a 'likely' draw from a distribution?


EDIT

It looks like I should mention why I don't want to use the usual methods (like a 2D smoothed histogram, or plotting all the many points) to look at the 2D distribution.

  1. The objects are vesicles (little spheres) inside cells, and they vary in number, size and position (distribution of the distance from the cell center, amount of clustering). I would like to display all these features in one graph. Since there are several hundred cells containing many vesicles each, it is not very useful to combine them all in a single plot. I am well aware that I could use a multipanel graph showing the distributions of all parameters, but this would be a lot less intuitive.
  2. I would like to show a 'typical' cell that shows all the salient features that characterize a specific phenotype. This way, if I want to image a particular phenotype in a mixed population, I know what kind of cell I'm looking for.
  3. I think such a plot would be a cool way to display a lot of information at once, and I just want to try.

Maybe it would be clearer if I said that I want to simulate a likely experimental result based on my measurements?

Best Answer

I also think that it's not clear what you want. But if you want a set of deterministically chosen points that preserve the mean and covariance of the initial distribution, you can use the sigma point selection method used in the unscented Kalman filter.

Say that you want to select $2L+1$ points that fulfill those requirements, where $L$ is the dimension of the data. Then proceed in the following way:

$\mathcal{X}_0=\overline{x} \qquad w_0=\frac{\kappa}{L+\kappa} \qquad i=0$

$\mathcal{X}_i=\overline{x}+\left(\sqrt{(\:L+\kappa\:)\:\mathbf{P}_x}\right)_i \qquad w_i=\frac{1}{2(L+\kappa)} \qquad i=1, \dots,L$

$\mathcal{X}_i=\overline{x}-\left(\sqrt{(\:L+\kappa\:)\:\mathbf{P}_x}\right)_i \qquad w_i=\frac{1}{2(L+\kappa)} \qquad i=L+1, \dots,2L$

where $w_i$ is the weight of the i-th point, $\overline{x}$ is the mean and $\mathbf{P}_x$ is the covariance matrix of the data,

$\kappa=3-L$ (in the case of normally distributed data),

and $\left(\sqrt{(\:L+\kappa\:)\mathbf{P}_x}\right)_i$ is the i-th row (or column)* of the matrix square root of the scaled covariance matrix $(\:L+\kappa\:)\:\mathbf{P}_x$ (usually obtained from the Cholesky decomposition).

* If the matrix square root $\mathbf{A}$ satisfies $\mathbf{A}^T\mathbf{A}=(\:L+\kappa\:)\:\mathbf{P}_x$, then use the rows of $\mathbf{A}$. If it satisfies $\mathbf{A}\mathbf{A}^T=(\:L+\kappa\:)\:\mathbf{P}_x$, then use the columns of $\mathbf{A}$. The result of the MATLAB function chol() falls into the first category.
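
As a short check of why these points and weights reproduce the first two moments, here is a sketch of the argument. The symbol $s_i$ is introduced here only for illustration: it denotes the i-th row (or column) of the matrix square root written as a column vector, so that $\sum_{i=1}^{L} s_i s_i^T=(L+\kappa)\,\mathbf{P}_x$.

The weights sum to one:

$w_0+\sum_{i=1}^{2L} w_i=\frac{\kappa}{L+\kappa}+\frac{2L}{2(L+\kappa)}=1$

The offsets $+s_i$ and $-s_i$ cancel in pairs, so the weighted mean is $\overline{x}$:

$\sum_{i=0}^{2L} w_i\,\mathcal{X}_i=\overline{x}+\frac{1}{2(L+\kappa)}\sum_{i=1}^{L}\left(s_i-s_i\right)=\overline{x}$

The central point contributes nothing to the covariance, and each $s_i$ appears twice:

$\sum_{i=0}^{2L} w_i\,(\mathcal{X}_i-\overline{x})(\mathcal{X}_i-\overline{x})^T=\frac{2}{2(L+\kappa)}\sum_{i=1}^{L} s_i s_i^T=\mathbf{P}_x$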

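Here is a simple example using R:

```r
x <- rnorm(1000, 5, 2.5)
y <- rnorm(1000, 2, 1)

P  <- cov(cbind(x, y))      # sample covariance of the data
V0 <- c(mean(x), mean(y))   # sample mean, used as the central sigma point
n <- 2; k <- 1              # dimension L and kappa = 3 - L
A <- chol((n + k) * P)      # matrix square root; t(A) %*% A == (n + k) * P, so use the rows of A

# Columns V1..V4 hold the 2n outer sigma points: V0 + A[i, ] and V0 - A[i, ].
# The data frame is called pts to avoid masking graphics::points().
pts <- as.data.frame(sapply(1:(2 * n), function(i) if (i <= n) A[i, ] + V0 else -A[i - n, ] + V0))

# weighted mean (equals V0); with() is used instead of attach()
with(pts, 1 / (2 * (n + k)) * (V1 + V2 + V3 + V4) + k / (n + k) * V0)
# weighted covariance (equals P)
with(pts, 1 / (2 * (n + k)) * ((V1 - V0) %*% t(V1 - V0) + (V2 - V0) %*% t(V2 - V0) + (V3 - V0) %*% t(V3 - V0) + (V4 - V0) %*% t(V4 - V0)))
```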
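
The same steps can be wrapped in a small function that works for any dimension. This is only a sketch: the name `sigma_points`, its arguments and the example mean and covariance are made up for illustration and are not part of the answer above. Note that in 2D the method returns exactly $2L+1=5$ points, which happens to match the five objects in the question. Whether those five points look like a 'typical' cell is a separate judgment call.

```r
# Hypothetical helper (name and interface are assumptions, not from the answer):
# returns the 2L+1 sigma points (one per row) and their weights.
sigma_points <- function(mu, P, kappa = 3 - length(mu)) {
  L <- length(mu)
  stopifnot(L + kappa > 0)          # the scaled covariance must stay positive definite
  A <- chol((L + kappa) * P)        # t(A) %*% A == (L + kappa) * P, so the rows are used
  pts <- rbind(mu,
               sweep(A, 2, mu, "+"),    # mu + i-th row of A
               sweep(-A, 2, mu, "+"))   # mu - i-th row of A
  w <- c(kappa / (L + kappa), rep(1 / (2 * (L + kappa)), 2 * L))
  list(points = unname(pts), weights = w)
}

# Made-up example: zero mean and an arbitrary 2D covariance matrix
mu <- c(0, 0)
P  <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
sp <- sigma_points(mu, P)

# Weighted mean and covariance of the five points
m_hat    <- colSums(sp$weights * sp$points)     # row i is scaled by weight i
centered <- sweep(sp$points, 2, m_hat)          # subtract the mean from every row
P_hat    <- t(centered) %*% (sp$weights * centered)
```

Here `m_hat` and `P_hat` should agree with `mu` and `P` up to floating-point error, and `plot(sp$points)` would draw the five positions.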