Cumulative Distribution Function – Finding Cumulative Distribution Functions and Merging Them: Step-by-Step Guide

Tags: cumulative distribution function, mathematical-statistics, r, self-study

I made up a data set with n = 314, mean = 14.27854, and standard deviation = 2.16547 using p <- rnorm(314, 14.27854, 2.16547). Now I want to compare the theoretical cumulative distribution with the empirical cumulative distribution by drawing both on the same graph in R. I thought I could find the empirical cumulative distribution using ecdf(p), but I could not find a way to get the theoretical one. Moreover, I am a beginner in R, and I do not know how to show these two curves in one graph.

First, is the ecdf(p) code correct?

Second, how can I find the theoretical cumulative distribution?

Third, how can I show them together in one graph in R?

Best Answer

In order of your questions:

1. Yes, that is a correct way, but there may be a better way for your particular problem.
2. Assuming your mean and SD are the true population parameters, you can compute the theoretical CDF values using pnorm.
3. It depends on what you want. Do you want to show the two CDFs overlaid? They will probably be very close and overwrite each other, as 314 observations should make for a decent Gaussian sample. Or do you want to plot one against the other, like a qqplot?

Here is some code based on your questions which I believe should help:

# Mean
m <- 14.27854

# SD
s <- 2.16547

# Obs
n <- 314

# Set seed for repeatability
set.seed(45L)

# Generate observations
A <- rnorm(n, m, s)

# Manually create CDF table by sorting the empirical observations and using the
# convention that the points are plotted at the END (so first observation starts
# at 1 / 314, etc.)
empCDF <- data.frame(x = sort(A), p = seq_len(n) / n)


# True CDF applied to observations. empCDF$x is the sorted A's
trueCDF <- pnorm(empCDF$x, m, s)

# Overlay the two CDFs (left panel) and plot one against the other (right panel)
par(mfrow = c(1L, 2L))
plot(empCDF$x, empCDF$p, type = 'l')
lines(empCDF$x, trueCDF, type = 'l', col = 'blue')
plot(trueCDF, empCDF$p, type = 'l')
abline(0,1)
par(mfrow = c(1L, 1L))

Drawing more observations would give a closer fit. Here is the result of the exact same code using $n = 10,000$.
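One way to put a number on how close the two curves are is the largest vertical gap between the empirical and the true CDF (the Kolmogorov-Smirnov distance). The following is a minimal sketch and not part of the original answer: the helper name max_cdf_gap is made up for illustration, and the mean, SD, and seed are simply reused from the code above.

```r
m <- 14.27854  # mean, as above
s <- 2.16547   # SD, as above

# Hypothetical helper: largest vertical distance between the empirical CDF
# of x and the normal CDF with mean m and SD s.
max_cdf_gap <- function(x, m, s) {
  n <- length(x)
  true_p <- pnorm(sort(x), m, s)
  # The empirical CDF jumps from (i - 1) / n to i / n at the i-th sorted
  # observation, so the gap has to be checked on both sides of each jump.
  max(seq_len(n) / n - true_p, true_p - (seq_len(n) - 1) / n)
}

set.seed(45L)
gap_small <- max_cdf_gap(rnorm(314L, m, s), m, s)
gap_large <- max_cdf_gap(rnorm(10000L, m, s), m, s)

print(c(n_314 = gap_small, n_10000 = gap_large))
```

The same quantity is reported as the statistic of ks.test(x, "pnorm", m, s), so the helper is only there to make the calculation explicit. The gap typically shrinks at a rate of roughly 1 / sqrt(n) as the sample grows.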

Update

To show how using 10,000 observations makes the results very close, I will redo the plots with two line types and thicker lines to show they are both there. I will also change the empirical to red for contrast. The blue will remain the true CDF.

# Overlay the two CDFs and plot one against the other
# Split screen into two windows next to each other: 1 row and 2 columns
par(mfrow = c(1L, 2L))
# Plot the empirical first in red
plot(empCDF$x, empCDF$p, type = 'l', col = 'red')
# Add (lines adds to existing plot) the true value in thick dotted blue (lty = 3 is dotted)
lines(empCDF$x, trueCDF, type = 'l', col = 'blue', lwd = 3L, lty = 3L)
# Now in the second window, plot the empirical against the true
plot(trueCDF, empCDF$p, type = 'l')
# Going forward, make R use one window per plot as usual
par(mfrow = c(1L, 1L))
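For completeness, the ecdf() function mentioned in the question can produce the overlay more directly, since it returns a step function that has its own plot method. This is only a sketch of that alternative; the plot title, axis labels, and legend text are my own choices.

```r
m <- 14.27854
s <- 2.16547
set.seed(45L)
A <- rnorm(314L, m, s)

# ecdf() returns a function: Fn(q) is the share of observations <= q
Fn <- ecdf(A)

# Draw the empirical CDF as a red step function ...
plot(Fn, do.points = FALSE, verticals = TRUE, col = 'red',
     main = 'Empirical vs. theoretical CDF', xlab = 'x', ylab = 'F(x)')
# ... and add the theoretical normal CDF as a dashed blue curve
curve(pnorm(x, m, s), add = TRUE, col = 'blue', lwd = 2, lty = 2)
legend('topleft', legend = c('empirical', 'theoretical'),
       col = c('red', 'blue'), lty = c(1, 2), bty = 'n')
```

Unlike the manual table, this draws the empirical CDF as true steps rather than connecting the points with straight segments, which matters more for small samples than for 314 or 10,000 observations.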

Related Solutions

Solved – What does weighted cumulative frequency distribution mean

Here are a couple of base R suggestions: one for when the weights are integers that are not too large, and a second for when the weights are simply positive.

# example data
df <- data.frame(temp=c(50,20,10,40), weight=c(3,1,4,2))

# unweighted empirical CDF
plot.ecdf(df$temp,
  main="unweighted ecdf")

# weighted empirical CDF if weights are positive integers or counts
plot.ecdf(rep(df$temp, df$weight),
  main="weighted ecdf 1 - using counts")

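# weighted empirical CDF if weights are positive
dfsorted <- df[order(df$temp), ]
dfsorted$cumfreq <- cumsum(dfsorted$weight) / sum(dfsorted$weight)
# duplicate each row so every temp value has two points: bottom and top of its jump
dfsorted2 <- dfsorted[rep(seq_len(nrow(dfsorted)), each=2),]
# shift cumfreq by one position so the first copy of each row holds the previous level
dfsorted2$cumfreq <- c(0,dfsorted2$cumfreq[-2*nrow(dfsorted)])
plot(dfsorted2$temp, dfsorted2$cumfreq, type="l",
  main="weighted ecdf 2 - general weights", xlab="temp", ylab="cumfreq")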
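If you need the weighted ecdf as a function you can evaluate at arbitrary points, rather than only as a plot, something like the following sketch may help. The name weighted_ecdf is made up for this illustration and is not a base R function; it assumes strictly positive weights and reuses the example data from above.

```r
# Hypothetical helper: returns a function q -> weighted share of x values <= q
weighted_ecdf <- function(x, w) {
  stopifnot(length(x) == length(w), all(w > 0))
  xs <- sort(unique(x))
  # total weight at each distinct value (this also handles tied x values)
  ws <- vapply(xs, function(v) sum(w[x == v]), numeric(1))
  cum <- cumsum(ws) / sum(ws)
  # findInterval() counts how many distinct values are <= q
  function(q) c(0, cum)[findInterval(q, xs) + 1L]
}

df <- data.frame(temp = c(50, 20, 10, 40), weight = c(3, 1, 4, 2))
Fw <- weighted_ecdf(df$temp, df$weight)

Fw(c(5, 10, 15, 45, 50))
# 0.0 0.4 0.4 0.7 1.0
```

With integer weights this agrees with ecdf(rep(df$temp, df$weight)), which is a quick way to check it.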

[Figures not preserved: the unweighted ecdf, the first weighted ecdf, and the second weighted ecdf.]

Solved – Correlations of correlations using 3 data sets

In very broad terms I'd question the value of this. It is easy to concoct examples in which correlations are similar but the relationships between variables are different -- and in which correlations are different but the relationships between variables are similar. I write not only as someone interested in statistics but also as someone whose main applications are with environmental data.
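A standard illustration of the first point, which I am adding here as an example rather than taking from the question's data, is Anscombe's quartet. It ships with R as the anscombe data frame: four x-y pairs whose correlations are all close to 0.82, yet whose scatter plots look nothing alike.

```r
# anscombe is in the datasets package: columns x1..x4 and y1..y4
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0('x', i)]], anscombe[[paste0('y', i)]])
})
print(round(r, 2))

# Same correlation, very different relationships
par(mfrow = c(2L, 2L))
for (i in 1:4) {
  plot(anscombe[[paste0('x', i)]], anscombe[[paste0('y', i)]],
       xlab = paste0('x', i), ylab = paste0('y', i),
       main = paste('r =', round(r[i], 2)))
}
par(mfrow = c(1L, 1L))
```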

Also, what you are proposing puts enormous weight on correlations as a catch-all summary measure, which necessarily cannot do justice to nonlinearities, clustering, outliers, etc., which are commonplace with environmental data. An analysis of analyses is not out of the question, but the great risk is that each analysis step is a step away from the data you are trying to understand.

Yet another negative: It is difficult to make sense of your graphs without labelling which correlation is which using the names of the variables. You have presumably 91 correlations, but labelling them all will just be confusing; labelling none of them will just be uninformative.

Suggesting a positive alternative would depend on a deeper acquaintance with your scientific objectives, but if these were my data I would start with a single pooled multivariate analysis of three regions and then see whether regions cluster in some low-dimensional subspace. PCA does indeed spring to mind if your variables are mostly or all measured variables.
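As a rough sketch of what such a pooled analysis could look like in R: everything below is simulated, since I do not have your data. The three region labels, the 14 variables (which would give your 91 pairwise correlations), the sample sizes, and the shift applied to region C are all assumptions made purely for illustration.

```r
set.seed(1L)
n_per_region <- 50L
p <- 14L  # assumed number of measured variables

# Made-up data: three regions, with region C shifted away from A and B
make_region <- function(shift) {
  matrix(rnorm(n_per_region * p, mean = shift), ncol = p)
}
X <- rbind(make_region(0), make_region(0), make_region(2))
colnames(X) <- paste0('var', seq_len(p))
region <- factor(rep(c('A', 'B', 'C'), each = n_per_region))

# Single pooled PCA on standardized variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Do the regions separate in the first two components?
plot(scores, col = as.integer(region), pch = 1, xlab = 'PC1', ylab = 'PC2')
legend('topright', legend = levels(region),
       col = seq_along(levels(region)), pch = 1, bty = 'n')
```

With real data, summary(pca) would show how much variance the first few components capture, which is worth checking before reading much into a two-dimensional plot.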

You describe yourself as an R user, but your graphs look to me like Excel defaults. I suggest that your graphs should show bounds of $[-1,1]$ on both axes; shift the $x$ axis with its numeric labels away from the middle of the graph; and use open or hollow symbols such as "o" rather than solid symbols.
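In base R those plotting suggestions amount to a few arguments. The sketch below uses simulated data for two hypothetical data sets with 14 variables each, so the values themselves mean nothing; only the plotting settings are the point.

```r
set.seed(2L)
p <- 14L  # 14 variables give choose(14, 2) = 91 pairwise correlations

# Simulated stand-ins for two data sets
X1 <- matrix(rnorm(100L * p), ncol = p)
X2 <- matrix(rnorm(100L * p), ncol = p)

# Keep each pairwise correlation once (lower triangle of the matrix)
C1 <- cor(X1)
C2 <- cor(X2)
r1 <- C1[lower.tri(C1)]
r2 <- C2[lower.tri(C2)]

# Both axes bounded at [-1, 1], open symbols, axes drawn at the plot edges
plot(r1, r2, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1, pch = 1,
     xlab = 'correlation, data set 1', ylab = 'correlation, data set 2')
abline(0, 1, lty = 2)  # reference line of equal correlations
```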

P.S. In statistics, parameters and variables are not alternative terms. Your parameters are all variables.

Whether you are a student or a professional, you might benefit from finding a friendly local statistician, or someone in your field with more statistical experience, to talk to.

(LATER) If you are determined to do this, an extension of @Dualinity's approach to parallel coordinate plots might help.
