Cumulative Distribution Function – Finding Cumulative Distribution Functions and Merging Them: Step-by-Step Guide

Tags: cumulative distribution function, mathematical-statistics, r, self-study

I made up a data set with n = 314, mean = 14.27854, and standard deviation = 2.16547 using p <- rnorm(314, 14.27854, 2.16547). Now I want to compare the theoretical cumulative distribution with the empirical cumulative distribution by drawing both on the same graph in R. I thought I could find the empirical cumulative distribution using ecdf(p), but I could not find a way to get the theoretical cumulative distribution. I am also a beginner in R, and I do not know how to show these two curves in one graph.

Firstly, is the ecdf(p) code correct?

Secondly, how can I find the theoretical cumulative distribution?

Thirdly, how can I show them together in one graph in R?

Best Answer

In order of your questions:

  1. Yes, that is a correct way, but there may be a better way for your particular problem.
  2. Assuming your mean and SD are the true population parameters, you can compute the theoretical CDF values using pnorm.
  3. It depends on what you want. Do you want to show the two CDFs overlaid? They will probably be very close and overwrite each other, as 314 observations should make for a decent Gaussian sample. Or do you want to plot one against the other, like a qqplot?

Here is some code based on your questions which I believe should help:

# Mean
m <- 14.27854

# SD
s <- 2.16547

# Obs
n <- 314

# Set seed for repeatability
set.seed(45L)

# Generate observations
A <- rnorm(n, m, s)

# Manually create CDF table by sorting the empirical observations and using the
# convention that the points are plotted at the END (so first observation starts
# at 1 / 314, etc.)
empCDF <- data.frame(x = sort(A), p = seq_len(n) / n)


# True CDF applied to observations. empCDF$x is the sorted A's
trueCDF <- pnorm(empCDF$x, m, s)

# Left panel: overplot the two CDFs. Right panel: plot one against the other
par(mfrow = c(1L, 2L))
plot(empCDF$x, empCDF$p, type = 'l')
lines(empCDF$x, trueCDF, type = 'l', col = 'blue')
plot(trueCDF, empCDF$p, type = 'l')
abline(0,1)
par(mfrow = c(1L, 1L))

[Figure: CDF and QQ plot]
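The right-hand panel above plots probabilities against probabilities (often called a P-P plot). If a quantile-quantile comparison is what you are after, a minimal sketch could look like the following. It assumes the same made-up mean, SD, and seed as above; the names theoQ and empQ are introduced here only for illustration.

```r
# Same parameters and seed as in the code above
m <- 14.27854
s <- 2.16547
n <- 314

set.seed(45L)
A <- rnorm(n, m, s)

# Theoretical quantiles of N(m, s) at n evenly spread probabilities.
# ppoints() avoids the probabilities 0 and 1, whose normal quantiles are infinite
theoQ <- qnorm(ppoints(n), mean = m, sd = s)

# The sorted observations serve as the empirical quantiles
empQ <- sort(A)

# Points close to the 45-degree line indicate a good fit
plot(theoQ, empQ, xlab = "Theoretical quantiles", ylab = "Empirical quantiles")
abline(0, 1, col = "blue")
```

Base R's qqnorm(A) followed by qqline(A) gives a similar picture, but against the standard normal rather than N(m, s).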

A larger sample would give a closer fit. Here is the result of the exact same code using $n = 10,000$.

[Figure: second set of plots, n = 10,000]
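To put a rough number on "closer", one option is to look at the largest vertical gap between the empirical and true CDFs for both sample sizes. This is only a sketch: maxGap is a hypothetical helper, it measures the gap at the sorted observations only (using the same i / n convention as the table above), and the exact values depend on the seed.

```r
m <- 14.27854
s <- 2.16547

# Largest absolute difference between the empirical CDF (i / n convention)
# and the true normal CDF, evaluated at the sorted observations
maxGap <- function(n) {
  x <- sort(rnorm(n, m, s))
  max(abs(seq_len(n) / n - pnorm(x, m, s)))
}

set.seed(45L)
gaps <- sapply(c(314, 10000), maxGap)
names(gaps) <- c("n = 314", "n = 10000")
print(gaps)
```

The gap for the larger sample should typically come out noticeably smaller. For a formal version of this comparison, ks.test(A, "pnorm", m, s) reports the Kolmogorov-Smirnov distance.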

Update

To show how using 10,000 observations makes the results very close, I will redo the plots with two line types and thicker lines to show they are both there. I will also change the empirical to red for contrast. The blue will remain the true CDF.

# Overplot the CDFs, then plot one against the other
# Split screen into two windows next to each other: 1 row and 2 columns
par(mfrow = c(1L, 2L))
# Plot the empirical first in red
plot(empCDF$x, empCDF$p, type = 'l', col = 'red')
# Add (lines adds to existing plot) the true value in thick dotted blue (lty = 3)
lines(empCDF$x, trueCDF, type = 'l', col = 'blue', lwd = 3L, lty = 3L)
# Now in the second window, plot the empirical against the true
plot(trueCDF, empCDF$p, type = 'l')
# Going forward, make R use one window per plot as usual
par(mfrow = c(1L, 1L))

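As a more compact alternative to the manual table, the ecdf(p) call from the question can be plotted directly and the theoretical curve added on top. The following is a minimal sketch, assuming the same made-up mean, SD, and seed as above; the colours, title, and legend labels are illustrative choices.

```r
# Same made-up parameters as in the question
m <- 14.27854
s <- 2.16547
n <- 314

set.seed(45L)
p <- rnorm(n, m, s)

# ecdf() returns a step function: Fn(x) is the share of observations <= x
Fn <- ecdf(p)

# Draw the empirical CDF as steps, without a dot at every jump
plot(Fn, do.points = FALSE, col = "red",
     main = "Empirical vs. theoretical CDF", xlab = "x", ylab = "F(x)")

# Overlay the theoretical normal CDF; curve() evaluates the expression in x
curve(pnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2, lty = 2)

legend("topleft", legend = c("Empirical (ecdf)", "Theoretical (pnorm)"),
       col = c("red", "blue"), lty = c(1, 2), lwd = c(1, 2), bty = "n")
```

Because Fn is an ordinary function, it can also be evaluated at any point, for example Fn(14) for the share of observations at or below 14.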