R Probability Distribution – How to Graph and Explore Different Solutions

distributionsmathematical-statisticsprobabilityrself-study

I am trying to solve questions in my statistics book. There is a question a want you check me and help for the part iv :

Consider a population of 100 computer, 45 of whom are broken, and the rest are nonbroken.

i) If you select 10 computer at random but with replacement, find the probability that 4 are
broken.

ii) Now assume you select ten people without replacement, find the same probability.

iii) What is the probability distribution that gives probabilities in
i) and what is it in ii)

iv) Using cumulative probability distribution(s) and graph(s) try to show that probabilities >converge between the cases with replacement and without the replacement as the population size
increases relative to size of selection, i.e. sample size.

My answers:

iv-) I cannot do this part

I stuck in part iv , i am beginner in R language so i could not write codes for part iv.Can you help me for it , i need to show it using cumulative probability distribution(s) and graph(s) in R.Moreover , are i ,ii,iii correct ? Thanks in advance

Best Answer

Graphical comment:

For whatever help it may provide, the CDFs of the binomial distribution in (i) and the hypergeometric distribution in (ii) are almost the same, because only 10 of 100 people have been sampled.

A plot of the two CDFs from R is shown below. (The vertical resolution of such a graph is about two decimal places.)

R code for the figure (using base R graphics).

x = 0:10
Bino.CDF = pbinom(x, 10, .45)
Hypr.CDF = phyper(x, 45,55, 10)
hdr = "Binomial (solid blue) and Hypergeometric (dotted) CDFs"
plot(x, Bino.CDF, type="s", col="blue", ylab="CDF", main=hdr)
 points(x, Bino.CDF, pch=20, col="blue")
lines(x, Hypr.CDF, type="s", lwd=3, lty="dotted", col="brown")
 points(x, Hypr.CDF, col="brown")

Related Solutions

Probability Sampling – Determining the Probability of a Unit Being Selected in a Sample

Suppose the population is $a_{1}, a_{2}, \ldots, a_{N}$. We choose a sample of size $n$ from the population using simple random sampling and want to know the probability that a specific element, say $a_{1}$, is in the sample.

First, note that the probability of an element being included in the sample is the same whether the sample is ordered or unordered. (One way to see that is to note that we can change ordered samples into unordered samples by ignoring the ordering. If we know the probability $a_{1}$ is in any of the ordered samples, that must also be the probability that $a_{1}$ is in any of the unordered samples.) So we can find just one probability and get the result for both ordered and unordered cases.

Sampling without replacement (unordered)

There are ${N}\choose{n}$ possible ways to choose an unordered sample of size $n$ from a population of size $N$ without replacement. Crucially, we also know that each one of those options are equally likely. How many ways are there to choose a sample containing $a_{1}$? If we must include $a_{1}$ then we have to choose the remaining $n-1$ elements from the remaining $N-1$ possibilities, which gives ${N-1}\choose{n-1}$ possible samples. Because we know that each possibility is equally likely, we can find the probability of choosing $a_{1}$ by taking the ratio

$$\frac{{N-1}\choose{n-1}}{{N}\choose{n}} =\frac{n}{N}$$

Sampling with replacement (ordered)

There are $N^n$ possible samples and all of them are equally likely. Finding the number of possible samples that contain $a_{1}$ is a bit more tricky, but can be done by finding the complement instead. If we find the number of samples that don't contain $a_{1}$ and subtract that from $N^n$, we will have the number of samples that do contain $a_{1}$. That can be done since the number of samples that don't contain $a_{1}$ is $(N-1)^n$, giving the number of samples that do contain $a_{1}$ as $N^{n} - (N-1)^n$. This gives the probability of $a_{1}$ being in the sample as

$$\frac{N^{n} - (N-1)^n}{N^{n}} = 1-\left(\frac{N-1}{N}\right)^n$$

By symmetry these probabilities are the same for every $a_{i}$.

Cumulative Distribution Function – Finding Cumulative Distribution Functions and Merging Them: Step-by-Step Guide

In order of your questions:

Yes, that is a correct way, but there may be a better way for your particular problem
Assuming your mean and SD are the true population statistics, you can generate the proper values using pnorm.
It depends on what you want. Do you want to show the two CDFs overlayed? They will probably be very close and overwrite each other as 314 observations should make for a decent Gaussian sample. Or do you want to plot one against the other like a qqplot?

Here is some code based on your questions which I believe should help:

# Mean
m <- 14.27854

# SD
s <- 2.16547

# Obs
n <- 314

# Set seed for repeatibility
set.seed(45L)

# Generate observations
A <- rnorm(n, m, s)

# Manually create CDF table by sorting the empirical observations and using the
# convention that the points are plotted at the END (so first observation starts
# at 1 / 314, etc.)
empCDF <- data.frame(x = sort(A), p = seq_len(n) / n)


# True CDF applied to observations. empCDF$x is the sorted A's
trueCDF <- pnorm(empCDF$x, m, s)

# Overplot CDFs and against each other
par(mfrow = c(1L, 2L))
plot(empCDF$x, empCDF$p, type = 'l')
lines(empCDF$x, trueCDF, type = 'l', col = 'blue')
plot(trueCDF, empCDF$p, type = 'l')
abline(0,1)
par(mfrow = c(1L, 1L))

Running more simulations would provide a better fit. Here is the result of the exact same code using $n = 10,000$.

Update

To show how using 10,000 observations makes the results very close, I will redo the plots with two line types and thicker lines to show they are both there. I will also change the empirical to red for contrast. The blue will remain the true CDF.

# Overplot CDFs and against each other
# Split screen into two windows next to each other: 1 row and 2 columns
par(mfrow = c(1L, 2L))
# Plot the empirical first in red
plot(empCDF$x, empCDF$p, type = 'l', col = 'red')
# Add (lines adds to existing plot) the true value in thick dashed blue
lines(empCDF$x, trueCDF, type = 'l', col = 'blue', lwd = 3L, lty = 3L)
# Now in the second window, plot the empirical against the true
plot(trueCDF, empCDF$p, type = 'l')
# Going forward, make R use one window per plot as usual
par(mfrow = c(1L, 1L))

Best Answer

Related Solutions

Probability Sampling – Determining the Probability of a Unit Being Selected in a Sample

Cumulative Distribution Function – Finding Cumulative Distribution Functions and Merging Them: Step-by-Step Guide

Update

Related Question