[Math] How to find if the probability of the sample proportion is greater than something

probability distributions, statistics

I have this problem and I have no clue how to solve it.
In 2012, 31% of the adult population in the US had earned a bachelor’s degree or higher. One hundred people are randomly sampled from the population. What is the probability that the sample proportion p-hat is greater than 0.40?

Best Answer

DeepSea's answer is a pretty standard way to do this. But it's important to remember that treating $\hat p$ as Normally distributed is an approximation (albeit quite a good one most of the time).

If you wanted to have a more "exact" answer, you could use the Binomial distribution.

Let $Y$ be the number of people in the sample who have earned a Bachelor's degree or higher. Then $Y \sim \mathrm{Binom}(100, 0.31)$ and $\hat p = \frac{Y}{n}$ with $n = 100$. Therefore:

\begin{equation} P(\hat p > .40) = P(Y > 40) = 1 - F_Y(40) = 0.0218 \end{equation}

Notice that DeepSea's answer gives a good approximation. But this answer is more "exact".
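To make the comparison concrete, here is a minimal Python sketch that sums the Binomial tail directly and sets it beside a plain Normal approximation. The function names are my own, and the Normal version uses no continuity correction, so it is not necessarily the exact calculation in DeepSea's answer.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_upper_tail(n: int, p: float, k: int) -> float:
    """Exact P(Y > k) for Y ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1, n + 1))


def normal_upper_tail(n: int, p: float, x: float) -> float:
    """Normal approximation to P(p_hat > x), without continuity correction."""
    se = sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return 1 - NormalDist(mu=p, sigma=se).cdf(x)


n, p = 100, 0.31
exact = binom_upper_tail(n, p, 40)      # P(Y > 40) = P(p_hat > 0.40)
approx = normal_upper_tail(n, p, 0.40)  # P(p_hat > 0.40) under the Normal model
print(f"exact binomial tail:  {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

Both numbers should land in the same small range, which is the sense in which the Normal answer is a good approximation.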

Related Solutions

Statistics – How Does Accuracy of a Survey Depend on Sample Size and Population Size?

Note: For convenience, in the following I use $N$ for the size of the population and $n$ for the sample size.

In order to answer OP's questions we start with some preliminary work and describe the current situation in somewhat more detail.

Current situation:

Here we have a simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.

We are in a sampling situation where the object is to estimate the proportion of units in a population having some attributes. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.

Writing $p$ for the proportion in the population of size $N$ with the attribute \begin{align*} p=\frac{1}{N}\sum_{i=1}^{N}y_i=\mu \end{align*} the finite population variance is (using $y_i^2=y_i$ for an indicator variable) \begin{align*} \sigma^2&=\frac{\sum_{i=1}^{N}(y_i-p)^2}{N-1}=\frac{\sum_{i=1}^{N}y_i^2-Np^2}{N-1} =\frac{Np-Np^2}{N-1}\\ &=\frac{N}{N-1}p(1-p) \end{align*} Now letting $\hat{p}$ denote the proportion in the sample of size $n$ with the attribute \begin{align*} \hat{p}=\frac{1}{n}\sum_{i=1}^n{y_i}=\bar{y} \end{align*} the sample variance is \begin{align*} s^2&=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}=\frac{\sum_{i=1}^{n}y_i^2-n\hat{p}^2}{n-1}\\ &=\frac{n}{n-1}\hat{p}(1-\hat{p}) \end{align*}

Note that the sample proportion is the sample mean of a simple random sample; it is unbiased for the population proportion and has variance \begin{align*} \mathop{var}(\hat{p})=\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}\tag{1} \end{align*}
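As a sanity check on the variance formula, the following sketch computes the variance of $\hat{p}$ exactly for a small population by summing over the hypergeometric distribution of the sample count, then compares it with the closed form. The function names and the values $N=50$, $K=20$, $n=10$ are illustrative assumptions, not part of the original answer.

```python
from math import comb


def var_p_hat_exact(N: int, K: int, n: int) -> float:
    """Variance of p_hat under simple random sampling without replacement.

    K is the number of population units with the attribute; the sample
    count then follows a hypergeometric distribution.
    """
    total = comb(N, n)
    lo, hi = max(0, n - (N - K)), min(n, K)
    pmf = {k: comb(K, k) * comb(N - K, n - k) / total for k in range(lo, hi + 1)}
    mean = sum(k / n * w for k, w in pmf.items())
    return sum((k / n - mean) ** 2 * w for k, w in pmf.items())


def var_p_hat_formula(N: int, K: int, n: int) -> float:
    """Closed form: (N - n) / (N - 1) * p (1 - p) / n with p = K / N."""
    p = K / N
    return (N - n) / (N - 1) * p * (1 - p) / n


N, K, n = 50, 20, 10  # illustrative values only
print(var_p_hat_exact(N, K, n), var_p_hat_formula(N, K, n))
```

The two printed values should agree up to floating-point rounding.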

Before we can answer OP's questions we have to do some general

Considerations regarding accuracy:

Suppose that one wishes to estimate a population parameter $\theta$ (for example the population mean, total, or proportion of an attribute of the units of the population) with an estimator $\hat{\theta}$. Then we would wish the estimate to be close to the true value with high probability.

So, specifying a maximum allowable difference $d$ between the estimate and the true value and allowing for a small probability $\alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that \begin{align*} P(|\hat{\theta}-\theta|>d)<\alpha\tag{2} \end{align*} If the estimator $\hat{\theta}$ is an unbiased, normally distributed estimator of $\theta$, then $\frac{\hat{\theta}-\theta}{\sqrt{\mathop{var}(\hat{\theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $\frac{\alpha}{2}$ point of the standard normal distribution yields \begin{align*} P\left(\frac{|\hat{\theta}-\theta|}{\sqrt{\mathop{var}(\hat{\theta})}}>z\right) =P\left(|\hat{\theta}-\theta|>z\sqrt{\mathop{var}(\hat{\theta})}\right)=\alpha \end{align*}

Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.

Observe that the variance of the estimator $\hat{\theta}$ decreases with an increasing sample size $n$, so that the inequality above will be satisfied if we can choose $n$ large enough to make \begin{align*} z\sqrt{\mathop{var}(\hat{\theta})}\leq d\tag{3} \end{align*}

These are the relevant parameters to deal with accuracy. Next we consider

Sample size $n$ for estimating a proportion:

To obtain an estimator $\hat{p}$ having probability at least $1-\alpha$ of being no farther than $d$ from the population proportion, the sample size formula based on the normal approximation gives, according to (1) and (3), \begin{align*} \mathop{var}(\hat{p})&=\frac{d^2}{z^2}\\ \frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}&=\frac{d^2}{z^2} \end{align*} Setting $n_0=\frac{z^2}{d^2}p(1-p)$, we obtain \begin{align*} n=\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}\tag{4} \end{align*}

Note that the formula depends on the unknown population proportion $p$. If no estimate of $p$ is available, the worst-case value $p=\frac{1}{2}$ can be used in determining the sample size. This approach is justified since the quantity $p(1-p)$, and hence the value of $n$, assumes its maximum when $p=\frac{1}{2}$.

Note: When $N$ is large compared with the sample size $n$, formula (4) reduces to

\begin{align*} n&\simeq \lim\limits_{N\rightarrow \infty}\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}=n_0 \end{align*} Since then $n\simeq n_0$, we obtain \begin{align*} n=\frac{z^2}{d^2}p(1-p)\tag{5} \end{align*} and we see, in accordance with OP's lecturer, that when the sample size $n$ is small compared with the population size, the accuracy $d$ depends on the sample size only.
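The finite-population sample size formula above and its large-$N$ limit $n_0$ can be sketched in a few lines of Python. The function name `sample_size` and the example inputs ($d=0.05$, $\alpha=0.05$, worst case $p=0.5$) are assumptions chosen for illustration.

```python
from statistics import NormalDist


def sample_size(N: int, d: float, alpha: float = 0.05, p: float = 0.5):
    """Return (n, n0): the finite-population sample size and its large-N limit.

    z is the upper alpha/2 point of the standard normal distribution.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z**2 * p * (1 - p) / d**2
    n = 1 / ((N - 1) / N / n0 + 1 / N)
    return n, n0


# Illustrative inputs: margin d = 0.05 at 95% confidence, worst case p = 0.5.
for N in (1_000, 100_000, 10**9):
    n, n0 = sample_size(N, d=0.05)
    print(f"N={N:>13,}  n={n:8.2f}  n0={n0:8.2f}")
```

In practice one would round $n$ up to the next integer. As $N$ grows, $n$ approaches $n_0$ from below, which is the limit taken above.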

With regard to one of OP's questions: I'm not aware of a specific term for this circumstance, but the factor $\frac{N-n}{N-1}$ in (1) is sometimes called the finite population correction.

Which scenario is more accurate:

To answer this question we now transform (3) to obtain the difference $d$

\begin{align*} d=z\sqrt{\frac{N-n}{(N-1)n}p(1-p)} \end{align*}

Requiring the estimate to lie within $d$ of the true proportion with probability $0.95$ ($\alpha=0.05$, so $z\simeq 1.96$) and taking the worst-case proportion $p=0.5$, we obtain the formula

\begin{align*} d=1.96\sqrt{\frac{N-n}{(N-1)n}\cdot\frac{1}{2}\cdot\frac{1}{2}}=0.98\sqrt{\frac{N-n}{(N-1)n}} \end{align*}

We observe in case 1: $N=1000, n=100$ \begin{align*} d=0.98\sqrt{\frac{900}{999\cdot100}}\simeq 0.0930 \end{align*} and in case 2: $N=1000000, n=1000$ \begin{align*} d=0.98\sqrt{\frac{999000}{999999\cdot1000}}\simeq 0.0310 \end{align*}

and conclude that the accuracy of case 2 is greater than that of case 1, provided the interpretation follows the modeling above.
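The two cases can be recomputed with a short Python sketch of the formula for $d$. The function name is my own, and for case 2 I assume a population of $N=1{,}000{,}000$, which is what the terms $999000$ and $999999$ in the displayed fraction imply.

```python
from math import sqrt


def margin_of_error(N: int, n: int, z: float = 1.96, p: float = 0.5) -> float:
    """d = z * sqrt((N - n) / ((N - 1) * n) * p * (1 - p))."""
    return z * sqrt((N - n) / ((N - 1) * n) * p * (1 - p))


d1 = margin_of_error(1_000, 100)        # case 1
d2 = margin_of_error(1_000_000, 1_000)  # case 2 (N assumed to be 10^6)
print(f"case 1: d = {d1:.4f}")
print(f"case 2: d = {d2:.4f}")
```

Case 2 gives the smaller $d$: the tenfold larger sample tightens the estimate, while the population size has only a minor effect through the correction factor.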

Note: This answer is mostly based upon Sampling, chapter 5: Estimating Proportions, Ratios and Subpopulation Means by Steven K. Thompson.

[Math] The distribution of sample proportion for given population proportion and sample size

Since $np > 10$, we can apply the Central Limit Theorem and use the formula $\mathbb{P}\!\left(p < \hat{p}\right)= \mathbb{P}\!\left(Z<\frac{p-\hat{p}}{s}\right)$ where $s = \sqrt{\frac{p(1-p)}{n}}$ and $Z \sim N(0,1)$.

My answer came out to be .6052 (I didn't use a standard normal table).
