Probability distribution for a proportion based on (continuous) quantities

beta distribution, binomial distribution, distributions, estimation, proportion

I have a problem related to probability distributions and parameter estimation, which comes from a real case. I would be very grateful if you could help me.

Let us suppose that we have a continuous amount $M$ of a given product in which the proportion of a certain target component is $p$, where $p$ is supposed to be fixed and unknown. For example, $M=300$ kilograms of product, and $p$ is the proportion of water in it.

We can assume that the target component is sort of randomly distributed within the product.

We are interested in estimating $p$. With that purpose in mind, we randomly take a sample of a fixed and known amount $m$ of product. The value of $m$ is quite small compared with $M$, but I do not really know whether or not it is small enough to actually assume that $M$ is somehow equivalent to $+\infty$.

In our real case, we randomly take $m=15$ kilograms of product within an amount of $M=300$ kg of product.

We measure the amount of the target component in these $m$ units of product, and we calculate an estimate of $p$ as $\hat{p}= \text{Quantity of target component in the sample}/m$.

Ideally, $\hat{p}$ would always be equal to $p$ (for instance, if we were dealing with a perfectly homogeneous liquid or low-density product; this statement is probably not perfectly expressed, as I am not a chemist). However, in practice, when the product is solid and composed of different solid elements with different weights, sizes and so on, but can still be considered continuous (that is, it is not measured in discrete units), $\hat{p}$ is not necessarily equal to $p$. I hope that is clear enough.

My question is: In this context, what is the probability distribution of $\hat{p}$?

Or, equivalently, how can I calculate the probability $\Pr(\hat{p}\le c \;|\; p)$?

If this were a discrete case, I know that $p$ would be the success probability of a binomial distribution (assuming $M$ is large enough). But I do not know how to deal with this continuous proportion case. Could you please give me some help?

EDIT [April 27th, 2015]:

As I've already said in the comments below, the point is: what (from a physical/technical point of view) prevents the observed proportion $\hat{p}$ in the sample of $m$ kilograms from always being equal to the real one, $p$, and how does it do so? I do not have a clear answer to that.

The concrete context in which this problem arises is the following: we have a huge amount of paper or paperboard (in relatively small pieces) that has been selected from urban waste. That amount of paper contains foreign elements that cannot actually be treated as paper or paperboard.

We select 15 kg from that big package of paper and we measure the amount of foreign elements in the sample. We use this to estimate the real proportion of foreign elements in the big package. The selection of those 15 kg is made as random as possible.

Of course, there are several sources of variability in this process, both in the waste separation process (the process that recovers paper from the urban waste) and in the sample selection. That is why I am not addressing here any data modelling problem, but just looking for a reasonable way to theoretically determine the sampling distribution of $\hat{p}$.

Even if the selection of the 15 kg were executed perfectly at random, the foreign elements are distributed in such a way that not every sampled amount of 15 kg contains exactly the same amount of them. Why? I'll try to think about this.


EDIT:

According to the description of the 'proportion' tag on this site, my question could be related to the beta distribution. However, I am still not sure whether the situation I have described meets the assumptions of the beta model, if any.

EDIT:

Should I actually post this question on https://math.stackexchange.com/ instead?


EDIT [April 24th, 2015]:

Based on a comment by @sesqu in this other thread, I deduced that

$\hat{p} \sim \mathrm{Beta}(m p + 1, m(1-p)+1)$,

where (just to summarize)

  • $p$ is the real (fixed and unknown) proportion of a certain target component in a given infinite amount of product,

  • $m$ is the sample quantity of product that we randomly extract from the total amount of product in order to estimate $p$,

  • $\hat{p}$ is an estimate of the real proportion $p$, calculated as $\hat{p}= \text{Quantity of target component in the sample}/m$.

Does it make sense?

EDIT [April 26th, 2015]:

As @Scortchi pointed out in a comment to this post, the previous formula does not seem to make sense, as it depends on the units in which $m$ is measured.
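The unit dependence can be seen numerically. The following is a minimal sketch, assuming a hypothetical true proportion `p = 0.10` (not a value from the real data) and a helper `beta_mean_sd` introduced here only for illustration. It evaluates the mean and standard deviation of $\mathrm{Beta}(mp+1, m(1-p)+1)$ for the same physical sample expressed in kilograms and in grams.

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

p = 0.10  # hypothetical true proportion, for illustration only

# The same physical sample (15 kg), expressed in two different units.
for label, m in [("kilograms", 15.0), ("grams", 15000.0)]:
    mean, sd = beta_mean_sd(m * p + 1, m * (1 - p) + 1)
    print(f"m = {m:g} {label}: mean = {mean:.4f}, sd = {sd:.4f}")
```

The two lines describe the same 15 kg of product, yet the implied spread of $\hat{p}$ shrinks sharply when $m$ is written in grams, which is the inconsistency noted above.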


EDIT [April 26th, 2015]:

Although I obviously have real data, I would like to point out that they are very likely to come from a mixture of populations. This study of the mixture still has to be done, maybe with ANOVA. But, regardless of the results that may arise from the ANOVA, the fact is that the real proportion $p$ is quite unstable. Therefore, it is very difficult and unreliable to fit a distribution based on our data.

That is why I want to try a different approach, based only on calculating some sort of control limits from probability theory.

I thought that it was possible to deduce the sampling distribution of $\hat{p}$, assuming $p$ is constant, and I thought it could be somehow related to the beta distribution, as I see my case as a kind of generalization of a discrete proportion (binomial). That is the reason for my question.

I've read all the comments in this thread so far, but I'm still waiting for more contributions.


EDIT [April 27th, 2015]:

As far as I have understood from different sources, the beta distribution is mostly used to model the behaviour of the probability of a certain event using prior experience or knowledge (for instance, I liked the explanation in "What is the intuition behind beta distribution?" very much). It is also related to the Bayesian approach, in the sense that the probability being studied and modelled is the underlying $p$, the one in the population.

I think my question is a little bit different, in the sense that I am considering the population target probability $p$ as a constant (classical statistics approach) and looking to model the sampling distribution of the statistic $\hat{p}$.

I do not mean that I now think that beta is not the solution; I just mean that maybe my problem is not related to the typical use of the beta distribution.

Also, as @Wolfgang pointed out, I think my problem has to do with compositional data. But how does that help?

Going back to the beta distribution, the only new idea I have about how to model the probability distribution of $\hat{p}$ (in case we suppose it is beta-distributed) is to assume that the mode (or maybe the average) of it will be equal to $p$…

As you can see, I am still looking for a way to theoretically deduce the sampling distribution of $\hat{p}$, just as it is done with other statistics such as $\bar{x}$ and so on in classical statistics.


EDIT [April 27th, 2015]:

I've been thinking again about the possibility of treating this as a binomial distribution, in the sense that I am somehow counting how many units are correct within a (randomly and independently selected) set of $n$ units, where each unit has the same probability $p$ of being correct, but with an uncountably infinite number of sampled units.

I have also discovered that the CDF of a binomial distribution can be expressed as the regularized incomplete beta function. More precisely, if I am not wrong,

$F_X(x \;|\; n,p)= I_{(1-p)}(n-x,1+x)$,

where $F_X$ stands for the CDF of a binomial variable with parameters $n$ and $p$, and $I_z(a,b)$ represents the regularized incomplete beta function.
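The identity can be checked numerically. Below is a minimal sketch using only the Python standard library; the helper names `binom_cdf` and `reg_inc_beta` and the values `n = 20`, `p = 0.3` are illustrative assumptions. The incomplete beta function is approximated with Simpson's rule, which is adequate for the integer arguments used here but is not a general-purpose routine.

```python
import math

def binom_cdf(x, n, p):
    """Pr(X <= x) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

def reg_inc_beta(z, a, b, steps=20000):
    """Regularized incomplete beta I_z(a, b) via Simpson's rule.

    Assumes a >= 1 and b >= 1 so the integrand is bounded on [0, z];
    steps must be even.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = z / steps

    def f(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = f(0.0) + f(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return math.exp(log_norm) * total * h / 3

n, p = 20, 0.3  # illustrative values only
# x = n is excluded because it would give a first argument of 0.
for x in range(n):
    lhs = binom_cdf(x, n, p)
    rhs = reg_inc_beta(1 - p, n - x, x + 1)
    assert abs(lhs - rhs) < 1e-6
```

The loop passes silently if both sides agree to within the stated tolerance for every $x$ from $0$ to $n-1$.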

Would it make sense to try to calculate or estimate $\lim_{n \rightarrow +\infty}{F_X(x \;|\; n,p)}$, that is, $\lim_{n \rightarrow +\infty}{I_{(1-p)}(n-x,1+x)}$?

EDIT [April 27th, 2015]:

I have plotted $I_{(1-p)}(n(1-c),1+nc)$ for high values of $n$ and it tends to what was easy to guess: a CDF in which 100% of the probability is concentrated at $c=p$. This means that as the sample size $n$ grows without bound (that is, approaches the size of the whole population), the observed proportion $\hat{p}$ tends to equal $p$. Sorry for suggesting this.
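A short worked bound makes this limit explicit. It is only a sketch under the binomial assumption above, with $\hat{p}=X/n$ and $X \sim \mathrm{Bin}(n,p)$; the tolerance $\varepsilon$ is a symbol introduced here. Since

$\mathrm{E}[\hat{p}] = p, \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n},$

Chebyshev's inequality gives, for any $\varepsilon > 0$,

$\Pr(|\hat{p}-p| \ge \varepsilon) \le \frac{p(1-p)}{n\,\varepsilon^{2}} \longrightarrow 0 \quad \text{as } n \rightarrow +\infty,$

so the CDF of $\hat{p}$ converges to a step at $c=p$. Letting the number of units grow without bound therefore removes all sampling variability, which suggests that any spread in $\hat{p}$ for a fixed mass $m$ would have to come from a finite effective number of units in the sample.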

Best Answer

If the water content is homogeneous in the 300 kg product, then there is no variance and the measured water content applies to the whole 300 kg product. If the water content is not homogeneous, a single 15 kg sample taken from one place tells you nothing about the variance over the entire product.

If the distribution of water is random across the product, you could take multiple samples, say 15 samples of 1 kg each ($n=15$), from different parts of the product, which we now treat as 300 portions of 1 kg. Measure the percent water in each sample, compute the mean and standard deviation $s$, and compute the standard deviation of the sampling distribution (the standard error of the mean) as $(s/\sqrt{n}) \times \text{FPF}$, where FPF, the finite population factor, is $\sqrt{(N-n)/(N-1)}$ and $N$, the finite population size, is 300 chunks of 1 kg.
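The calculation described above can be sketched as follows. The water percentages are made-up numbers for illustration only, not data from the question, and the variable names are assumptions of this sketch.

```python
import math
import statistics

# Hypothetical water percentages from n = 15 one-kilogram samples
# (invented values for illustration; not real measurements).
water_pct = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7,
             12.0, 12.5, 11.6, 12.9, 12.2, 11.8, 12.4]

N = 300              # finite population size: 300 portions of 1 kg
n = len(water_pct)   # number of sampled portions

mean = statistics.mean(water_pct)
s = statistics.stdev(water_pct)        # sample standard deviation (n - 1 denominator)
fpf = math.sqrt((N - n) / (N - 1))     # finite population factor
se = s / math.sqrt(n) * fpf            # standard deviation of the sampling distribution

print(f"mean = {mean:.3f}, s = {s:.3f}, FPF = {fpf:.4f}, SE = {se:.4f}")
```

With $n=15$ out of $N=300$ the factor is close to 1, so the correction is small; it matters more as the sampled fraction grows.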

If the water content is not homogeneous but patterned, as in fat in a hog carcass, then the mean water content can be estimated from the water content of a single sample taken from a specific location and the known pattern.
