[Math] How is salary a binomial distribution

binomial distributionconditional-expectationexpectationnormal distributionprobability

In an online course here, the author presents a problem that salary is normally distributed, provides mean and variance.

As per my understanding, when no of Bernoulli trials are sufficiently large we could approximate binomial distribution to normal distribution.

How can Salary be normally distributed? Or with lesser samples/employees, how is it binomial in nature? I could not imagine how Bernoulli trial (which can have only 2 outcomes like H/T fits here)

Best Answer

Normal distributions can arise in other ways than as the limit of a binomial. In classes we are very prone to assume a normal distribution for some quantity because we have lots of theorems and z score tables that work with it. You should ignore the word salary and think "a random variable with given mean and variance" and prove or compute what you are asked for.

Salaries in particular do not follow a normal distribution. First, every normal distribution has some support below zero, but negative salaries are not realistic. Second, the tails are badly asymmetric. There is a small tail extending a huge number of standard deviations above the mean with many more events than the normal distribution predicts.

Preliminary (TL;DR)

Background

In his 1991 publication, Norman C. Beaulieu answered your question w/ what he dubbed, the generalized multinomial distribution (GMD). My explanation will focus on the GMD's utility.

Notation

# categories $= c$.
# trials $= t$.
Random vector $= X = \left[\begin{array}{cccc}X_1&X_2&\cdots&X_c\end{array}\right]^T$.
Category responses after $t$ trials vector $= x = \left[\begin{array}{cccc}x_1&x_2&\cdots&x_c\end{array}\right]^T$.
- $\sum_{k = 1}^c x_k = t$.
Probability of category response during trial matrix $= p = \left[\begin{array}{cccc} p_{1,1} & p_{1,2} & \cdots & p_{1,c} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ p_{t,1} & p_{t,2} & \cdots & p_{t,c} \end{array}\right]$.
Pmf of $X = P\left[X = x\right]$.
$[c] = \left\{1, 2, \cdots, c\right\}$.
Multiset of $[c] = ([c], m) = \left\{1^{m(1)}, 2^{m(2)}, \cdots, c^{m(c)}\right\}$.
- $m(i) = x_i$.
Permutations of $([c], m) = \mathfrak{S}_{([c], m)}$.
- $card\left(\mathfrak{S}_{([c], m)}\right) = \left(m(1), m(2), \cdots, m(c)\right)!$.

Pmf of GMD

$$P\left[X = x\right] = \sum_{\mathfrak{s} \in \mathfrak{S}_{([c], m)}} \left\{\prod_{k = 1}^t \left\{p_{k,\mathfrak{s}_k}\right\}\right\}$$

So far, I've identified it as being the superclass of 7 distributions! Namely...

Bernoulli distribution.
Uniform distribution.
Categorical distribution.
Binomial distribution.
Multinomial distribution.
Poisson's binomial distribution.
Generalized multinomial distribution (if your definition of superclass allows self-inclusion).

Examples

Games

g1: A 2 sided die is simulated using a fair standard die by assigning faces w/ pips 1 through 3 & 4 through 6 to sides 1 & 2, respectively. The die is biased by etching micro holes into faces w/ pips 1 through 3 s.t. $p_1 = 12/30$ & $p_2 = 18/30$. The 2 sided die is tossed 1 time & the category responses are recorded.
g2: Same as g1, accept w/ ideal standard die, i.e., $p_1 = p_2 = \cdots = p_6 = 5/30$.
g3: Same as g1, accept w/ standard die, i.e., $p_1 = p_2 = p_3 = 4/30$ & $p_4 = p_5 = p_6 = 6/30$.
g4: Same as g1, accept die is tossed 7 times.
g5: Same as g3, accept die is tossed 7 times.
g6: Same as g4, accept the micro holes are filled w/ $0.07$ kg of a material, which evaporates @ $0.01$ kg/s upon being sprayed w/ an activator, s.t. $p_1 = p_2 = 15/30$ for the 1st toss. Immediately after being sprayed, category responses are recorded every second.
g7: Same as g6, accept w/ standard die, i.e., $p_1 = p_2 = \cdots = p_6 = 5/30$ for the 1st toss.

Questions

q1: Find pmf & evaluate when $x = \left[\begin{array}{cc}0&1\end{array}\right]^T$.
q2: Find pmf & evaluate when $x = \left[\begin{array}{cccccc}0&1&0&0&0&0\end{array}\right]^T$.
q3: q2.
q4: Find pmf & evaluate when $x = \left[\begin{array}{cc}2&5\end{array}\right]^T$.
q5: Find pmf & evaluate when $x = \left[\begin{array}{cccccc}0&2&1&1&0&3\end{array}\right]^T$.
q6: q4.
q7: q5.

Answers w/o knowledge of GMD

a1: $X$ ~ Bernoulli distribution.
- $P\left[X = x\right] = t!\prod_{k = 1}^c \frac{p_k^k}{k!} = 1!\prod_{k = 1}^2 \frac{p_k^k}{k!} = \frac{1!(12/30)^0(18/30)^1}{0!1!}$
  $\Longrightarrow P\left[X = x\right] = 3/5$.
a2: $X$ ~ Uniform distribution.
- $P\left[X = x\right] = t!\prod_{k = 1}^c \frac{p_k^k}{k!} = 1!\prod_{k = 1}^6 \frac{p_k^k}{k!} = \frac{1!(5/30)^{0 + 1 + 0 + 0 + 0 + 0}}{0!1!0!0!0!0!}$
  $\Longrightarrow P\left[X = x\right] = 1/6$.
a3: $X$ ~ Categorical distribution.
- $P\left[X = x\right] = t!\prod_{k = 1}^c \frac{p_k^k}{k!} = 1!\prod_{k = 1}^6 \frac{p_k^k}{k!} = \frac{1!(4/30)^{0 + 1 + 0}(6/30)^{0 + 0 + 0}}{0!1!0!0!0!0!}$
  $\Longrightarrow P\left[X = x\right] = 2/15$.
a4: $X$ ~ Binomial distribution.
- $P\left[X = x\right] = t!\prod_{k = 1}^c \frac{p_k^k}{k!} = 7!\prod_{k = 1}^2 \frac{p_k^k}{k!} = \frac{7!(12/30)^2(18/30)^5}{2!5!}$
  $\Longrightarrow P\left[X = x\right] = 20412/78125$.
a5: $X$ ~ Multinomial distribution.
- $P\left[X = x\right] = t!\prod_{k = 1}^c \frac{p_k^k}{k!} = 7!\prod_{k = 1}^6 \frac{p_k^k}{k!} = \frac{7!(4/30)^{0 + 2 + 1}(6/30)^{1 + 0 + 3}}{0!2!1!1!0!3!}$
  $\Longrightarrow P\left[X = x\right] = 224/140625$.
a6: $X$ ~ Poisson's binomial distribution.
- $P\left[\left[\begin{array}{cc}X_1&X_2\end{array}\right]^T = \left[\begin{array}{cc}x_1&x_2\end{array}\right]^T\right] = P\left[X_1 = x_1, X_2 = x_2\right] = P\left[X_1 = x_1\right] = P\left[X_2 = x_2\right]$.
- $p_1$ & $p_2$ are vectors now: $p_1 = \left[\begin{array}{cccc}p_{1_1}&p_{1_2}&\cdots&p_{1_t}\end{array}\right]^T, p_2 = \left[\begin{array}{cccc}p_{2_1}&p_{2_2}&\cdots&p_{2_t}\end{array}\right]^T$.
- $P\left[X_2 = x_2\right] = \frac{1}{t + 1}\sum_{i = 0}^t \left\{\exp\left(\frac{-j2\pi i x_2}{t + 1}\right) \prod_{k = 1}^t \left\{p_{2_k}\left(\exp\left(\frac{j2\pi i}{t + 1}\right) - 1\right) + 1\right\}\right\}$
  $= \frac{1}{8}\sum_{i = 0}^7 \left\{\exp\left(\frac{-j5\pi i}{4}\right) \prod_{k = 1}^7 \left\{\left(\frac{0.5k + 14.5}{30}\right)\left(\exp\left(\frac{j\pi i}{4}\right) - 1\right) + 1\right\}\right\}$
  $\Longrightarrow P\left[X_2 = 5\right] = 308327/1440000$.
a7: $X$ ~ Generalized multinomial distribution.
- ???

Answers w/ Knowledge of GMD

a1: $X$ ~ Bernoulli distribution.
- $p = \left[\begin{array}{c}\frac{12}{30}&\frac{18}{30}\end{array}\right]$.
- $\mathfrak{S}_{([2], m)} = \left\{\left(2\right)\right\}$.
a2: $X$ ~ Uniform distribution.
- $p = \left[\begin{array}{c}\frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30}\end{array}\right]$.
- $\mathfrak{S}_{([6], m)} = \left\{\left(2\right)\right\}$.
a3: $X$ ~ Categorical distribution.
- $p = \left[\begin{array}{c}\frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30}\end{array}\right]$.
- $\mathfrak{S}_{([6], m)} = \left\{\left(2\right)\right\}$.
a4: $X$ ~ Binomial distribution.
- $p = \left[\begin{array}{cc} \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \\ \frac{12}{30}&\frac{18}{30} \end{array}\right]$.
- $\mathfrak{S}_{([2], m)} = \left\{\left(1,1,2,2,2,2,2\right), \ldots, \left(2,2,2,2,2,1,1\right)\right\}$.
a5: $X$ ~ Multinomial distribution.
- $p = \left[\begin{array}{cccccc} \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \end{array}\right]$.
- $\mathfrak{S}_{([6], m)} = \left\{\left(2,2,3,4,6,6,6\right), \ldots, \left(6,6,6,4,3,2,2\right)\right\}$.
a6: $X$ ~ Poisson's binomial distribution.
- $p = \left[\begin{array}{cc} \frac{15}{30}&\frac{15}{30} \\ \frac{14.5}{30}&\frac{15.5}{30} \\ \frac{14}{30}&\frac{16}{30} \\ \frac{13.5}{30}&\frac{16.5}{30} \\ \frac{13}{30}&\frac{17}{30} \\ \frac{12.5}{30}&\frac{17.5}{30} \\ \frac{12}{30}&\frac{18}{30} \end{array}\right]$.
- $\mathfrak{S}_{([2], m)} = \left\{\left(1,1,2,2,2,2,2\right), \ldots, \left(2,2,2,2,2,1,1\right)\right\}$.
a7: $X$ ~ Generalized multinomial distribution.
- $p = \left[\begin{array}{cccccc} \frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30}&\frac{5}{30} \\ \frac{4.8\overline{3}}{30}&\frac{4.8\overline{3}}{30}&\frac{4.8\overline{3}}{30} &\frac{5.1\overline{6}}{30}&\frac{5.1\overline{6}}{30}&\frac{5.1\overline{6}}{30} \\ \frac{4.\overline{6}}{30}&\frac{4.\overline{6}}{30}&\frac{4.\overline{6}}{30} &\frac{5.\overline{3}}{30}&\frac{5.\overline{3}}{30}&\frac{5.\overline{3}}{30} \\ \frac{4.5}{30}&\frac{4.5}{30}&\frac{4.5}{30} &\frac{5.5}{30}&\frac{5.5}{30}&\frac{5.5}{30} \\ \frac{4.\overline{3}}{30}&\frac{4.\overline{3}}{30}&\frac{4.\overline{3}}{30} &\frac{5.\overline{6}}{30}&\frac{5.\overline{6}}{30}&\frac{5.\overline{6}}{30} \\ \frac{4.1\overline{6}}{30}&\frac{4.1\overline{6}}{30}&\frac{4.1\overline{6}}{30} &\frac{5.8\overline{3}}{30}&\frac{5.8\overline{3}}{30}&\frac{5.8\overline{3}}{30} \\ \frac{4}{30}&\frac{4}{30}&\frac{4}{30}&\frac{6}{30}&\frac{6}{30}&\frac{6}{30} \end{array}\right]$.
- $\mathfrak{S}_{([6], m)} = \left\{\left(2,2,3,4,6,6,6\right), \ldots, \left(6,6,6,4,3,2,2\right)\right\}$.
- $P\left[X = x\right] = 59251/36905625$.

Final Words

I know my answer was very long (& went far beyond what OP asked for) but this had been flying around inside my head for quite some time & this q seemed like the most suitable landing strip.

I performed the last 6 calculations using the function gmdPmf (which I defined in Mathematica)...

(* GENERALIZED MULTINOMIAL DISTRIBUTION (GMD) *)
(* Note: mXn = # rows X # columns. *)
gmdPmf[
    x_ (* Responses of category j, after t trials have taken place. *),
    p_ (* Matrix (tXm) holds p_{trial i, category j} = P["Response of trial i is category j"]. *)
] := Module[{t, c, ⦋c⦌, allRPs, desiredRPs, count = 0, sum = 0, product = 1},
    t = Total[x]; (* # trials. *)
    c = Length[x]; (* # categories. *)
    ⦋c⦌ = Range[c]; (* Categories. *)
    allRPs = Tuples[⦋c⦌,t]; (* Matrix (c^tXt) holds all the response patterns given that t trials have occurred. *)
    desiredRPs = {}; (* Matrix ((x_1,x_2,...,x_c) !Xt) holds the desired response patterns; subset of allRPs wrt n. *)

    For[i = 1, i <= Length[allRPs], i++,
        For[j = 1, j <= c, j++, If[Count[allRPs[[i]],⦋c⦌[[j]]] == x[[j]], count++];];
        If[count == c, AppendTo[desiredRPs, allRPs[[i]]]];
        count = 0;
    ];

    For[i = 1, i <= Length[desiredRPs], i++, 
        For[j = 1, j <= t, j++, product *= (p[[j]][[desiredRPs[[i]][[j]]]]);];
        sum += product;
        product = 1;
    ];

    sum
];

(* ANSWERS *)
Print["a1: P[X = x] = ", gmdPmf[{0, 1}, {{12/30, 18/30}}], "."];
Print["a2: P[X = x] = ", gmdPmf[{0,1, 0, 0, 0, 0}, {{5/30, 5/30, 5/30, 5/30, 5/30, 5/30}}], "."];
Print["a3: P[X = x] = ", gmdPmf[{0,1, 0, 0, 0, 0}, {{4/30, 4/30, 4/30, 6/30, 6/30, 6/30}}], "."];
Print["a4: P[X = x] = ", gmdPmf[{2, 5}, ArrayFlatten[ConstantArray[{{12/30, 18/30}}, {7, 1}]]], "."];
Print["a5: P[X = x] = ", gmdPmf[{0, 2, 1, 1, 0, 3}, ArrayFlatten[ConstantArray[{{4/30, 4/30, 4/30, 6/30, 6/30, 6/30}}, {7, 1}]]], "."];
p = {}; For[i = 1, i <= 7, i++, l = ((31/2) - (1/2)*i)/30; r = ((29/2) + (1/2)*i)/30;  AppendTo[p,{l,r}];]; Print["a6: P[X = x] = ", gmdPmf[{2, 5}, p], "."];
p = {}; For[i = 1, i <= 7, i++, l = ((31/6) - (1/6)*i)/30; r = ((29/6) + (1/6)*i)/30;  AppendTo[p,{l,l,l,r,r,r}];]; Print["a7: P[X = x] = ", gmdPmf[{0, 2, 1, 1, 0, 3}, p], "."];

Clear[gmdPmf];

Please edit, if you know of any ways to make it shorter/faster. Congrats, if you made it to the end! (: