Solved – How to deal with Z-score greater than 3

normal distribution

In a standard normal distribution how do I deal with a $Z$ value greater than 3?

I know that the z-score ranges from -3 to 3

Consider this one …

mean = 70, standard deviation = 4

I need to find $P(65 < X < 85)$.

Transforming to standard normal gives $P(-1.25 < Z < 3.75)$

How to deal with the $3.75$?

Edit: Actually it's not greater than 3, $z < 3.75$. I meant smaller than, sorry. Should I assume it's just 0.4990 or what?

Best Answer

Let me repeat (and correct) what I've said in my comment and reply to your edit.

You have to transform from $X$ to $Z$ in order to use a z-score table. Since a z-score table contains only a small finite subset of values, you often must settle for an approximation. So you could also settle for $P(Z<-3)\approx 0$ and $P(Z< 3)\approx 1$ (NB: $P(Z>3)\approx 1$ was a typo, sorry.)

As to $P(-1.25<Z<3.75)$, using a z-score table: $$P(-1.25<Z<3.75)=P(Z<3.75)-P(Z<-1.25)\approx 1-0.1056=0.8944$$
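If software is available, the table lookup can be skipped altogether. The following is a minimal sketch in R, assuming only the base `pnorm` function; the variable names are mine and purely illustrative. It evaluates the same probability without rounding $P(Z<3.75)$ up to 1, and shows how small the rounding error is.

```r
# Exact standard normal CDF values instead of table lookups
p_upper <- pnorm(3.75)    # P(Z < 3.75), very close to 1
p_lower <- pnorm(-1.25)   # P(Z < -1.25), about 0.1056
p_z <- p_upper - p_lower  # P(-1.25 < Z < 3.75)

# Same probability computed directly on the original scale, X ~ N(70, 4^2)
p_x <- pnorm(85, mean = 70, sd = 4) - pnorm(65, mean = 70, sd = 4)

# Error introduced by treating P(Z < 3.75) as exactly 1
approx_error <- (1 - p_lower) - p_z
```

Both `p_z` and `p_x` agree with the table-based answer to about three decimal places, so treating values beyond the end of the table as 0 or 1 is harmless here.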

Related Solutions

Solved – How to Calculate a Z-Score from Power Log Distributions

It doesn't matter so much that the Z score is often compared to a symmetrical normal distribution. The key thing about your proposed approach is that it will give you a positive value when the partner has an "above mean" (common sense term = "above average") number of repeat visits, pages per visit, or time per page. So long as you are aware that this is what it is doing, it's not necessarily a bad approach.

You might want to consider alternative cut-off points. For example, the median of each of these variables is likely to be lower than the mean; if you used this as the cut-off point instead you would be getting the best half of partners against each criterion. However, any cut-off point is arbitrary and its use depends on whether the results are practicable (i.e. does it give you a reasonable number of partners to use).

So the short answer to your question is: there is no problem with using these Z scores even when the underlying distribution is skewed. Just be aware that a positive Z score means that partner has a value higher than the mean for that variable, nothing else. And the mean is susceptible to outliers, i.e. a single partner with a squillion repeat visits can push the mean so high that only that partner makes your list. So watch out for that problem and consider using another cut-off (median, or 75th percentile) instead. Ultimately, the answer depends on your business drivers.
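To make the outlier point concrete, here is a small sketch in R on simulated data. The log-normal `visits` vector is a hypothetical stand-in for repeat visits per partner, not real data, and the single extreme value of `1e6` is chosen only for illustration.

```r
set.seed(1)
# Hypothetical, right-skewed "repeat visits" for 200 partners (simulated)
visits <- rlnorm(200, meanlog = 2, sdlog = 1)

# Z-scores relative to the sample mean and standard deviation
z <- (visits - mean(visits)) / sd(visits)

# Share of partners with a positive z-score, i.e. above the mean
share_above_mean <- mean(z > 0)
# Share above the median is one half by construction
share_above_median <- mean(visits > median(visits))

# Add one extreme partner: the mean jumps and the "above mean" list collapses
visits_out <- c(visits, 1e6)
n_above_mean_out <- sum(visits_out > mean(visits_out))
```

With skewed data fewer than half of the partners sit above the mean, and after adding the extreme partner only that one remains above it, whereas a median or percentile cut-off barely moves.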

The next step up in analytical techniques is to find a single criterion against which to rank partners, which somehow takes into account all three of the variables you are interested in. A common naive way to go about this is to take averages of standardised scores; more sophisticated alternatives are to use principal components analysis or factor analysis. But this takes us away from the actual question.
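For what it's worth, both the naive average and the principal components version can be sketched in a few lines of R. The data frame below is simulated and its column names are hypothetical; `scale`, `rowMeans` and `prcomp` are base R functions.

```r
set.seed(42)
# Hypothetical partner metrics, simulated for illustration only
partners <- data.frame(
  repeat_visits   = rlnorm(50, meanlog = 2, sdlog = 1),
  pages_per_visit = rlnorm(50, meanlog = 1, sdlog = 0.5),
  time_per_page   = rlnorm(50, meanlog = 3, sdlog = 0.7)
)

# Naive composite: average of the column-wise standardised scores
z <- scale(partners)
naive_score <- rowMeans(z)

# Alternative composite: score on the first principal component
# (its sign is arbitrary, so check the loadings before ranking on it)
pca <- prcomp(partners, center = TRUE, scale. = TRUE)
pc1_score <- pca$x[, 1]

# Partners ranked from best to worst on the naive composite
ranking <- order(naive_score, decreasing = TRUE)
```

The naive score weights the three variables equally, while the first principal component weights them by how strongly they vary together; which is more appropriate is a judgment call.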

My most important tip: use graphical techniques, particularly scatterplots showing two variables at a time with a point representing each partner, ideally with the more interesting points neatly labelled for you (the number one feature Excel lacks, unfortunately). A "scatterplot matrix" is a handy technique if you have the software to do it easily.
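As a rough sketch of what that could look like in base R graphics: the partner data are again simulated, and the `P1`, `P2`, ... labels are made up. `pairs` draws the scatterplot matrix and `text` adds labels to selected points.

```r
set.seed(7)
# Hypothetical partner metrics, simulated for illustration only
partners <- data.frame(
  repeat_visits   = rlnorm(40, meanlog = 2, sdlog = 1),
  pages_per_visit = rlnorm(40, meanlog = 1, sdlog = 0.5),
  time_per_page   = rlnorm(40, meanlog = 3, sdlog = 0.7)
)
rownames(partners) <- paste0("P", seq_len(nrow(partners)))

# Scatterplot matrix of all pairs; logged because the variables are skewed
pairs(log(partners), main = "Partner metrics (log scale)")

# Single scatterplot, labelling the three partners with the most repeat visits
top <- order(partners$repeat_visits, decreasing = TRUE)[1:3]
plot(partners$repeat_visits, partners$pages_per_visit, log = "xy",
     xlab = "Repeat visits", ylab = "Pages per visit")
text(partners$repeat_visits[top], partners$pages_per_visit[top],
     labels = rownames(partners)[top], pos = 3)
```

Which points count as "interesting" is up to you; the top three by repeat visits is just one possible choice.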

Solved – Finding probability that the total of values in a sample is greater than a particular value

The question assumes $X_i \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{N}\left(675, 21^2\right)$ and asks you about the distribution of $Y \equiv \sum_{i=1}^{30} X_i$. In particular, you'd like to know $\Pr\left[Y < 20000\right]$.

Under the iid and normality assumptions, we know that $Y \sim \mathcal{N}\left(30 \cdot 675, 30 \cdot 21^2\right)$, so $\Pr\left[Y < 20000\right] \approx 0.0149$.

I got that number in R using

pnorm(20000, mean=20250, sd=sqrt(30) * 21)  # 0.0149
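The same number can be checked by hand by standardizing $Y$, where $\Phi$ denotes the standard normal CDF:

$$z = \frac{20000 - 30 \cdot 675}{21\sqrt{30}} = \frac{-250}{115.02} \approx -2.174, \qquad \Pr\left[Y < 20000\right] = \Phi(-2.174) \approx 0.0149$$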

The rest of your question -- the bit where you wonder "what could possibly be meant that we do not need the normal distribution information to answer the question" -- is about the central limit theorem: what if the $X_i$ were still iid, with the same mean and variance as before, but non-normally distributed? How would their sum $Y$ be distributed? The CLT tells you that, as $n$ goes to infinity, their suitably standardized sum converges in distribution to a normal, provided the variance of the $X_i$ is finite. Look up the Lindeberg–Lévy CLT.
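For reference, the Lindeberg–Lévy statement for iid $X_i$ with mean $\mu$ and finite variance $\sigma^2 > 0$ (the symbols $\mu$ and $\sigma$ are my notation here) is

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty,$$

so for large $n$ the sum is approximately $\mathcal{N}\left(n\mu, n\sigma^2\right)$. With $n = 30$, $\mu = 675$ and $\sigma = 21$ that is $\mathcal{N}\left(20250, 30 \cdot 21^2\right)$, the same distribution used above for the normal case.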

In this case $n$ equals 30, but it turns out that's already large enough for the normal approximation to be useful. Here are some examples in R:

simulate_normal <- function(n_bags=30, cutoff=20000) {
    ## TRUE if the total of n_bags normal draws falls below the cutoff
    return(sum(rnorm(n_bags, mean=675, sd=21)) < cutoff)
}
mean(replicate(10^5, simulate_normal()))  # Around 0.0149 -- here the X_i are normal

simulate_uniform <- function(n_bags=30, cutoff=20000) {
    ## Uniform[a, b] has variance (b-a)^2 / 12
    width <- sqrt(21^2 * 12)
    return(sum(runif(n_bags, min=675 - width/2, max=675 + width/2)) < cutoff)
}
mean(replicate(10^5, simulate_uniform()))  # Still around 0.0149 when the X_i are uniform

Even when you let $X_i \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{U}\left[675-36.37307, 675+36.37307\right]$, the answer using the normal approximation is nearly correct.

I chose those parameters for the uniform distribution so that it would have mean 675 and variance $21^2$:

sd(runif(10^5, min=675 - 36.37307, max=675 + 36.37307))  # Around 21
