Statistical Test Choice – Should a T-Test or Z-Test Be Used?

Tags: inference, python, statsmodels, t-test, z-test

I have a scenario where I have an A/B test of a webpage. Using statistical analysis, I'm trying to determine whether people spent more time (in minutes, recorded as a float) on page B (the new page) than on page A (the old page).

I have formulated the below hypotheses:

Null Hypothesis: The mean time spent on the new page is equal to the mean time spent on the old page

Alternative Hypothesis: The mean time spent on the new page is greater than the mean time spent on the old page

The characteristics of my two samples (A and B) are:

  • Both samples come from independent populations
  • Both samples were randomly selected
  • Both samples have the same number of values (50)
  • The sample standard deviation can be calculated. I don't know the population standard deviation, and I don't know the size of the population.

I believe the statistical test to use is a two-sample z test for comparing means. I want to compute this in Python, and I know the Python statsmodels library has ztest, which can be imported as

from statsmodels.stats.weightstats import ztest

I looked at the statsmodels documentation page (https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html). For reference, these are the parameters the documentation says ztest takes:

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)

I'm unclear about the values that go in some of the parameters:

  1. Is x1 an array of all the appropriate values (in this case, the time samples collected for the new page, given how my null and alternative hypotheses are written)?

  2. Is x2 an array of all the appropriate values (in this case, the time samples collected for the old page, given how my null and alternative hypotheses are written)?

  3. What is value and how is it calculated? I read the documentation, but it was unclear to me.

I do know alternative will be "greater" based on how my alternative hypothesis is written.
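A minimal sketch of how the ztest parameters map onto these hypotheses. The arrays `time_new` and `time_old` are hypothetical simulated times, made up purely for illustration. Per the statsmodels documentation, in the two-sample case `value` is the difference `mean(x1) - mean(x2)` under the null hypothesis, so `value=0` encodes "no difference". Putting the new page in `x1` makes "new greater than old" the one-sided upper alternative. Whether a z test is appropriate here at all is taken up in the answer below.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Hypothetical data: 50 time-on-page values (minutes) per page, simulated for illustration only
rng = np.random.default_rng(42)
time_new = rng.normal(loc=6.0, scale=2.0, size=50)  # page B (new page) -> x1
time_old = rng.normal(loc=5.0, scale=2.0, size=50)  # page A (old page) -> x2

# The statistic is based on mean(x1) - mean(x2) - value.
# value=0: under H0 the difference in means is 0.
# alternative="larger": H1 is mean(x1) - mean(x2) > value
# (statsmodels spells the one-sided options "larger"/"smaller", not "greater").
z_stat, p_value = ztest(time_new, time_old, value=0, alternative="larger", usevar="pooled")
print(f"z = {z_stat:.3f}, one-sided p-value = {p_value:.4f}")
```

With `usevar="pooled"`, the standard error uses the pooled sample variance. So the resulting number is the same as the pooled t statistic; only the reference distribution (normal rather than t) differs.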

Or, since I don't know the population standard deviation up front, should this be a two-sample t test instead? If so, I would need to look at the Python SciPy library.

Best Answer

In addition to what you say in your first set of bullets, I assume that both samples x1 and x2 come from (nearly) normal populations with equal population variances.

Although some texts and online pages seem to muddle the decision whether to use a two-sample z test or a 2-sample t test, this decision is quite easy to make:

  • if the common population variance $\sigma^2 = \sigma_1^2 = \sigma_2^2$ is known, then use a 2-sample z test. If $\sigma^2$ is unknown and estimated from the samples, then use a 2-sample t test.

  • In addition, you may use a pooled 2-sample t test only if you know that the two populations have the same variance, as assumed above. If you do not know whether the two populations have the same variance, then use a Welch 2-sample t test (which does not assume equal variances).
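For reference, here are the standard textbook test statistics behind these three options. They test $H_0: \mu_1 = \mu_2$ for sample means $\bar X_1, \bar X_2$, sample SDs $s_1, s_2$, and sample sizes $n_1, n_2$:

```latex
% Two-sample z test (common sigma known):
Z = \frac{\bar X_1 - \bar X_2}{\sigma\sqrt{1/n_1 + 1/n_2}} \sim \mathsf{Norm}(0, 1)

% Pooled two-sample t test (common sigma unknown, estimated by s_p):
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad
T = \frac{\bar X_1 - \bar X_2}{s_p\sqrt{1/n_1 + 1/n_2}} \sim t_{n_1 + n_2 - 2}

% Welch two-sample t test (equal variances not assumed), approximately t_\nu:
T_W = \frac{\bar X_1 - \bar X_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad
\nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}
                 {\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
```

When $n_1 = n_2$, the pooled and Welch statistics are numerically identical. In that case only the degrees of freedom differ.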

Here are fictitious samples x1 and x2 both of size $n_1=n_2 = 50,$ sampled using R, from populations with possibly different population means $\mu_1, \mu_2$ but with the same population variance $\sigma^2.$

Here are descriptions of the samples:

summary(x1); length(x1); sd(x1)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   45.03   62.61   71.72   69.53   78.67   86.64 
[1] 50         # sample size
[1] 10.97721   # sample SD

summary(x2); length(x2); sd(x2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  56.73   70.17   76.48   75.78   81.64  107.26 
[1] 50
[1] 10.08054

stripchart(list(x1,x2), pch="|", ylim=c(.5,2.5))

[Figure: stripchart of samples x1 and x2]

The sample means $\bar X_1 = 69.53$ and $\bar X_2 = 75.78$ differ. The question is whether they differ enough (relative to the sample sizes and variance) to be considered significantly different at the 5% level, in a statistical sense.

Here is output from a two-sample pooled t test in R. Because the P-value $0.004 < 0.05 = 5\%,$ the means are significantly different. [Note the parameter var.eq=T (R partially matches it to var.equal=TRUE), which performs a pooled test. Unless otherwise stated, a two-tailed test of the null hypothesis that the means are equal is performed.]

t.test(x1, x2, var.eq=T)

        Two Sample t-test

data:  x1 and x2
t = -2.9663, df = 98, p-value = 0.003787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.434612  -2.069356
sample estimates:
mean of x mean of y 
 69.52867  75.78066 
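As a check on the arithmetic, here is a short sketch that recomputes the pooled $t$ statistic and its degrees of freedom by hand. It uses the rounded means and SDs copied from the R output above, so the last digits may differ slightly from R's result on the raw data.

```python
import math

# Summary statistics copied from the R output above (rounded)
mean1, sd1, n1 = 69.52867, 10.97721, 50
mean2, sd2, n2 = 75.78066, 10.08054, 50

# Pooled estimate of the common population variance sigma^2
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)

# Standard error of the difference in means under equal variances
se_pooled = math.sqrt(sp2 * (1 / n1 + 1 / n2))

t_pooled = (mean1 - mean2) / se_pooled
df_pooled = n1 + n2 - 2
print(f"t = {t_pooled:.4f}, df = {df_pooled}")  # t = -2.9663, df = 98
```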

If we did not know that the population variances are equal, we would have done a Welch test, as below, which gives similar results for my fictitious data.

t.test(x1, x2)

        Welch Two Sample t-test

data:  x1 and x2
t = -2.9663, df = 97.297, p-value = 0.003792
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.434989  -2.068978
sample estimates:
 mean of x mean of y 
  69.52867  75.78066 
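Here is a similar sketch for the Welch test, again using the rounded summary statistics from the R output. With equal sample sizes the Welch $t$ matches the pooled $t$. The only change is the Welch–Satterthwaite approximation to the degrees of freedom, which R reports as 97.297.

```python
import math

# Summary statistics copied from the R output above (rounded)
mean1, sd1, n1 = 69.52867, 10.97721, 50
mean2, sd2, n2 = 75.78066, 10.08054, 50

# Squared standard error of each sample mean
v1, v2 = sd1**2 / n1, sd2**2 / n2

# Welch statistic: no common population variance is assumed
t_welch = (mean1 - mean2) / math.sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(f"t = {t_welch:.4f}, df = {df_welch:.3f}")
```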

I hope you can figure out how to run pooled and Welch 2-sample t tests using Python. P-values should be almost exactly the same as my results from R.
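Python cannot regenerate the raw R samples, because the random number generators differ. One way to compare with the R output is SciPy's `scipy.stats.ttest_ind_from_stats`, which works from summary statistics (`std1`/`std2` are the usual $n-1$ sample SDs, as printed by R's `sd`). Here is a minimal sketch using the rounded values above:

```python
from scipy.stats import ttest_ind_from_stats

# Rounded summary statistics copied from the R output; results may differ in the last digits
summary = dict(
    mean1=69.52867, std1=10.97721, nobs1=50,
    mean2=75.78066, std2=10.08054, nobs2=50,
)

# Pooled test (equal population variances assumed), like t.test(x1, x2, var.eq=T)
pooled = ttest_ind_from_stats(**summary, equal_var=True)

# Welch test (equal variances not assumed), like t.test(x1, x2)
welch = ttest_ind_from_stats(**summary, equal_var=False)

print(f"pooled: t = {pooled.statistic:.4f}, two-sided p = {pooled.pvalue:.6f}")
print(f"Welch:  t = {welch.statistic:.4f}, two-sided p = {welch.pvalue:.6f}")
```

With raw arrays rather than summaries, `scipy.stats.ttest_ind(a, b, equal_var=...)` performs the same two tests.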


Note: In an actual application one would not know the exact population parameters, but here is R code used to sample my fictitious data.

set.seed(1213)
x1 = rnorm(50, 70, 10)
x2 = rnorm(50, 75, 10)
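For the one-sided hypothesis in the question (mean time on the new page greater than on the old page), here is a sketch of a similar simulation in Python. It makes two illustrative assumptions: `x1` is treated as the old page and `x2` as the new page. NumPy's generator also does not reproduce R's `set.seed(1213)` draws, so the numbers will differ from the R output above. The `alternative` argument of `scipy.stats.ttest_ind` requires SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Fictitious data in the spirit of the R code above (different RNG, so different values)
rng = np.random.default_rng(1213)
x1 = rng.normal(loc=70, scale=10, size=50)  # assumed: old page A, population mean 70
x2 = rng.normal(loc=75, scale=10, size=50)  # assumed: new page B, population mean 75

# H1: mean(new) > mean(old), so the new page is passed first with alternative="greater"
welch = ttest_ind(x2, x1, equal_var=False, alternative="greater")
pooled = ttest_ind(x2, x1, equal_var=True, alternative="greater")

print(f"Welch:  t = {welch.statistic:.4f}, one-sided p = {welch.pvalue:.4f}")
print(f"pooled: t = {pooled.statistic:.4f}, one-sided p = {pooled.pvalue:.4f}")
```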