Solved – How to compare sub-sample mean with the sample mean

mean, statistical significance

I have a dataset with 983 rows and two columns: Rent and Postcode.

The mean Rent for the entire sample (983 rows) is 817.49.
The mean Rent for the N1 postcode (23 rows) is 887.02.

The N1 data is a subset of my entire sample; is it still possible to compare the means? How would I test whether Rent in N1 is significantly higher than average?

The statistical tests I have come across so far either rely on independent samples (I assume these are not independent as one is a subset of another), or dependent samples measured across time (which these are not).

Best Answer

You need the average separately for the two groups (N1 and not N1). From the information you posted, the mean for the not-N1 group can be calculated as $$ \frac{817.49 \cdot 983 - 887.02 \cdot 23}{983-23} \approx 815.82. $$ Because the N1 and not-N1 rows do not overlap, you can then use the independent-samples t-test (or some other test for independent samples).
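As a quick check of that subtraction, here is a minimal Python sketch using only the summary figures quoted in the question (817.49 for the whole sample). The variable names are my own. Note that the t-test itself also needs the per-group spread (or the raw rents), which the question does not give, so this only covers the mean.

```python
# Summary figures quoted in the question.
n_all, mean_all = 983, 817.49   # whole sample
n_n1, mean_n1 = 23, 887.02      # N1 subset

# Remove the N1 rent total from the overall total,
# then average over the rows that remain.
n_rest = n_all - n_n1
mean_rest = (mean_all * n_all - mean_n1 * n_n1) / n_rest

print(n_rest, round(mean_rest, 2))
```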

Related Solutions

Solved – Comparison of average values of data sets

The answer is "Yes". This is Simpson's paradox applied to mean differences instead of odds ratios. You can read the Wikipedia article (http://en.wikipedia.org/wiki/Simpson%27s_paradox) to understand the mechanisms behind it. It's a projection problem: if you see only a two-dimensional projection of a three-dimensional object, you can get quite a wrong impression of the whole picture. In balanced settings (equal group sizes), this is not possible.

Consider, for instance, the following simple setting:

$A_1$ consists of 99 times the value 1
$A_2$ consists of the value 100
$B_1$ consists of the value -9
$B_2$ consists of the value 99

The average of $A = A_1 \cup A_2$ is about 2 and thus much smaller than the average 45 of $B = B_1 \cup B_2$. On the other hand, the average 1 of $A_1$ is larger than the average -9 of $B_1$. Similarly, the average 100 of $A_2$ is larger than the average 99 of $B_2$.
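The four groups above are small enough to check directly. This is just a sketch of that arithmetic in Python; the list names are mine and simply mirror $A_1, A_2, B_1, B_2$.

```python
from statistics import mean

# The four groups from the example.
a1, a2 = [1] * 99, [100]
b1, b2 = [-9], [99]

mean_a = mean(a1 + a2)   # pooled A
mean_b = mean(b1 + b2)   # pooled B

# Within each stratum A is larger, yet pooled A is smaller.
print(mean(a1) > mean(b1), mean(a2) > mean(b2), mean_a < mean_b)
```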

Hypothesis Testing – How to Compare the Mean of Two Samples with Exponential Distributions

You can test equality of the mean parameters against the alternative that the mean parameters are unequal with a likelihood ratio test (LR test). (However, if the mean parameters do differ and the distribution is exponential, this is a scale shift, not a location shift.)

For a one-tailed test, I believe the LR test comes out to be equivalent to the test below (in the two-tailed case, only asymptotically). To show that it really is the same as the LR test in the one-tailed case, one would need to show that the LR statistic is monotonic in $\bar x/\bar y$.

Let's say we parameterize the $i$th observation in the first sample as having pdf $\frac{1}{\mu_x} \exp(-x_i/\mu_x)$ and the $j$th observation in the second sample as having pdf $\frac{1}{\mu_y} \exp(-y_j/\mu_y)$ (over the obvious domains for the observations and parameters).
(To be clear, we're working in the mean form, not the rate form, here; this won't affect the outcome of the calculations.)

Since the distribution of $X_i$ is a special case of the gamma, $\Gamma(1,\mu_x)$, the sum of the $X$'s, $S_x=\sum_i X_i$, is distributed $\Gamma(n_x,\mu_x)$; similarly, the sum of the $Y$'s, $S_y$, is distributed $\Gamma(n_y,\mu_y)$.

Because of the relationship between gamma distributions and chi-squared distributions, it turns out that $2S_x/\mu_x$ is distributed $\chi^2_{2n_x}$, and likewise $2S_y/\mu_y$ is distributed $\chi^2_{2n_y}$. The ratio of two independent chi-squares, each divided by its degrees of freedom, is F. Hence the ratio $\frac{\mu_y}{\mu_x}\frac{S_x/n_x}{S_y/n_y} \sim F_{2n_x,2n_y}$.
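Spelling out that last step (a sketch, assuming the two samples are independent of each other, and writing $\bar x = S_x/n_x$ and $\bar y = S_y/n_y$ for the sample means):

$$
\frac{\left(2S_x/\mu_x\right)/(2n_x)}{\left(2S_y/\mu_y\right)/(2n_y)}
= \frac{S_x/(n_x\mu_x)}{S_y/(n_y\mu_y)}
= \frac{\mu_y}{\mu_x}\,\frac{\bar x}{\bar y}
\sim F_{2n_x,2n_y}.
$$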

Under the null hypothesis of equality of means, then, $\bar x/\bar y \sim F_{2n_x,2n_y}$, and under the two sided alternative, the values might tend to be either smaller or larger than a value from the null distribution, so you need a two-tailed test.

Simulation to check that we didn't make some simple mistake in the algebra:

Here I simulated 1000 samples of size 30 for $X$ and 20 for $Y$ from an exponential distribution with the same mean, and computed the above ratio-of-means statistic.

Below is a histogram of the resulting distribution as well as a curve showing the $F$ distribution we computed under the null:

[Figure: simulated example distribution of the ratio statistic under the null]
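A rough way to rerun this kind of check is sketched below in Python with the standard library only. It is not the code behind the original figure: the seed and the common mean are arbitrary choices of mine, and instead of overlaying the $F$ density it only compares the simulated average of the ratio with the known mean of an $F_{d_1,d_2}$ variable, $d_2/(d_2-2)$.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed, only for reproducibility

n_x, n_y, n_sims = 30, 20, 1000
common_mean = 1.0  # arbitrary; the ratio is scale-free under the null


def ratio_of_means():
    # random.expovariate takes the rate, i.e. 1 / mean.
    x = [random.expovariate(1 / common_mean) for _ in range(n_x)]
    y = [random.expovariate(1 / common_mean) for _ in range(n_y)]
    return mean(x) / mean(y)


ratios = [ratio_of_means() for _ in range(n_sims)]

# An F(d1, d2) variable has mean d2 / (d2 - 2) when d2 > 2.
d1, d2 = 2 * n_x, 2 * n_y
theoretical_mean = d2 / (d2 - 2)

print(round(mean(ratios), 3), round(theoretical_mean, 3))
```

Agreement of the two printed numbers is only a sanity check on the first moment; a histogram against the density, as in the figure, is the stronger check.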

Example, with discussion of computation of two-tailed p-values:

To illustrate the calculation, here are two small samples from exponential distributions. The X-sample has 14 observations from a population with mean 10; the Y-sample has 17 observations from a population with mean 15:

x: 12.173  3.148 33.873  0.160  3.054 11.579 13.491  7.048 48.836 
   16.478  3.323  3.520  7.113  5.358

y:  7.635  1.508 29.987 13.636  8.709 13.132 12.141  5.280 23.447 
   18.687 13.055 47.747  0.334  7.745 26.287 34.390  9.596

The sample means are 12.082 and 16.077 respectively. The ratio of means, $r = \bar x/\bar y$, is 0.7515.
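These figures can be reproduced from the listed data. A small Python sketch (the variable names are mine):

```python
from statistics import mean

x = [12.173, 3.148, 33.873, 0.160, 3.054, 11.579, 13.491, 7.048, 48.836,
     16.478, 3.323, 3.520, 7.113, 5.358]
y = [7.635, 1.508, 29.987, 13.636, 8.709, 13.132, 12.141, 5.280, 23.447,
     18.687, 13.055, 47.747, 0.334, 7.745, 26.287, 34.390, 9.596]

mean_x, mean_y = mean(x), mean(y)
r = mean_x / mean_y  # ratio-of-means statistic

# The F reference distribution uses twice each sample size.
df_x, df_y = 2 * len(x), 2 * len(y)

print(round(mean_x, 3), round(mean_y, 3), round(r, 4), df_x, df_y)
```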

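Since the ratio falls in the lower tail, the area to its left is straightforward to compute (calculated in R, where r is the ratio of means and the degrees of freedom are $2 \cdot 14 = 28$ and $2 \cdot 17 = 34$):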

 > pf(r,28,34) 
 [1] 0.2210767

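We also need the probability for the other tail. If the distribution were symmetric under inversion (that is, if $F$ and $1/F$ had the same distribution, as happens when the two sample sizes are equal), this would be straightforward.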

A common convention with the ratio-of-variances F-test (which is similarly two-tailed) is simply to double the one-tailed p-value, which is effectively what is going on here; that also seems to be what R does, for example. In this case it gives a p-value of 0.44.

However, if you do it with a formal rejection rule, putting an area of $\alpha/2$ in each tail, you'd get critical values as described here. The p-value is then the smallest $\alpha$ that would lead to rejection, which is equivalent to adding the one-tailed p-value above to the one-tailed p-value in the other tail with the degrees of freedom interchanged. In the above example that gives a p-value of 0.43.

[Neither of these rules is "optimal" in small samples]
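For readers without R to hand, the two conventions can be approximated by simulation. The Python sketch below uses only the standard library, so it is a Monte Carlo stand-in for the exact `pf` calls, not a replacement for them. The seed and the number of draws are arbitrary, and reading "the other tail with the degrees of freedom interchanged" as the area of $F_{28,34}$ above $1/r$ is my interpretation of the text.

```python
import random

random.seed(2)  # arbitrary seed, only for reproducibility

n_x, n_y = 14, 17
r = 12.082 / 16.077   # observed ratio of sample means, from the text
n_sims = 200_000


def null_ratio():
    # Under the null, xbar / ybar is distributed as (G_x / n_x) / (G_y / n_y)
    # with independent G_x ~ Gamma(n_x, 1) and G_y ~ Gamma(n_y, 1),
    # which is the F(2 n_x, 2 n_y) distribution.
    g_x = random.gammavariate(n_x, 1.0)
    g_y = random.gammavariate(n_y, 1.0)
    return (g_x / n_x) / (g_y / n_y)


draws = [null_ratio() for _ in range(n_sims)]

# Lower tail: estimates pf(r, 28, 34).
p_lower = sum(d <= r for d in draws) / n_sims
# Other tail at the reciprocal: P(F >= 1/r), i.e. the lower tail at r
# with the degrees of freedom interchanged.
p_upper = sum(d >= 1 / r for d in draws) / n_sims

p_doubled = 2 * p_lower          # doubling convention
p_summed = p_lower + p_upper     # sum-of-tails convention

print(round(p_lower, 3), round(p_doubled, 2), round(p_summed, 2))
```

The estimates carry simulation noise of roughly 0.001, so they should agree with the values quoted above to about two decimal places.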
