Statistical Significance – How to Compare Equality of Distribution at Different Quantiles

quantiles, stata, statistical significance

I have a dataset from which I am taking a set of descriptive statistics as follows:

The value measured is the productivity of a firm, reported for each group at different quantiles (I use the Stata command: table group_var, c(p10 p25 p50… )).

The table below is repeated for two subsets of the main data, so I have two sets of descriptive statistics like this one.

              p10   p25   p50   p75   p90
group1        50    45    43    34    10
group2         ......

What I want to do is to compare the two descriptive stats for statistical significance:

so there will be:

       p10_1   p10_2
group1  50       52
group2  52      ....
....

I want to determine whether these are different. For p50 I am using a rank-sum (median) test in Stata, and for the mean I am running a t-test, but I am struggling to find a method for values at other quantiles. Can someone suggest the right approach?

Thanks,

Paul

Best Answer

You can do this using quantile regression.

The code below does the single-quantile case. It:

- estimates the q90 price for foreign and domestic cars with various repair records. Here origin is like your two cities and repair record is like your groups.
- calculates the statistics within each origin $\times$ repair cell from the model, which should match the output of the table command.
- tests the hypothesis that the q90 prices for each group are the same regardless of manufacturing origin.

Here's the output:

. sysuse auto, clear
(1978 automobile data)

. table rep78 foreign, statistic(p90 price) nototals

----------------------------------------
                   |      Car origin    
                   |  Domestic   Foreign
-------------------+--------------------
Repair record 1978 |                    
  1                |      4934          
  2                |     14500          
  3                |     13466      6295
  4                |      8814      9735
  5                |      4425     11995
----------------------------------------

. keep if rep78>2 & !missing(rep78)
(15 observations deleted)

. qreg price i.rep78##i.foreign, quantile(0.9) nolog

.9 Quantile regression                              Number of obs =         59
  Raw sum of deviations  39983.8 (about 11385)
  Min sum of deviations  32468.6                    Pseudo R2     =     0.1880

-------------------------------------------------------------------------------
        price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
        rep78 |
           4  |      -4652   2130.507    -2.18   0.033    -8925.256    -378.744
           5  |      -9041   4056.365    -2.23   0.030    -17177.04   -904.9629
              |
      foreign |
     Foreign  |      -7171   3368.627    -2.13   0.038    -13927.61   -414.3889
              |
rep78#foreign |
   4#Foreign  |       8092   4261.014     1.90   0.063    -454.5121    16638.51
   5#Foreign  |      14741   5483.728     2.69   0.010     3742.034    25739.97
              |
        _cons |      13466   1065.254    12.64   0.000     11329.37    15602.63
-------------------------------------------------------------------------------

. margins foreign#rep78, post // coeflegend
warning: cannot perform check for estimable functions.

Adjusted predictions                                        Number of obs = 59
Model VCE: IID

Expression: Linear prediction, predict()

-------------------------------------------------------------------------------
              |            Delta-method
              |     Margin   std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
  Domestic#3  |      13466   1065.254    12.64   0.000     11378.14    15553.86
  Domestic#4  |       8814   1845.073     4.78   0.000     5197.723    12430.28
  Domestic#5  |       4425   3913.991     1.13   0.258    -3246.282    12096.28
   Foreign#3  |       6295   3195.761     1.97   0.049     31.42429    12558.58
   Foreign#4  |       9735   1845.073     5.28   0.000     6118.723    13351.28
   Foreign#5  |      11995   1845.073     6.50   0.000     8378.723    15611.28
-------------------------------------------------------------------------------

. test ///
> (_b[0.foreign#3.rep78] = _b[1.foreign#3.rep78]) ///
> (_b[0.foreign#4.rep78] = _b[1.foreign#4.rep78]) ///
> (_b[0.foreign#5.rep78] = _b[1.foreign#5.rep78])

 ( 1)  0bn.foreign#3bn.rep78 - 1.foreign#3bn.rep78 = 0
 ( 2)  0bn.foreign#4.rep78 - 1.foreign#4.rep78 = 0
 ( 3)  0bn.foreign#5.rep78 - 1.foreign#5.rep78 = 0

           chi2(  3) =    7.72
         Prob > chi2 =    0.0522

The p-value on the two-sided null that the q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5 is .0522. This means that it is fairly unlikely that we would observe differences like this (or larger) if they were the same for each repair record group.

But you want to test more than one quantile at the same time, so you need to use sqreg for simultaneous-quantile regression. It produces the same coefficients as qreg for each quantile. Reported standard errors will be similar, but sqreg obtains an estimate of the VCE via bootstrapping, and the VCE includes between-quantile blocks. This lets you do tests comparing predictions at different quantiles:

. sysuse auto, clear
(1978 automobile data)

. table rep78 foreign, stat(p50 price) statistic(p90 price)  nototals

-----------------------------------------
                    |      Car origin    
                    |  Domestic   Foreign
--------------------+--------------------
Repair record 1978  |                    
  1                 |                    
    50th percentile |    4564.5          
    90th percentile |      4934          
  2                 |                    
    50th percentile |      4638          
    90th percentile |     14500          
  3                 |                    
    50th percentile |      4749      4296
    90th percentile |     13466      6295
  4                 |                    
    50th percentile |      5705      6229
    90th percentile |      8814      9735
  5                 |                    
    50th percentile |    4204.5      5719
    90th percentile |      4425     11995
-----------------------------------------

. keep if rep78>2 & !missing(rep78)
(15 observations deleted)

. sqreg price i.rep78##i.foreign, quantile(0.5 0.9) nolog

Simultaneous quantile regression                    Number of obs =         59
  bootstrap(20) SEs                                 .50 Pseudo R2 =     0.0574
                                                    .90 Pseudo R2 =     0.1880

-------------------------------------------------------------------------------
              |              Bootstrap
        price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
q50           |
        rep78 |
           4  |        956   410.4789     2.33   0.024     132.6835    1779.316
           5  |       -324   358.3828    -0.90   0.370    -1042.825    394.8248
              |
      foreign |
     Foreign  |       -453   1002.362    -0.45   0.653    -2463.483    1557.483
              |
rep78#foreign |
   4#Foreign  |        977   1286.285     0.76   0.451    -1602.962    3556.962
   5#Foreign  |       1747   1127.146     1.55   0.127    -513.7688    4007.769
              |
        _cons |       4749   326.4424    14.55   0.000      4094.24     5403.76
--------------+----------------------------------------------------------------
q90           |
        rep78 |
           4  |      -4652   1808.008    -2.57   0.013    -8278.405   -1025.595
           5  |      -9041   1618.042    -5.59   0.000    -12286.38   -5795.619
              |
      foreign |
     Foreign  |      -7171   2282.641    -3.14   0.003     -11749.4   -2592.601
              |
rep78#foreign |
   4#Foreign  |       8092    2949.62     2.74   0.008     2175.812    14008.19
   5#Foreign  |      14741     2256.1     6.53   0.000     10215.84    19266.16
              |
        _cons |      13466   1601.298     8.41   0.000      10254.2     16677.8
-------------------------------------------------------------------------------

. margins foreign#rep78, predict(equation(q50)) predict(equation(q90)) post // coeflegend

Adjusted predictions                                        Number of obs = 59
Model VCE: Bootstrap

1._predict: Linear prediction, predict(equation(q50))
2._predict: Linear prediction, predict(equation(q90))

----------------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
_predict#foreign#rep78 |
         1#Domestic#3  |       4749   326.4424    14.55   0.000     4109.185    5388.815
         1#Domestic#4  |       5705   330.9142    17.24   0.000      5056.42     6353.58
         1#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          1#Foreign#3  |       4296   975.0888     4.41   0.000     2384.861    6207.139
          1#Foreign#4  |       6229   860.8888     7.24   0.000     4541.689    7916.311
          1#Foreign#5  |       5719   990.1683     5.78   0.000     3778.306    7659.694
         2#Domestic#3  |      13466   1601.298     8.41   0.000     10327.51    16604.49
         2#Domestic#4  |       8814   1048.677     8.40   0.000     6758.631    10869.37
         2#Domestic#5  |       4425   221.6575    19.96   0.000     3990.559    4859.441
          2#Foreign#3  |       6295   1123.791     5.60   0.000     4092.411    8497.589
          2#Foreign#4  |       9735   1285.327     7.57   0.000     7215.806    12254.19
          2#Foreign#5  |      11995   1902.861     6.30   0.000     8265.462    15724.54
----------------------------------------------------------------------------------------

. test ///
> (_b[1._predict#0.foreign#3.rep78] = _b[1._predict#1.foreign#3.rep78]) ///
> (_b[1._predict#0.foreign#4.rep78] = _b[1._predict#1.foreign#4.rep78]) ///
> (_b[1._predict#0.foreign#5.rep78] = _b[1._predict#1.foreign#5.rep78]) ///
> (_b[2._predict#0.foreign#3.rep78] = _b[2._predict#1.foreign#3.rep78]) ///
> (_b[2._predict#0.foreign#4.rep78] = _b[2._predict#1.foreign#4.rep78]) ///
> (_b[2._predict#0.foreign#5.rep78] = _b[2._predict#1.foreign#5.rep78]) 

 ( 1)  1bn._predict#0bn.foreign#3bn.rep78 - 1bn._predict#1.foreign#3bn.rep78 = 0
 ( 2)  1bn._predict#0bn.foreign#4.rep78 - 1bn._predict#1.foreign#4.rep78 = 0
 ( 3)  1bn._predict#0bn.foreign#5.rep78 - 1bn._predict#1.foreign#5.rep78 = 0
 ( 4)  2._predict#0bn.foreign#3bn.rep78 - 2._predict#1.foreign#3bn.rep78 = 0
 ( 5)  2._predict#0bn.foreign#4.rep78 - 2._predict#1.foreign#4.rep78 = 0
 ( 6)  2._predict#0bn.foreign#5.rep78 - 2._predict#1.foreign#5.rep78 = 0

           chi2(  6) =   50.71
         Prob > chi2 =    0.0000

The factor variable notation above is tricky, but it is just quantile $\times$ origin $\times$ repair record level. The coeflegend can be useful here for decoding, but I left it commented out.

Here we reject the two-sided null that the q50 and q90 foreign and domestic prices are the same for repair record 3, the same for 4, and the same for 5: the p-value is effectively zero.
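
As a rough cross-check outside Stata, the difference in a single quantile between two groups can also be examined with a percentile bootstrap. This is a different technique from the quantile regression above, not a reimplementation of it. The sketch below uses base R only; the function name quantile_diff_boot and the simulated groups g1 and g2 are hypothetical, and R's default quantile definition need not match Stata's percentile definition exactly.

```r
# Percentile-bootstrap interval for the difference in one quantile between two groups.
# quantile_diff_boot is a hypothetical helper, not part of any package.
quantile_diff_boot <- function(a, b, prob = 0.9, reps = 2000) {
  # Observed difference in the chosen quantile
  obs <- unname(quantile(a, prob) - quantile(b, prob))
  # Resample each group independently and recompute the difference
  draws <- replicate(reps, {
    quantile(sample(a, replace = TRUE), prob) -
      quantile(sample(b, replace = TRUE), prob)
  })
  list(diff = obs, ci = unname(quantile(draws, c(0.025, 0.975))))
}

set.seed(42)                                    # arbitrary seed
g1 <- rlnorm(300, meanlog = 3.0, sdlog = 0.5)   # simulated data for group 1
g2 <- rlnorm(300, meanlog = 3.4, sdlog = 0.5)   # simulated data for group 2
res <- quantile_diff_boot(g1, g2)
print(res)
```

If the interval excludes zero, that points the same way as a rejection at roughly the 5% level. Unlike sqreg, this sketch handles one quantile and one pair of groups at a time, so it does not give a joint test across quantiles.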

Related Solutions

Solved – Regression of Y on different quantiles of X in Stata

This is a highly problematic statistical practice. Besides greatly reducing power, it actually assumes a very strange relationship between X and Y. Suppose that X was linearly related to Y and that X did not have a uniform distribution. Then the proposed approach would come up with a nonlinear relationship. It does not make sense for X to relate to Y by the shape of the marginal distribution of X. A better approach would be to fit a regression spline in the original X. You are also assuming that slope changes occur at certain percentiles of X, which is hard to believe.
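
To illustrate the spline suggestion, here is a minimal sketch in R rather than Stata, using the splines package that ships with R. The simulated data, the choice of df = 4, and the object names are all assumptions made for the example.

```r
library(splines)                       # ships with R

set.seed(3)                            # arbitrary seed
x <- rexp(500)                         # skewed (non-uniform) predictor
y <- 2 + 3 * x + rnorm(500)            # relationship that is truly linear in x

fit_linear <- lm(y ~ x)                # straight line in the original x
fit_spline <- lm(y ~ ns(x, df = 4))    # natural cubic spline in the original x

# The linear model is nested in the natural spline model,
# so this F test checks for departures from linearity.
print(anova(fit_linear, fit_spline))
```

The spline works on the original scale of X, so it lets the data indicate whether and where the slope changes instead of fixing the change points at percentiles of X.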

Solved – How to determine overlap of two empirical distribution based on quantiles

Because you will be doing this for $\binom{10}{2}=45$ pairs of distributions, you will want a reasonably efficient method.

The question asks to solve (at least approximately) an equation of the form $G_0(\alpha)-G_1(1-\alpha)=0$ where the $G_i$ are the inverse empirical CDFs. Equivalently, you could solve $F_0(z)+F_1(z)-1=0$ where the $F_i$ are the empirical CDFs. That is best done with a root-finding method which does not assume the function is differentiable (or even continuous) because these functions are discontinuous: they jump at the data values.
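
To see why the two formulations agree, here is a short sketch. It assumes each $F_i$ is continuous and strictly increasing, so that $G_i = F_i^{-1}$ exactly; for empirical CDFs the equalities hold only approximately. Let $z$ be the common value of the two quantile functions:

$$
z = G_0(\alpha) = G_1(1-\alpha)
\;\Longrightarrow\;
F_0(z) = \alpha, \quad F_1(z) = 1-\alpha
\;\Longrightarrow\;
F_0(z) + F_1(z) - 1 = 0.
$$

Conversely, if $F_0(z)+F_1(z)-1=0$, setting $\alpha = F_0(z)$ gives $F_1(z) = 1-\alpha$, and therefore $G_0(\alpha) = G_1(1-\alpha) = z$.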

In R, uniroot will do the job. Although it assumes the functions are continuous (it uses Brent's Method, I believe), R's implementation of the empirical CDFs makes them look sufficiently continuous. To make this method work you need to bracket the root between known bounds, but this is easy: it must lie within the range of the union of both datasets.

The code is remarkably simple: given two data arrays x and y, create their empirical CDF functions F.x and F.y, then invoke uniroot. That's all you need.

overlap <- function(x, y) {
  # Empirical CDFs of the two samples
  F.x <- ecdf(x); F.y <- ecdf(y)
  # The root must lie within the range of the pooled data
  z <- uniroot(function(z) F.x(z) + F.y(z) - 1, interval = range(c(x, y)))
  return(list(Root=z, F.x=F.x, F.y=F.y))
}

It is reasonably fast: applied to all $45$ pairs of ten datasets ranging in size from $1000$ to $8000$, it found the answers in a total of $0.12$ seconds.

Alternatively, notice that the desired point is the median of an equal mixture of the two distributions. When the two datasets are the same size, just obtain the median of the union of all the data! You can generalize this to datasets of different sizes by computing weighted medians. This capability is available via quantile regression (in the quantreg package), which accommodates weights: regress the data against a constant and weight them in inverse proportion to the sizes of the datasets.

overlap.rq <- function(x, y) {
  library(quantreg)
  # Weighted median of the pooled data via an intercept-only median regression
  fit <- rq(c(x,y) ~ 1, 
            weights=c(rep(1/length(x), length(x)), rep(1/length(y), length(y))))
  return(coef(fit))
}

Timing tests show this is at least three times slower than the root-finding method and it does not scale as well for larger datasets: on the preceding test with $45$ pairs of datasets it took $1.67$ seconds, more than ten times slower. The chief advantage is that this particular implementation of weighted medians will issue warnings when the answer does not appear unique, whereas Brent's method tends to find unique answers right in the middle of an interval of possible answers.
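
For comparison, a weighted median can also be computed directly in base R by sorting the pooled data and accumulating the weights. This is a hand-rolled sketch, not the quantreg implementation: the helper name weighted_median is hypothetical, and it returns the smallest data value whose cumulative weight reaches one half, without warning about non-unique answers.

```r
# Weighted median: smallest value whose cumulative (normalized) weight reaches 1/2.
# weighted_median is a hypothetical helper, not part of quantreg.
weighted_median <- function(values, weights) {
  ord <- order(values)
  cum <- cumsum(weights[ord]) / sum(weights)
  values[ord][which(cum >= 0.5)[1]]
}

set.seed(17)
x <- rnorm(32, 5, 2)
y <- rgamma(10, 2)

# Weight each observation in inverse proportion to its sample size,
# so both samples contribute equally to the mixture.
w <- c(rep(1 / length(x), length(x)), rep(1 / length(y), length(y)))
z <- weighted_median(c(x, y), w)

F.x <- ecdf(x); F.y <- ecdf(y)
total <- F.x(z) + F.y(z)       # first crosses 1 at z, since the ECDFs jump there
print(c(z = z, total = total))
```

Because the empirical CDFs jump at data values, the sum $F_x(z)+F_y(z)$ generally overshoots $1$ by up to one jump rather than hitting it exactly.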

As a demonstration, here is a plot of two empirical CDFs along with vertical lines showing the two solutions (and horizontal lines marking the levels of $\alpha$ and $1-\alpha$). In this particular case, the two methods produce the same answer so only one vertical line appears.

#
# Generate some data.
#
set.seed(17)
x <- rnorm(32, 5, 2)
y <- rgamma(10, 2)
#
# Compute the solution two ways.
#
solution <- overlap(x, y)
solution.rq <- overlap.rq(x, y)
F.x <- solution$F.x; F.y <- solution$F.y; z <- solution$Root
alpha <- c(F.x(z$root), F.y(z$root))
#
# Plot the ECDFs and the results.
#
plot(range(c(x, y)), 0:1, type="n", xlab="z", ylab="Probability", main="CDFs")
curve(F.x(x), add=TRUE, lwd=2, col="Red")
curve(F.y(x), add=TRUE, lwd=2, col="Blue")
abline(v=z$root, lty=2)
abline(v=solution.rq, lty=2, col="Green")
abline(h=alpha, lty=3, col="Gray")
