Solved – How many data points are in a given quantile in Quantile regression

quantile regression, quantiles, regression

I hope somebody can help me with a probably very fundamental question about quantile regression.

My dataset is very skewed, so I've been looking at the data with quantile regression, because the literature claims that this allows us to make inferences about the marginal parts of the distribution. However, when we're dealing with a specific quantile, isn't the usual way of framing it that 'quantile x contains all values below y'?

So, suppose I perform quantile regression for all the deciles (0.1, 0.2, …, 0.9) and look, for example, at the results for quantile 0.8. Am I then looking at regression coefficients calculated for all members of the dataset below the 0.8 quantile value? Or exactly 'how far back' does the regression include data points?

Attempting to illustrate with graphics: when I look at regression coefficients for the 0.8 quantile, how much of my data is included in the calculation of these results? The entire distribution up to 0.8, or just the members between the current and the previous quantile regression point (0.7)?

Best Answer

Quantile regression uses the full distribution for every quantile. The best way to understand it is by analogy with ordinary linear regression, which estimates how much covariate $A$ affects the mean of the outcome $y$. Quantile regression at the median (50th percentile) estimates how much covariate $A$ affects the position of the median of the outcome $y$. You need the whole distribution to determine the position of the median (i.e. the point at which 50% of $y$ is above and 50% of $y$ is below).

The same reasoning carries over to the 80th percentile. Quantile regression tells you how much covariate $A$ affects the position of the 80th percentile, i.e. the point at which 20% of the sample is above and 80% is below. So it uses the whole distribution to calculate this estimate.
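As a minimal sketch of why every observation enters the estimate, consider an intercept-only model on simulated skewed data. The names `check_loss` and `total_loss` and the simulated `y` are illustrative assumptions, not part of the question. Quantile regression minimizes the check (pinball) loss summed over all observations: points above the candidate value are weighted by $\tau$ and points below by $1 - \tau$, so no point is dropped.

```r
set.seed(1)
y <- rexp(101)   # simulated right-skewed outcome (assumption for illustration)
tau <- 0.8

# Check (pinball) loss: a residual u is weighted by tau if positive
# and by (1 - tau) in absolute value if negative
check_loss <- function(u, tau) u * (tau - (u < 0))

# Total loss of a candidate value q: the sum runs over ALL observations
total_loss <- function(q) sum(check_loss(y - q, tau))

# In the intercept-only case a minimizer is attained at a data point,
# so it is enough to evaluate the loss at each observed value
q_hat <- y[which.min(sapply(y, total_loss))]

q_hat                        # minimizer of the check loss
quantile(y, tau, type = 1)   # empirical 0.8 quantile, for comparison
mean(y <= q_hat)             # share of the sample at or below the estimate
```

The points below the estimate still contribute to the loss, only with a smaller weight; that asymmetric weighting is what places the estimate at the 80% mark rather than at the centre.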

Your example plot is confusing. In future please provide actual code and sample data used to produce such a plot so we can give you more direct insight.

Using a sample data set on the effect of income on food expenditure, in R we have:

    library(quantreg)
    data(engel)  # Engel food expenditure data shipped with quantreg
    head(engel)

    # Coefficient tables: row 1 is the intercept, row 2 is the income slope
    Linear.Regression <- coef(summary(lm(foodexp ~ income, data = engel)))
    Median <- coef(summary(rq(foodexp ~ income, tau = 0.5, data = engel)))
    Eightieth <- coef(summary(rq(foodexp ~ income, tau = 0.8, data = engel)))

    plot(x = engel$income, y = engel$foodexp, cex = 0.5,
         xlab = "Income", ylab = "Food Expenditure")
    abline(a = Linear.Regression[1, 1], b = Linear.Regression[2, 1],
           col = "red")
    abline(a = Median[1, 1], b = Median[2, 1], col = "blue")
    abline(a = Eightieth[1, 1], b = Eightieth[2, 1], col = "green")

This produces a scatterplot of food expenditure against income with three fitted lines.

The red line is the linear estimate, the blue line is the median estimate (50th percentile), and the green line is the 80th percentile estimate.
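A quick check on the same data, sketched here under the assumption that `quantreg` is installed, shows both halves of the argument: the 0.8 fit has one residual per row of `engel`, and roughly 80% of the points fall below the fitted line. The names `fit80`, `n_used` and `share_below` are introduced only for this illustration.

```r
library(quantreg)
data(engel)

fit80 <- rq(foodexp ~ income, tau = 0.8, data = engel)
res <- residuals(fit80)

n_used <- length(res)          # one residual per observation in the fit
share_below <- mean(res < 0)   # share of points strictly below the fitted line

n_used == nrow(engel)          # TRUE if every row contributed to the estimate
share_below                    # close to 0.8
```

The share is not exactly 0.8 because a few observations lie on the fitted line itself.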

Note: this is the same example used by Roger Koenker (2005), "Quantile Regression".

Related Solutions

Solved – Quantile Regression vs OLS for homoscedasticity

Will the estimated slope coefficient $\beta_1$ always be the same for OLS and for QR for different quantiles?

No, of course not, because the empirical loss function being minimized differs in these different cases (OLS vs. QR for different quantiles).
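To make the difference concrete, the two objectives can be written side by side. The notation ($x_i$, $\beta_0$, $\beta_1$, $\rho_\tau$) is the standard textbook form and is an assumption here, not taken from the question:

$$\hat\beta^{\text{OLS}} = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, \qquad \hat\beta(\tau) = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \rho_\tau(y_i - \beta_0 - \beta_1 x_i),$$

where $\rho_\tau(u) = u\,(\tau - \mathbf{1}\{u < 0\})$. For $\tau = 0.5$ this reduces to $\rho_{0.5}(u) = |u|/2$, so median regression minimizes the sum of absolute residuals, while other values of $\tau$ weight positive and negative residuals asymmetrically.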

I am well aware that in the presence of homoscedasticity all the slope parameters for different quantile regressions will be the same and that the QR models will differ only in the intercept.

No, not in finite samples. Here is an example taken from the help files of the quantreg package in R:

    library(quantreg)
    data(stackloss)
    rq(stack.loss ~ stack.x, tau = 0.50)  # median (L1) regression fit for the stackloss data
    rq(stack.loss ~ stack.x, tau = 0.25)  # the 1st quartile

However, asymptotically they will all converge to the same true value.
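A small simulation can illustrate both statements. This is only a sketch: the data-generating process (true slope 2, standard normal errors), the sample sizes, and the helper `slopes` are assumptions chosen for illustration.

```r
library(quantreg)

# Simulated homoscedastic data: true slope 2, error variance constant in x
# (all values here are assumptions chosen for illustration)
slopes <- function(n) {
  x <- runif(n, 0, 10)
  y <- 1 + 2 * x + rnorm(n)
  c(ols = unname(coef(lm(y ~ x))[2]),
    q25 = unname(coef(rq(y ~ x, tau = 0.25))[2]),
    q50 = unname(coef(rq(y ~ x, tau = 0.50))[2]),
    q75 = unname(coef(rq(y ~ x, tau = 0.75))[2]))
}

set.seed(42)
small <- slopes(50)     # finite sample: the four slope estimates are not identical
large <- slopes(5000)   # larger sample: all four estimates sit near the true slope

small
large
```

With the larger sample the estimates cluster around the true slope, while the small sample shows the spread between OLS and the different quantile fits.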

But in the case of homoscedasticity, shouldn't outliers cancel each other out because positive errors are as likely as negative ones, rendering the OLS and median QR slope coefficients equivalent?

No. First, perfect symmetry of errors is not guaranteed in any finite sample. Second, minimizing the sum of squares vs. absolute values will in general lead to different values even for symmetric errors.

Solved – Stepwise quantile regression: What’s the reason behind these strange results

By setting $\tau=0.99$ you are essentially fitting to the noise at the extreme high-end of your sampling distribution. If you imagine the only-slightly-more-extreme case with $\tau=1.0$, you would be fitting exclusively to the points comprising the upper-envelope of your data, in other words sparsely-distributed points which are almost guaranteed to be unrepresentative outliers. See the plot below for reference.
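The instability of extreme quantiles can be sketched without any regression at all, using plain sample quantiles. The simulated exponential data, the sample size of 100 and the names `draws`, `sd_q50` and `sd_q99` are assumptions for illustration only.

```r
set.seed(123)
n <- 100
reps <- 2000

# Each replicate draws a fresh sample and records two sample quantiles
draws <- replicate(reps, {
  y <- rexp(n)
  c(q50 = unname(quantile(y, 0.50)), q99 = unname(quantile(y, 0.99)))
})

sd_q50 <- sd(draws["q50", ])
sd_q99 <- sd(draws["q99", ])

n * (1 - 0.99)                        # expected number of points above the 0.99 quantile
c(sd_q50 = sd_q50, sd_q99 = sd_q99)   # sampling variability of each estimate
```

With only about one observation expected above the 0.99 quantile, the estimate is driven by the few most extreme points, so it varies far more from sample to sample than the median does.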

What is it about your specific use-case that makes you want to perform your fit at such a high percentile?
