Regression – Is the Intuition Behind the Weight Matrix Correct for Quantile Regression?

Tags: quantile-regression, quantiles, regression, residuals

Motivating Question

I was asked by a colleague today why one would run a quantile regression on "extreme" quantiles (such as $.10$, $.90$, etc.) if there are too few observations in that tail (say a total of $n=200$, of which only a small fraction lies near the quantile of interest). This perplexed me, as this is the second time I've been asked this question, but my way of explaining it may be limited. As I understand it, quantile regressions estimate conditional quantiles based on a weight matrix that involves all of the values in the distribution; consequently, the location of the quantile shifts the "fulcrum" of the weighting rather than restricting the estimate to that quantile's data points alone.

Indeed, when we run an OLS regression, we expect a naive estimate of $y$ to be its mean, $\bar{y}$, and the conditional estimate to be the conditional mean of $y$ given $x$, or $E(y \mid x)$. This is not estimated with just the values around the mean, but with the entire distribution of values. Similarly, in the quantile regression framework we instead estimate the conditional quantile $Q_\tau(y \mid x)$, where $\tau$ indexes the quantile of interest. Because the estimation method uses weighted absolute residuals rather than sums of squared residuals, my best guess of how to visualize this is to plot lines of residuals sized by their weighting for a given distribution.
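For reference, the weighting I have in mind is the check (or "pinball") loss that quantile regression minimizes, in which residuals above the fitted line receive weight $\tau$ and residuals below it receive weight $1 - \tau$:

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^\top \beta\right), \qquad \rho_\tau(u) = \begin{cases} \tau\, u, & u \ge 0, \\ (\tau - 1)\, u, & u < 0. \end{cases}$$

Every observation enters this sum, which is why I expect the entire distribution to matter even for extreme $\tau$.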

Solution

Here I have run some R code for a quantile regression which estimates $\tau = .25$, the conditional 25th percentile. I have varied the thickness of the residual lines around the fitted line to reflect the weights the residuals receive in the fitting: points above the line get weight $.25$ and points below get weight $.75$.

#### Load Libraries ####
library(quantreg)
library(tidyverse)

#### Sim Data ####
set.seed(123)
x <- rnorm(200)
y <- (.40 * x) + rnorm(200)
plot(x, y)

#### Fit Q25 Regression ####
qu <- .25
fit <- rq(
  y ~ x,
  tau = qu
)
summary(fit)

#### Plot ####
broom::augment(fit) %>% 
  # check-loss weights: tau for points above the line, 1 - tau below
  mutate(weight = ifelse(.resid > 0, qu, 1 - qu)) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point(
    size = 3,
    color = "steelblue"
  ) +
  stat_quantile(
    quantiles = qu
  ) +
  geom_segment(
    aes(xend = x,
        yend = .fitted,
        linewidth = weight),
    alpha = .3  # constant alpha belongs outside aes()
  ) +
  scale_linewidth_identity() +  # draw the weights as literal line widths
  theme_classic() +
  theme(legend.position = "none") +
  labs(x = "Simulated X",
       y = "Simulated Y",
       title = "Weighted Observations for Q25")

Shown below:

[Plot: scatter of the simulated data with the fitted Q25 line; residual segments below the line are drawn thicker (weight .75) than those above it (weight .25).]

Is this a correct way to visualize this? Is my understanding incorrect?

Edit

Whuber's advice provides a useful test case for my assumptions. Running the same regression with an extreme quantile of $\tau = .001$ gives us this:

#### Load Libraries ####
library(quantreg)
library(tidyverse)

#### Sim Data ####
set.seed(123)
x <- rnorm(200)
y <- (.40 * x) + rnorm(200)

#### Fit Q001 Regression ####
qu <- .001
fit <- rq(
  y ~ x,
  tau = qu
)
summary(fit)

#### Plot ####
broom::augment(fit) %>% 
  # check-loss weights: tau for points above the line, 1 - tau below;
  # at tau = .001 nearly every residual is positive, so nearly every
  # segment gets width .001
  mutate(weight = ifelse(.resid > 0, qu, 1 - qu)) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point(
    size = 3,
    color = "steelblue"
  ) +
  stat_quantile(
    quantiles = qu
  ) +
  geom_segment(
    aes(xend = x,
        yend = .fitted,
        linewidth = weight),
    alpha = .3
  ) +
  scale_linewidth_identity() +
  theme_classic() +
  theme(legend.position = "none") +
  labs(x = "Simulated X",
       y = "Simulated Y",
       title = "Weighted Observations for Q001")

Which gives us this plot:

[Plot: the fitted Q.001 line lies beneath essentially all of the points; the residual segments above it are nearly invisible.]

But this seems wrong on two fronts: 1) the fitted line here is so extreme that there are essentially no points below it where you can find residuals, which in the case I am considering isn't reality (fitting to $\tau = .01$ gives similar results); 2) the weighting of the lines now looks bad (perhaps due to poor coding), so this doesn't instruct me further on what is wrong or right here.
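To quantify the first point, a quick check of the residual signs (reusing the fit above; the 1e-8 tolerance is just an arbitrary numerical cutoff) confirms that the line sits at or beneath essentially every observation at this $\tau$:

sum(resid(fit) < 0)          # 0: no points strictly below, since n * tau < 1
sum(abs(resid(fit)) < 1e-8)  # 2: the two points the line passes through exactly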

Best Answer

The OP wrote

As I understand it, quantile regressions estimate conditional quantiles based on a weight matrix that involves all of the values in the distribution (...)

Two pieces of information on the matter:

  1. In the paper that introduced quantile regression (Koenker, R., & Bassett Jr., G. (1978). Regression quantiles. Econometrica, 46(1), 33–50), Theorem 3.1 tells us that the coefficient vector for a given quantile is an exact linear function of a subset of the sample. Let $\{y, X\}$ denote the sample (dependent variable, regressor matrix), and let $h$ be a set of observation indices with cardinality equal to the number of regressors (columns of $X$). Then $\{y(h), X(h)\}$ represents a subset of the sample in which $X(h)$ is a square matrix, and Theorem 3.1 states that, for quantile probability $\tau$, the solution/optimizer vector is $$\beta(\tau)^* = X(h)^{-1}y(h).$$ This means, for example, that if your regressor matrix has two columns (say, one constant and one variable), then the coefficient vector (for each $\tau$) is determined by the above function of just two observations (a numerical check of this appears at the end of this answer).

  2. In Koenker's book "Quantile Regression" (2005), the author writes (p. 11)

One occasionally encounters the view that quantile regression estimators must "ignore sample information" since they are inherently determined by a small subset of the observations. This view neglects the obvious fact that all the observations participate in which "basic" observations are selected as basic.

He refers to how the members of $h$ are chosen.

These properties of quantile regression come from the fact that it can be formulated as a linear programming problem.
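
To see Theorem 3.1 in action, here is a minimal sketch in R (reusing the simulated data from the question; the 1e-8 tolerance is an arbitrary numerical cutoff for identifying interpolated points): with an intercept and one regressor, the fit passes exactly through two observations, and those two observations alone reproduce the coefficients.

library(quantreg)

set.seed(123)
x <- rnorm(200)
y <- (.40 * x) + rnorm(200)

fit <- rq(y ~ x, tau = .25)

# Theorem 3.1: the solution interpolates p = 2 observations
h <- which(abs(resid(fit)) < 1e-8)
length(h)  # 2

# Reconstruct the coefficients from just those two points: X(h)^{-1} y(h)
Xh <- cbind(1, x[h])
solve(Xh, y[h])  # matches coef(fit)
coef(fit)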
