Solved – Regression methods for predicting rank

multiple regression, ranks

Is there a canonical regression approach for predicting the ranks of a response?

I'd like to fit a regression to a dataset where the response is highly non-normal with very large outliers. There are about 10 predictors. I haven't had much success with transformations (the best has been adding a constant and then logging the response twice, but this isn't very interpretable).

However, I only care about the ranks of the response. The response is really only a score that is used as an instrument for ranking observations. What I really want to know is which predictors explain the most variation in the ranks.

My approach has been the following:

  1. Calculate the ranks of the response: for each observation $i$, compute $R(Y_i)$
  2. Let $N$ be the number of observations. Then, approximately, $U_i = \frac{R(Y_i)}{N + 1} \sim Unif(0, 1)$ (dividing by $N + 1$ rather than $N$ keeps each $U_i$ strictly inside $(0, 1)$, so the next step is always finite)
  3. By inverse transform sampling (the inverse of the probability integral transform), $Z_i = \Phi^{-1}(U_i) \sim N(0, 1)$, approximately
  4. Use $Z$ as my response in a regression of $Z$ on the predictors

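A minimal sketch of the four steps in Python (numpy/scipy; the simulated data and all variable names are illustrative assumptions, not part of the question):

```python
import numpy as np
from scipy.stats import rankdata, norm

rng = np.random.default_rng(0)

# Illustrative data: 10 predictors, heavy-tailed response with large outliers
N, p = 200, 10
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
y = X @ beta + rng.standard_cauchy(N)  # highly non-normal response

# Step 1: ranks of the response
ranks = rankdata(y)

# Step 2: scale into the unit interval; dividing by N + 1 keeps every U_i
# strictly inside (0, 1), so step 3 never produces +/- infinity
U = ranks / (N + 1)

# Step 3: inverse-normal (van der Waerden) scores
Z = norm.ppf(U)

# Step 4: ordinary least squares of Z on the predictors
X1 = np.column_stack([np.ones(N), X])
coef, *_ = np.linalg.lstsq(X1, Z, rcond=None)
```

Because both transformations are monotone, `Z` preserves the ordering of `y` exactly, so the coefficients from the last step describe variation in rank rather than in the raw score.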
Since the rank and inverse-CDF transformations are monotone and thus preserve rank, I reason that this regression approach will help me identify which covariates are most predictive of rank.

Does this approach work? Is there a better or more standard approach to predicting rank from a set of covariates? Googling around, I found this paper by Kloke and McKean, but I don't know how accepted or well known the approach is: https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Kloke+McKean.pdf

Thanks!

Best Answer

From what I can tell, the rank-based estimation that paper describes is slightly different from what you're interested in. Note that least-squares estimation is based on the idea that $\boldsymbol \beta$ should be chosen to minimize $||\boldsymbol y - \boldsymbol X \boldsymbol \beta||^2$. This isn't suitable in your case because the distribution of $y$ isn't very nice, and it's also not really of interest. However, the focus of the paper is still to predict $y$ as a linear function of $X$. The only difference is the way in which it estimates $\boldsymbol \beta$: it chooses $\boldsymbol \beta$ to minimize a rank-based norm, which is still applied to the residuals $\boldsymbol y - \boldsymbol X \boldsymbol \beta$. Hence, this method still depends heavily on the distribution of $y$.

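To make the distinction concrete, the paper's estimator can be sketched as minimizing Jaeckel's rank-based dispersion of the residuals. This is a toy illustration with Wilcoxon scores; the simulated data and names are made up, and the actual Rfit package does considerably more:

```python
import numpy as np
from scipy.stats import rankdata
from scipy.optimize import minimize

def jaeckel_dispersion(beta, X, y):
    """Rank-based dispersion D(beta) = sum(a(R(e_i)) * e_i), Wilcoxon scores."""
    e = y - X @ beta
    u = rankdata(e) / (len(e) + 1)     # scaled residual ranks in (0, 1)
    scores = np.sqrt(12) * (u - 0.5)   # Wilcoxon score function, centered at 0
    return np.sum(scores * e)

rng = np.random.default_rng(1)
N = 300
X = rng.normal(size=(N, 1))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_t(df=2, size=N)  # heavy-tailed errors

# The dispersion is invariant to adding an intercept, so only the slope is
# found by minimization; the intercept is recovered from the residuals.
res = minimize(jaeckel_dispersion, x0=np.zeros(1), args=(X, y),
               method="Nelder-Mead")
slope = res.x[0]
intercept = np.median(y - X @ res.x)
```

Note that the residuals $\boldsymbol y - \boldsymbol X \boldsymbol \beta$ still enter the criterion directly, which is why the method remains tied to the distribution of $y$.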
You mentioned that you only care about the ranks of the response variable. In other words, you'd be just as well off using $X$ to model $R(Y)$ rather than $Y$ itself. The fact that the scaled ranks $R(Y)/N$ are confined to the unit interval means that the usual linear regression approach may not work: you could end up with predictions outside $[0, 1]$, and the relationship between $X$ and the scaled ranks need not be linear anyway. But this really isn't a problem. The usual modeling approach in this situation is to employ a Generalized Linear Model (GLM); the only additional step in fitting this model is to choose an appropriate link function.

For example, suppose $X \sim Normal(0, 1)$ and $Y|X \sim Normal(\beta_0 + \beta_1 X, \sigma^2)$. It would then be appropriate to use $X$ to model $R(Y)$ with a GLM and a logit or probit link.
