[Math] How to weight three different variables to create a ranking

standard deviation, statistics

I have data on the graduation rate, tuition, and average student income (a measure of accessibility) for about $7,000$ colleges and universities in the U.S., and I'm trying to compose an interactive in which people can manipulate the importance of these three variables to create a ranking. Users drag three sliders to emphasize or deemphasize the importance of the variables:

[Image: the three sliders used to weight the variables]

I've standardized the variables through the usual means so that each value is the number of standard deviations from the mean. Each variable also has a weight assigned by the sliders, which add to 1. One user might give $50\%$ weight to graduation rate and $25\%$ to the other two; another might give $33\%$ to all three, and so forth.

My first instinct was to just multiply each weight by the corresponding standardized value and add the products together for each school's score. But this seems to overly reward outliers in one area, when the goal is to surface schools that perform well overall. Graduation rate is negatively correlated with accessibility and positively correlated with cost, FWIW.
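To make the concern concrete, here is a minimal sketch of that weighted sum in Python. The two schools and their standardized scores are made up purely for illustration, and the function name `weighted_sum` is hypothetical.

```python
def weighted_sum(z_scores, weights):
    """Plain weighted sum of standardized scores (weights assumed to sum to 1)."""
    return sum(w * z for w, z in zip(weights, z_scores))

# Made-up standardized scores (graduation rate, tuition, accessibility).
lopsided = (3.0, -0.5, -0.5)   # far above average in one area, below in two
balanced = (0.6, 0.6, 0.6)     # moderately above average everywhere

weights = (1 / 3, 1 / 3, 1 / 3)

score_a = weighted_sum(lopsided, weights)
score_b = weighted_sum(balanced, weights)

# The lopsided school comes out ahead even though it is below
# average in two of the three categories.
print(score_a > score_b)
```

With equal weights, the single large value is enough to push the lopsided school past the balanced one, which is the behavior described above.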

Is there an accepted way to weight multivariate systems like this in a better way?

Best Answer

If $x_1,\ldots,x_n$ are the scores in $n$ categories assigned to some school, pick some $p$ from $(0,\infty)$ and compute the total score as $$ \bar x = \left(\frac{1}{n}\sum_{k=1}^n \textrm{sgn}(x_k)\sqrt[p]{|x_k|} \right)^p \text{ where }\textrm{sgn}(x) = \begin{cases} -1 & \text{if $x < 0$} \\ 0 & \text{if $x=0$} \\ 1 &\text{if $x > 0$} \end{cases} $$ If $x_1 = \ldots = x_n = x$, you always get $\bar x = (\frac{1}{n}n\,\textrm{sgn}(x)\sqrt[p]{|x|})^p = \textrm{sgn}(x)\,|x| = x$. The larger $p$ gets, the less influence a large score in a single category has on the value. For example, if $x_1=x_2=0$, $x_3=x$, then $\bar x = \frac{x}{3^p}$.

To incorporate weights into this, replace the $\frac{1}{n}\sum \ldots$ part, which simply averages the $p$-th roots of the $x_k$, by a weighted average. If your weights are $w_1,\ldots,w_n$, the weighted total score is $$ \bar x = \left(\frac{1}{w}\sum_{k=1}^n w_k\textrm{sgn}(x_k)\sqrt[p]{|x_k|} \right)^p \text{ where $w = w_1 + \ldots + w_n$.} $$
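The weighted formula could be sketched in Python roughly as follows. The function names and the example scores are hypothetical. One assumption goes beyond what is written above: when the weighted average of the roots is negative, the outer power is applied to its absolute value and the sign is restored afterwards, so that equal scores $x$ still give $\bar x = x$ for negative $x$.

```python
import math

def signed_root(x, p):
    """sgn(x) * |x|**(1/p): the p-th root with the original sign restored."""
    return math.copysign(abs(x) ** (1.0 / p), x)

def weighted_score(scores, weights, p):
    """Weighted total score: (weighted mean of signed p-th roots) ** p."""
    if p <= 0:
        raise ValueError("p must be positive")
    total_weight = sum(weights)
    inner = sum(w * signed_root(x, p) for x, w in zip(scores, weights)) / total_weight
    # Assumption: keep the sign of a negative inner average instead of
    # raising a negative number to a possibly non-integer power.
    return math.copysign(abs(inner) ** p, inner)

# Made-up standardized scores for two schools, weights of 50%/25%/25%.
weights = (0.5, 0.25, 0.25)
lopsided = (3.0, -0.5, -0.5)
balanced = (0.6, 0.6, 0.6)

for p in (1, 2, 3):
    print(p, weighted_score(lopsided, weights, p), weighted_score(balanced, weights, p))
```

With $p = 1$ this reduces to the ordinary weighted average. The weights do not need to sum to 1, since the function divides by their total.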

You'll have to play with various values for $p$ to see which work well. Generally, if $p > 1$, values close to zero will influence the result more than values far away from zero. On the other hand, if $p < 1$, then large values will have more influence. In the limit $p \to 0$, $\bar x$ is the value furthest away from zero, with the other values having no influence (i.e., the opposite of what you seem to want). So you'll want to only look at values $p > 1$.
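One way to see why, assuming for simplicity that all scores are nonnegative so the sign handling drops out (the symbol $q$ is introduced here only for illustration): substituting $q = 1/p$ turns the unweighted formula into $$ \bar x = \left(\frac{1}{n}\sum_{k=1}^n x_k^{q}\right)^{1/q}, \qquad q = \frac{1}{p}, $$ which is the power mean (generalized mean) with exponent $q$. For $p = 1$ this is the arithmetic mean, and as $q \to \infty$ (that is, $p \to 0$) it tends to $\max_k x_k$. By the power mean inequality the value is nondecreasing in $q$, so any $p > 1$ (that is, $q < 1$) gives a total no larger than the arithmetic mean, and strictly smaller unless all scores are equal.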


Note that the business with $\textrm{sgn}$ and the absolute value $|x_k|$ is only necessary if the scores can be negative. It simply works around the issue that $\sqrt[p]{x_k}$ isn't defined for negative $x_k$, so what we do is we take the $p$-th root of the absolute value $|x_k|$, and then add the original sign of $x_k$ back by multiplying with $\textrm{sgn}(x_k)$.
