Ranking – Creating an Index of Quality from Multiple Variables

ranking, valuation

I have four numeric variables, all of them measures of soil quality: the higher the value, the higher the quality. Each has a different range:

Var1 from 1 to 10

Var2 from 1000 to 2000

Var3 from 150 to 300

Var4 from 0 to 5

I need to combine the four variables into a single soil quality score that rank-orders the samples correctly.

My idea is very simple: standardize all four variables and sum them; the result is the score, which should rank-order the samples. Do you see any problem with this approach? Is there another (better) approach that you would recommend?
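For concreteness, here is a minimal sketch of the z-score summation described above. The sample values are made up for illustration (they only respect the stated ranges), and the function name `zscore_sum` is hypothetical. It uses the population standard deviation; the sample standard deviation would give the same ranking.

```python
from statistics import mean, pstdev

def zscore_sum(rows):
    """Return one score per row: the sum of that row's column-wise z-scores."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    if any(sd == 0 for sd in sds):
        raise ValueError("a constant column cannot be standardized")
    return [
        sum((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]

# Made-up samples, ordered as (Var1, Var2, Var3, Var4).
samples = [
    (2, 1100, 160, 0.5),
    (9, 1900, 290, 4.5),
    (5, 1500, 220, 2.5),
]

scores = zscore_sum(samples)
# Indices of the samples, best score first.
ranking = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
```

Note that summing z-scores gives every variable equal weight and assumes each one contributes linearly to quality.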

Thanks

Edit:

Thanks, guys. A lot of the discussion went into "domain expertise" (agriculture stuff), whereas I expected more stats talk. As for technique, I will probably use simple z-score summation plus logistic regression as an experiment. Because the vast majority of samples (90%) are of poor quality, I'm going to combine three quality categories into one and treat this as a binary problem (some quality vs. no quality). That kills two birds with one stone: I increase my event rate, and I make use of the experts by getting them to classify my samples. The expert-classified samples will then be used to fit a logistic regression model that maximizes concordance with the experts. How does that sound to you?
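A rough sketch of that experiment, under stated assumptions: the inputs are already z-scored, the labels (1 = some quality, 0 = no quality) stand in for expert classifications, and all numbers are invented for illustration. The model is fitted with plain batch gradient descent rather than a statistics library, and concordance is taken to mean the share of (positive, negative) pairs in which the positive sample receives the higher predicted probability. The names `fit_logistic`, `predict_proba`, and `concordance` are hypothetical.

```python
import math

def predict_proba(w, b, x):
    """Predicted probability of 'some quality' for one standardized sample."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def concordance(probs, labels):
    """Share of (positive, negative) pairs ranked in the right order."""
    pairs = [
        (p1, p0)
        for p1, l1 in zip(probs, labels) if l1 == 1
        for p0, l0 in zip(probs, labels) if l0 == 0
    ]
    return sum(p1 > p0 for p1, p0 in pairs) / len(pairs)

# Invented, already standardized samples: (Var1, Var2, Var3, Var4).
X_train = [
    (-1.2, -0.8, -1.0, -0.9),
    (-0.5, -0.3, -0.6, -0.2),
    (0.1, -0.4, 0.0, -0.3),
    (-0.7, 0.2, -0.5, -0.6),
    (1.4, 1.1, 1.3, 1.2),
    (0.9, 0.2, 0.8, 0.8),
]
y_train = [0, 0, 0, 0, 1, 1]  # stand-in for expert labels

w, b = fit_logistic(X_train, y_train)
probs = [predict_proba(w, b, x) for x in X_train]
c = concordance(probs, y_train)
```

Concordance measured on the training samples is optimistic; a held-out set of expert-classified samples would give a fairer estimate.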

Best Answer

The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:

  1. It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.

  2. Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.

  3. There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.

  4. The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?

  5. The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?

  6. Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?

  7. Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
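Point 2 can be seen in a small numerical sketch. The readings below are made up, and `zscore` is a hypothetical helper; the aim is only to show that the standardized value of one unchanged measurement shifts when other samples are added to the data set.

```python
from statistics import mean, pstdev

def zscore(value, column):
    """Standardize one value against the column it belongs to."""
    return (value - mean(column)) / pstdev(column)

# Made-up Var1 readings, then the same readings plus three high ones.
var1_small = [2, 5, 9]
var1_large = var1_small + [10, 10, 10]

# The raw measurement (5) is identical in both cases.
z_before = zscore(5, var1_small)
z_after = zscore(5, var1_large)
```

Here the same reading sits near the average of the small data set but well below the average of the enlarged one, so its contribution to a summed score changes even though the soil itself has not.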

One would hope that soil science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, what its coefficients should be. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
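If such comparisons did support a linear combination, the resulting score might take a form like the sketch below. This is not the MAVT elicitation procedure itself, only the kind of additive score it could justify. The reference ranges are the ones stated in the question; the equal weights and the plain linear rescaling are placeholder assumptions that the comparisons (or soil science) would have to supply, and `additive_score` is a hypothetical name. Because the ranges are fixed in advance, a sample's score does not change when other samples are added or removed.

```python
# Reference ranges taken from the question.
RANGES = {
    "Var1": (1, 10),
    "Var2": (1000, 2000),
    "Var3": (150, 300),
    "Var4": (0, 5),
}

# Placeholder assumption: equal weights summing to 1.
WEIGHTS = {name: 0.25 for name in RANGES}

def additive_score(sample):
    """Weighted sum of variables rescaled to [0, 1] over fixed ranges."""
    total = 0.0
    for name, (low, high) in RANGES.items():
        rescaled = (sample[name] - low) / (high - low)
        total += WEIGHTS[name] * rescaled
    return total

# Made-up sample at the midpoint of every range.
midpoint = {"Var1": 5.5, "Var2": 1500, "Var3": 225, "Var4": 2.5}
mid_score = additive_score(midpoint)
```

A nonlinear re-expression (a logarithm, say) would replace the linear rescaling step for any variable where the comparisons call for it.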
