Weight of Evidence (WoE) in Logistic Regression – A Guide

categorical-data · logistic · modeling · regression

This is a question regarding a practice followed by some of my colleagues. When building a logistic regression model, I have seen people replace categorical variables (or continuous variables that have been binned) with their respective Weight of Evidence (WoE). This is supposedly done to establish a monotonic relationship between the regressor and the dependent variable. Now, as far as I understand, once the model is built, the variables in the equation are NOT the variables in the dataset. Rather, the variables in the equation are now a kind of importance or weight of the variables in segregating the dependent variable!

My question is: how do we now interpret the model or the model coefficients? For example, for the following equation:
$$
\log\bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1x_1
$$

we can say that $\exp(\beta_1)$ is the multiplicative change in the odds for a 1-unit increase in the variable $x_1$.

But if the variable is replaced by its WoE, then the interpretation changes to: the multiplicative change in the odds for a 1-unit increase in the IMPORTANCE / WEIGHT of the variable.

I have seen this practice on the internet, but nowhere have I found an answer to this question. This link from this community relates to a somewhat similar query, where someone wrote:

WoE displays a linear relationship with the natural logarithm of the odds ratio, which is the dependent variable in logistic regression. Therefore, the question of model misspecification does not arise in logistic regression when we use WoE instead of the actual values of the variable.

But I still don't get the explanation. Please help me understand what I am missing.

Best Answer

The WoE method consists of two steps:

  1. split a continuous variable into a few categories, or group a discrete variable into a few categories (in both cases you assume that all observations in one category have the "same" effect on the dependent variable)
  2. calculate the WoE value for each category (the original x values are then replaced by the WoE values)
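A minimal sketch of step 2, using the common definition WoE = ln(% of events / % of non-events) per category; the counts below are hypothetical, and note that some texts flip the sign by putting non-events in the numerator:

```python
import math

# Hypothetical binned data: counts of y = 1 ("event") and y = 0 ("non_event")
# per category of a binned variable.
counts = {
    "low":    {"event": 10, "non_event": 90},
    "medium": {"event": 30, "non_event": 70},
    "high":   {"event": 60, "non_event": 40},
}

def woe_table(counts):
    """WoE per category: ln(share of all events / share of all non-events)."""
    total_event = sum(c["event"] for c in counts.values())
    total_non_event = sum(c["non_event"] for c in counts.values())
    return {
        cat: math.log((c["event"] / total_event) /
                      (c["non_event"] / total_non_event))
        for cat, c in counts.items()
    }

woe = woe_table(counts)
# Each original x value would now be replaced by woe[its category].
```

Categories where the event is relatively more frequent get a higher WoE, so the replacement values are automatically ordered by the category's event rate.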

The WoE transformation has (at least) three positive effects:

  1. It can transform an independent variable so that it has a monotonic relationship to the dependent variable. Actually, it does more than this: to secure a monotonic relationship it would be enough to recode the variable to any ordered measure (for example 1, 2, 3, 4, ...), but the WoE transformation places the categories on a "logistic" scale, which is natural for logistic regression

  2. Variables with too many (sparsely populated) discrete values can be grouped into (densely populated) categories, and the WoE can then express the information for the whole category

  3. The (univariate) effect of each category on the dependent variable can be compared directly across categories and across variables, because WoE is a standardized value (for example, you can compare the WoE of married people to the WoE of manual workers)
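The "logistic scale" claim in point 1 can be checked numerically: a category's WoE equals that category's log-odds minus the overall log-odds, so WoE values are exactly log-odds shifts. A small demonstration with hypothetical counts:

```python
import math

# Hypothetical counts per category: (events with y = 1, non-events with y = 0).
counts = {"A": (20, 80), "B": (50, 50)}

total_e = sum(e for e, _ in counts.values())
total_n = sum(n for _, n in counts.values())
overall_log_odds = math.log(total_e / total_n)

# WoE as usually defined: ln(share of events / share of non-events).
woe = {
    cat: math.log((e / total_e) / (n / total_n))
    for cat, (e, n) in counts.items()
}

# The same numbers obtained as "category log-odds minus overall log-odds".
log_odds_shift = {
    cat: math.log(e / n) - overall_log_odds
    for cat, (e, n) in counts.items()
}
```

Because ln((e/TE)/(n/TN)) = ln(e/n) − ln(TE/TN), the two dictionaries agree term by term; this identity is why WoE-coded categories sit naturally on the log-odds scale that logistic regression models.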

It also has (at least) three drawbacks:

  1. Loss of information (variation) due to binning into a few categories

  2. It is a univariate measure, so it does not take correlations between independent variables into account

  3. It is easy to manipulate (overfit) the effect of a variable through how its categories are created

Conventionally, the betas of the regression (where x has been replaced by WoE) are not interpreted per se; instead, each beta is multiplied by a WoE value to obtain a "score". For example, the beta for the variable "marital status" can be multiplied by the WoE of the "married" group to get the score of married people, and the beta for "occupation" can be multiplied by the WoE of "manual workers" to get the score of manual workers. If you are then interested in the score of married manual workers, you sum these two scores to see how large the effect on the outcome is. The higher the score, the greater the probability of an outcome equal to 1.
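The scoring step above can be sketched as follows; the beta and WoE values are made-up illustrations, not fitted estimates:

```python
# Hypothetical fitted coefficients, one per WoE-coded variable.
betas = {"marital_status": 0.8, "occupation": 1.1}

# Hypothetical WoE values per category of each variable.
woe = {
    "marital_status": {"married": 0.35, "single": -0.20},
    "occupation":     {"manual": -0.45, "office": 0.30},
}

def score(profile, betas, woe):
    """Sum of beta * WoE over the categories in a profile."""
    return sum(betas[var] * woe[var][cat] for var, cat in profile.items())

# Score of a married manual worker: 0.8 * 0.35 + 1.1 * (-0.45).
married_manual = score(
    {"marital_status": "married", "occupation": "manual"}, betas, woe
)
```

Adding the intercept to such a score gives the model's predicted log-odds for that profile, which is how credit scorecards typically turn WoE regressions into point totals.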
