This is a question about a practice followed by some of my colleagues. When building a logistic regression model, I have seen people replace categorical variables (or binned continuous variables) with their respective Weight of Evidence (WoE). This is supposedly done to establish a monotonic relationship between the regressor and the dependent variable. As far as I understand, once the model is built, the variables in the equation are NOT the original variables from the dataset; rather, they now represent something like the importance or weight of each variable in separating the classes of the dependent variable!
My question is : how do we now interpret the model or the model coefficients? For example for the following equation :
$$
\log\bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1x_1
$$
we can say that $\exp(\beta_1)$ is the odds ratio associated with a one-unit increase in $x_1$, i.e. the multiplicative change in the odds.
But if the variable is replaced by its WoE, the interpretation changes to: the odds ratio for a one-unit increase in the IMPORTANCE / WEIGHT of the variable.
I have seen this practice on the internet, but nowhere have I found an answer to this question. This link from this community addresses a somewhat similar query, where someone wrote:
WoE displays a linear relationship with the natural logarithm of the
odds ratio which is the dependent variable in logistic regression.
Therefore, the question of model misspecification does not arise in
logistic regression when we use WoE instead of the actual values of
the variable.
But I still don't get the explanation. Please help me understand what I am missing.
Best Answer
The WoE method consists of two steps:
Split the variable into a small number of bins (categories)
Replace each category with its WoE value, the log of the ratio between the category's share of one outcome class and its share of the other
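These two steps (binning, then replacing each category with its WoE value) can be sketched as follows. This is a minimal illustration with made-up data; it uses the convention WoE = ln(share of non-events / share of events), and other texts use the inverse ratio, which only flips the sign.

```python
import math

def woe_table(categories, target):
    """Weight of Evidence per category for a binary target (1 = event).

    WoE(c) = ln( category's share of non-events / category's share of events ).
    """
    total_events = sum(target)
    total_nonevents = len(target) - total_events
    counts = {}  # category -> (event count, non-event count)
    for c, y in zip(categories, target):
        ev, ne = counts.get(c, (0, 0))
        counts[c] = (ev + y, ne + (1 - y))
    woe = {}
    for c, (ev, ne) in counts.items():
        # In practice zero cells need smoothing or regrouping; this sketch
        # assumes every category contains both events and non-events.
        woe[c] = math.log((ne / total_nonevents) / (ev / total_events))
    return woe

# Step 1 is assumed done (the variable is already categorical here);
# step 2 replaces each observed category by its WoE value.
cats = ["a", "a", "a", "a", "b", "b", "b", "b"]
y    = [1, 0, 0, 0, 1, 1, 1, 0]
table = woe_table(cats, y)          # {"a": ln 3, "b": -ln 3}
x_woe = [table[c] for c in cats]    # the WoE-coded regressor
```

The WoE-coded column `x_woe` is what then enters the logistic regression in place of the original categories.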
The WoE transformation has (at least) three positive effects:
It can transform an independent variable so that it has a monotonic relationship to the dependent variable. In fact it does more than this: to secure a monotonic relationship it would be enough to "recode" it to any ordered measure (for example 1, 2, 3, 4, ...), but the WoE transformation actually orders the categories on a "logistic" scale, which is natural for logistic regression
For variables with too many (sparsely populated) discrete values, these can be grouped into (densely populated) categories, and the WoE can be used to express the information for the whole category
The (univariate) effect of each category on the dependent variable can be compared directly across categories and across variables, because WoE is a standardized value (for example, you can compare the WoE of married people to the WoE of manual workers)
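The "logistic scale" point can be made precise with Bayes' theorem. Writing $f(c \mid y)$ for the share of category $c$ among observations with outcome $y$:
$$
\log\bigg(\frac{P(y=1\mid c)}{P(y=0\mid c)}\bigg) = \log\bigg(\frac{P(y=1)}{P(y=0)}\bigg) + \log\bigg(\frac{f(c\mid y=1)}{f(c\mid y=0)}\bigg),
$$
where the second term on the right is the WoE of category $c$ (up to the sign convention used). So the WoE coding places each category exactly at its univariate log-odds, shifted by a constant; this is why a univariate logistic regression on the WoE-coded variable ideally recovers a slope of about $1$ (or $-1$, depending on the convention) and an intercept close to the prior log-odds.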
It also has (at least) three drawbacks:
Loss of information (variation) due to binning into a few categories
It is a "univariate" measure so it does not take into account correlation between independent variables
It is easy to manipulate (overfit) the effect of variables according to how categories are created
Conventionally, the betas of the regression (where each x has been replaced by its WoE) are not interpreted per se; instead they are multiplied by the WoE values to obtain a "score". For example, the beta for "marital status" can be multiplied by the WoE of the "married" group to get the score of married people, and the beta for "occupation" can be multiplied by the WoE of "manual workers" to get the score of manual workers. If you are interested in the score of married manual workers, you sum these two scores to see the total effect on the outcome. The higher the score, the greater the probability of an outcome equal to 1.
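The scoring scheme described above can be sketched as follows. All beta values and WoE numbers here are made up for illustration; in practice the betas come from the fitted regression and the WoE values from the binning step.

```python
# Hypothetical fitted coefficients from a logistic regression on
# WoE-coded inputs, and made-up WoE values per category.
betas = {"marital_status": 0.8, "occupation": 1.1}
woe = {
    "marital_status": {"married": 0.35, "single": -0.20},
    "occupation": {"manual": -0.50, "office": 0.40},
}

def score(profile):
    """Sum beta * WoE over the variables in a profile.

    Higher scores correspond to a greater probability of an outcome
    of 1, given the sign conventions used when fitting.
    """
    return sum(betas[var] * woe[var][cat] for var, cat in profile.items())

# Score of married manual workers: 0.8 * 0.35 + 1.1 * (-0.50) = -0.27
married_manual = score({"marital_status": "married", "occupation": "manual"})
```

The intercept $\beta_0$ is omitted here because it shifts every profile's score by the same constant and so does not affect comparisons between profiles.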