Solved – Interpretation of logistic regression with normalized features

cross-correlation, interpretation, logistic, train

With logistic regression, a one-unit change in $X_1$ is associated with a $\beta_1$ change in the log odds of 'success' (equivalently, an $\exp(\beta_1)$-fold change in the odds), all else being equal. But if one first normalizes cross-correlated features (e.g. subtracts the mean and divides by the standard deviation), is it valid to simply apply the inverse transformation to $\beta_1$ in order to interpret a one-unit change in the un-normalized value when considering the raw data? The normalization described above has no effect on the cross-correlation of the features themselves, but I am curious whether it would affect the outputs (and specifically the signs) of the $\beta_i$ being trained.

Best Answer

The interpretation of logistic regression coefficients is similar in the case where you've standardized the data (subtract mean, divide by standard deviation of each feature). By standardizing, you effectively change the units to standard deviations above/below the mean. So, a one standard deviation increase in $X_1$ corresponds to a $\beta_1$ increase in the log odds. If you fit to standardized data, you can transform the coefficients back to the original units (or vice versa): the raw-scale coefficient for feature $i$ is $\beta_i^{\text{std}} / \sigma_i$, where $\mu_i$ and $\sigma_i$ are the feature's mean and standard deviation, and the raw-scale intercept is $\beta_0^{\text{std}} - \sum_i \beta_i^{\text{std}} \mu_i / \sigma_i$.
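To make this concrete, here is a minimal sketch (plain Python, Newton-Raphson, one synthetic feature; all names, the data-generating parameters, and the fitting routine are illustrative, not a specific library's API) showing that fitting to standardized data simply rescales the coefficients: the standardized-scale slope equals the raw-scale slope times the feature's standard deviation.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=50):
    """Fit intercept b0 and slope b1 by Newton-Raphson (no penalty)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += p - y            # gradient w.r.t. intercept
            g1 += (p - y) * x      # gradient w.r.t. slope
            h00 += w               # 2x2 Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic data with true log-odds -1 + 0.8 * x (illustrative values)
random.seed(0)
xs = [random.gauss(5.0, 2.0) for _ in range(200)]
ys = [1 if random.random() < sigmoid(-1.0 + 0.8 * x) else 0 for x in xs]

# Fit on the raw data
b0_raw, b1_raw = fit_logistic(xs, ys)

# Standardize and refit
mu = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
zs = [(x - mu) / sd for x in xs]
b0_std, b1_std = fit_logistic(zs, ys)

# Same fitted model, rescaled coefficients:
#   b1_std = b1_raw * sd   and   b0_std = b0_raw + b1_raw * mu
print(b1_std, b1_raw * sd)
```

The two fits produce identical predicted probabilities; only the units of the coefficients change, which is why either set can be recovered from the other.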

If you fit a vanilla logistic regression model to standardized vs. non-standardized data, the coefficients will take different values in each case, but both models will fit equally well (or poorly). However, this is not necessarily true if you're fitting a regularized model (e.g. with $\ell_1$ or $\ell_2$ penalties on the coefficients): because the penalty acts on the magnitudes of the coefficients, rescaling a feature changes how strongly that feature is effectively penalized. In this case, it's common practice to standardize first, so that all features are penalized equally.
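A self-contained sketch of that last point (plain Python, Newton-Raphson with an $\ell_2$ penalty on the slope and an unpenalized intercept; the penalty strength and data-generating values are illustrative): the same penalty applied on the raw and standardized scales yields genuinely different fits, unlike the unpenalized case.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_ridge_logistic(xs, ys, lam, iters=50):
    """Newton-Raphson fit of intercept b0 and slope b1, adding an
    l2 penalty (lam / 2) * b1**2 on the slope only."""
    b0 = b1 = 0.0
    for _ in range(iters):
        # Penalty contributes lam * b1 to the gradient, lam to the Hessian
        g0, g1, h00, h01, h11 = 0.0, lam * b1, 0.0, 0.0, lam
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += p - y
            g1 += (p - y) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic data with true log-odds -1 + 0.8 * x (illustrative values)
random.seed(0)
xs = [random.gauss(5.0, 2.0) for _ in range(200)]
ys = [1 if random.random() < sigmoid(-1.0 + 0.8 * x) else 0 for x in xs]

mu = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
zs = [(x - mu) / sd for x in xs]

lam = 10.0  # illustrative penalty strength
b0_raw, b1_raw = fit_ridge_logistic(xs, ys, lam)
b0_std, b1_std = fit_ridge_logistic(zs, ys, lam)

# Without a penalty these would agree exactly; with one they do not,
# because the same lam bears on coefficients living on different scales.
print(b1_std, b1_raw * sd)
```

Here the raw feature has a standard deviation near 2, so the raw-scale slope is smaller in magnitude and the fixed penalty shrinks it relatively little, while the standardized-scale slope is shrunk much more: the two penalized models make different predictions. This is the asymmetry that standardizing before regularizing removes.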