Each point on the partial dependence plot is the average vote percentage in favor of the "Yes trees" class across all observations, given a fixed level of TRI.
It is not a probability of correct classification, and it has nothing to do with accuracy, true negatives, or true positives.
The phrase

Values greater than TRI 30 begin to have a positive influence for classification in your model

is a puffed-up way of saying
Values greater than TRI 30 begin to predict "Yes trees" more strongly than values lower than TRI 30
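As a concrete illustration, here is a minimal sketch of computing one such point by hand; rf (a fitted randomForest classifier) and dat (a data frame of the observations, containing a TRI column) are hypothetical stand-ins for the original model and data:

library(randomForest)

# One partial-dependence point: the average predicted vote fraction for
# "Yes trees" with TRI fixed at a given value for every observation
pd_point <- function(tri_value, rf, dat) {
  tmp <- dat
  tmp$TRI <- tri_value                      # fix TRI for all observations
  votes <- predict(rf, tmp, type = "prob")  # per-observation vote fractions
  mean(votes[, "Yes trees"])                # average across observations
}

tri_grid <- seq(min(dat$TRI), max(dat$TRI), length.out = 50)
pd <- vapply(tri_grid, pd_point, numeric(1), rf = rf, dat = dat)
plot(tri_grid, pd, type = "l", xlab = "TRI",
     ylab = "Mean vote fraction for 'Yes trees'")

(randomForest's own partialPlot() does the same averaging but, for classification, reports it on a centered log-probability scale; the idea is the same.)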
Thanks to Wilco Emons for the following solution to the problem:
In polr, the cumulative link model is parameterized a bit differently than in the Agresti book referred to: the linear predictor is subtracted from the thresholds rather than added, i.e. logit P(Y <= k) = zeta_k - x'beta. The problem can be solved by changing the code line:
probLALR[,k] <- inv.logit(b[k] + a[1]*0 + a[2]*0 + a[3]*Pred + a[4]*0*0 +
                          a[5]*Pred*0 + a[6]*Pred*0)
into
probLALR[,k] <- inv.logit(b[k] - (a[1]*0 + a[2]*0 + a[3]*Pred + a[4]*0*0 +
                                  a[5]*Pred*0 + a[6]*Pred*0))
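To see this sign convention concretely, here is a small check on the housing data that ships with MASS (a stand-in for the original model m2). It rebuilds the fitted category probabilities from the thresholds zeta and the linear predictor lp, using plogis(), base R's equivalent of inv.logit():

library(MASS)

# Toy proportional-odds fit with 3 outcome categories: Low, Medium, High
m <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# polr parameterizes the model as logit P(Y <= k) = zeta_k - eta,
# i.e. the linear predictor is subtracted, not added
cum1 <- plogis(m$zeta[1] - m$lp)   # P(Y <= Low)
cum2 <- plogis(m$zeta[2] - m$lp)   # P(Y <= Medium)
probs <- cbind(cum1, cum2 - cum1, 1 - cum2)

all.equal(unname(probs), unname(m$fitted.values))  # TRUE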
Thanks also to Achim Zeileis for his suggestion to use predict(m2, type="prob")!
Below is a way to calculate the probabilities by means of the predict() function:
# Grid of (standardized) predictor values at which to evaluate the model
Pred <- seq(-3, 3, by = 0.01)

# New data: hold the factors f.adm and f.riv at level "0" and vary RIV.st
PRED.LALR <- data.frame(f.adm  = factor(rep(0, length(Pred))),
                        f.riv  = factor(rep(0, length(Pred))),
                        RIV.st = Pred)

# Fitted probabilities for the six outcome categories from the polr model m2
prob.LALR <- predict(m2, PRED.LALR, type = "prob")

# Expected category score: weight each category's probability by its rank
scoreLALR <- prob.LALR[, 1]*1 + prob.LALR[, 2]*2 + prob.LALR[, 3]*3 +
             prob.LALR[, 4]*4 + prob.LALR[, 5]*5 + prob.LALR[, 6]*6

plot(Pred, scoreLALR, col = "green", ylim = c(1, 6))
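The weighted sum above is just the expected category score under the fitted probabilities, so the same quantity can be written as a single matrix product:

# Equivalent to the sum above: E[score] = sum over k of k * P(Y = k)
scoreLALR <- as.vector(prob.LALR %*% (1:6))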
Best Answer
Partial plots don't have to point in the same direction as the univariate relationship in the data; in fact, this is what makes them useful. Partial plots show you the marginal effect of just this variable. It is likely that there are predictors in your dataset heavily correlated with Education=10th and Education=Doctorate that already account for the univariate effect. Once that effect is controlled for, Education=Doctorate really does reduce your propensity toward whatever your outcome is.

Here's a contrived example. Imagine we're trying to predict drinks_coffee from education and a likes_coffee indicator (a sketch of such data follows below). Univariately, education=Doctorate seems to imply a greater propensity to drink coffee. However, if we include likes_coffee in a model, the effect of having education=Doctorate actually decreases your propensity to drink coffee. likes_coffee soaks up the overwhelming majority of the signal, but it's only possible to like coffee and not drink it if you have a Doctorate.
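Here is a minimal simulation of that pattern, using a logistic regression as a stand-in for the partial-dependence comparison; the probabilities below are invented purely to reproduce the described effect and are not the original example's table:

# Contrived data: invented probabilities, chosen only so that the
# univariate and controlled effects of education point in opposite ways
set.seed(1)
n <- 1000
education <- factor(sample(c("10th", "Doctorate"), n, replace = TRUE))
# Doctorates are much more likely to like coffee...
likes_coffee <- rbinom(n, 1, ifelse(education == "Doctorate", 0.9, 0.4))
# ...but among coffee-likers, Doctorates drink slightly less often,
# and almost nobody drinks coffee without liking it
drinks_coffee <- rbinom(n, 1,
                        ifelse(likes_coffee == 1,
                               ifelse(education == "Doctorate", 0.70, 0.95),
                               0.05))

# Univariately, Doctorate looks like it increases coffee drinking:
coef(glm(drinks_coffee ~ education, family = binomial))["educationDoctorate"]
# Controlling for likes_coffee, the Doctorate effect flips negative:
coef(glm(drinks_coffee ~ education + likes_coffee,
         family = binomial))["educationDoctorate"]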
Does education come high in relative influence? Are there other big predictors that could be explaining the massive univariate difference? Of course, it's always possible your model has a bug in it.