This post is related to these two posts: here and here.
I learned that a model's expected value (the average prediction) is nothing but the performance of the model when all input features are absent.
But I see that this expected value is calculated from a model that does use input features (whose values are not zero), yet it is interpreted as the model's baseline performance when X is not present.
I find this difficult to understand, because we compute mean(y_pred) from a model that we built using the X input features (and their values are not zero). You can refer to the code sample in this answer, where the average is computed as mean(y_pred). This is how the SHAP package does it as well.
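For concreteness, here is a minimal sketch of what I mean (the toy model, data, and background sample are my own for illustration, not from the linked answer): SHAP's expected value comes out as the mean prediction over a background dataset.

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression model, purely illustrative
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

background = X[:50]  # background dataset the explainer averages over

# KernelExplainer's expected_value is the mean prediction on the background
explainer = shap.KernelExplainer(model.predict, background)
print(explainer.expected_value)            # base value reported by SHAP
print(np.mean(model.predict(background)))  # the same number: mean(y_pred)
```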
So we are taking the average from a model that has non-zero values for all its predictors, and I see that SHAP calculates its average/expected values this way as well. Can you help me understand why it is done this way?
Best Answer
First of all, the SHAP paper does not talk about setting features to all zeros; it talks about what "would be predicted if we did not know any features", and that is not the same thing. The two coincide for linear regression, where if in
$$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon $$
you set $X_1 = X_2 = \dots = X_k = 0$, the only thing that remains is $\beta_0$, and with mean-centered features $\beta_0$ is exactly the average $y$ value, so the model would just predict the "average" $y$. This wouldn't necessarily hold for other models, and the SHAP paper acknowledges as much by noting that "most models cannot handle arbitrary patterns of missing input values".
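You can verify the linear-regression case numerically. A small sketch with made-up data: once the features are mean-centered, the fitted intercept is exactly the average of $y$, so predicting with all features set to zero returns the "no-features" prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=500)

Xc = X - X.mean(axis=0)  # mean-center the features
model = LinearRegression().fit(Xc, y)

print(model.intercept_)                 # beta_0
print(y.mean())                         # identical: beta_0 == mean(y)
print(model.predict(np.zeros((1, 3))))  # all features zero -> beta_0
```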
Why does averaging the predictions tell you about the performance of the model as if the features were unknown? The details are given in the paper, but for intuition, you can think of it as integrating out the features. If your model $f$ approximates the conditional expectation
$$ E[y|X] = f(X) $$
then by the law of total expectation, we know that
$$ E[y] = E[E[y|X]] = \int_X E[y|X] \,dP $$
So if you average over the features, you are left with the marginal prediction; averaging the features out is like not having used them at all. Of course, this is only an approximation, because the $X$'s you observed in the data are not necessarily representative of the whole distribution of $X$'s, but we make the assumption that our data is representative every time we use machine learning.
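You can also check this "integrating out" empirically: averaging a fitted model's predictions over the observed $X$'s approximately recovers the marginal mean of $y$. A sketch with an arbitrary nonlinear model (the data-generating process is made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=1000)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# The empirical average over the observed X stands in for the integral dP:
print(np.mean(model.predict(X)))  # approx E[E[y|X]]
print(np.mean(y))                 # approx E[y] -- the two nearly match
```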