Solved – R2 diverging from MSE in Keras

machine learning, model selection, neural networks, python

I'm currently performing variable selection for a neural network. My procedure is a variation on forward selection. I have two main questions about it:

1 – I started off fitting a model for each of the 39 available independent variables (IVs) and picked the IV with the highest R2. As the next step, I fitted a model for each of the 38 remaining IVs, using each one together with the IV picked in the previous step.

At first I was using adjusted R2, but I got stuck with a single variable and awful performance. So I switched to plain R2, on the assumption that R2 either increases when a new IV is added or keeps the same value (when the model sets the coefficients of the new IV to zero). I would stop adding IVs when the R2 either kept its previous value or failed to increase by an arbitrary threshold.
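In sketch form, the stopping rule looks like this (fit_and_score_cv is a hypothetical helper standing in for the per-fold fitting shown in the code further down; it would return the mean validation R2 for a given set of IVs):

threshold = 0.01  # arbitrary minimum improvement in R2
selected, best_r2 = [], -float("inf")
remaining = list(predictors)

while remaining:
    # score every candidate IV added on top of the current selection
    scores = {iv: fit_and_score_cv(selected + [iv]) for iv in remaining}
    best_iv = max(scores, key=scores.get)
    if scores[best_iv] - best_r2 < threshold:
        break  # no candidate improves R2 enough, so stop adding IVs
    selected.append(best_iv)
    remaining.remove(best_iv)
    best_r2 = scores[best_iv]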

My dataset is very small (120×30), so I had to use cross-validation to give the future model some robustness. I made sure to use the same folds at every new try with one more predictor, so that variability coming from the training data is ruled out.
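For context, the fixed folds can be built once and then reused for every candidate set of predictors, along these lines (scikit-learn's KFold assumed; illustrative rather than the exact code used):

from sklearn.model_selection import KFold

# build the 5 folds once with a fixed seed, so the (train, validation)
# index pairs are identical for every model that is tried
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [(train_idx, val_idx) for train_idx, val_idx in kf.split(input)]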

The problem is that when I moved from 4 IVs to 5, the R2 decreased, which got me wondering whether the statement that R2 always increases or stays the same when new IVs are added holds true for neural networks as well.

2 – When I repeated the same procedure, keeping the training data and all other parameters untouched, but tracking the usefulness of new predictors with MSE instead, I got roughly the same results, but the order of some predictors changed. Neural networks are non-deterministic processes, but given that everything else was fixed, shouldn't I get the same results?
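(Side note on determinism: the seeded initializers in the code below only fix the initial weights. Pinning the remaining global sources of randomness would look roughly like this, assuming TensorFlow 2.x; even then, GPU reductions can leave small run-to-run differences:)

import random
import numpy as np
import tensorflow as tf

# fix the global seeds so repeated runs start from the same state
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)  # TensorFlow 2.x API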

Many thanks in advance, and sorry for the giant question.

Additional info: Keras on Windows 10, running in a Jupyter notebook.

Code used for both questions:

# for instantiating a fresh model for every new predictor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def simple_model(hidden_layer, input_size):
    # one sigmoid hidden layer plus a linear output, with seeded initializers
    model = keras.Sequential([
        layers.Dense(hidden_layer, activation=tf.nn.sigmoid, input_shape=(input_size,),
                     kernel_initializer=keras.initializers.RandomNormal(seed=0),
                     bias_initializer=keras.initializers.RandomNormal(seed=0)),
        layers.Dense(1,
                     kernel_initializer=keras.initializers.RandomNormal(seed=0),
                     bias_initializer=keras.initializers.RandomNormal(seed=0))])

    model.compile(loss='mean_squared_error',
                  optimizer='sgd',
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
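
A quick usage check of the factory (hypothetical single-predictor case):

# fresh model with 5 hidden units, expecting one predictor column
model = simple_model(5, 1)
model.summary()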

# code for tracking the addition of a new IV by R2
# available is a dict of the form predictor: random_string_to_be_replaced_by_the_model
# folds is a list of 5 (train_indices, validation_indices) pairs
# (in the variable names, _treino = train and _teste = test)
from sklearn.metrics import r2_score

log_r2_predictor = []
for preditor in predictors:
    print("Instantiating model {}".format(preditor))
    r2t, r2val = 0, 0
    for i in folds:
        available[preditor] = simple_model(5, 1)
        available[preditor].fit(input.loc[i[0], preditor], output[i[0]],
                                epochs=2000, validation_split=0, verbose=0)
        s_treino = available[preditor].predict(input.loc[i[0], preditor])
        s_teste = available[preditor].predict(input.loc[i[1], preditor])
        r2t += r2_score(output[i[0]], s_treino)
        r2val += r2_score(output[i[1]], s_teste)
    # average over the 5 folds once per predictor, outside the fold loop
    log_r2_predictor.append([preditor, r2t/5, r2val/5])

When tracking with MSE, I simply replaced r2t and r2val with four accumulators (mae_treino, mse_treino, mae_val and mse_val, all initialized to zero) and the last lines of the fold loop with:

# the per-fold metrics need their own names; writing mae_treino += mae_treino
# would just double the current fold's value instead of summing across folds
_, mae_t, mse_t = available[preditor].evaluate(input.loc[i[0], preditor], output[i[0]], verbose=0)
mae_treino += mae_t
mse_treino += mse_t
_, mae_v, mse_v = available[preditor].evaluate(input.loc[i[1], preditor], output[i[1]], verbose=0)
mae_val += mae_v
mse_val += mse_v
# after the fold loop, as above:
log_mae_predictor.append([preditor, mae_treino/5, mse_treino/5, mae_val/5, mse_val/5])

Best Answer

  • Forward (or any other stepwise) selection is a bad method for choosing variables. Just don't. Regularization (e.g. ridge regression with the $\ell_2$ norm, the lasso with the $\ell_1$ norm, or dropout in neural networks) does the variable selection for you and actually works; see the lasso sketch after this list.
  • If you have just 120 samples, then a neural network is a poor choice of algorithm. You need something simpler and more robust, like linear regression.
  • It is not strange at all that with such a small sample your variable selection leaves you with a single variable, or just a few of them. Your sample is small, so a more complicated model would easily overfit; this is a perfectly reasonable result. If, as I understand, you compared a number of models and the single-variable one had the best relative performance while the results were still "awful", then maybe you simply don't have enough data to obtain better results.
  • $R^2 = 1 - \tfrac{\mathrm{MSE}}{\mathrm{Var}(Y)}$, so unless you change something about the target variable (e.g. its distribution, by resampling the data), they tell you the same thing: $R^2$ is just "normalized" MSE, as the numeric check below illustrates. Also be aware that $R^2$ is a pretty misleading measure of error for non-linear models. The whole "variance explained" interpretation does not apply in cases other than linear regression, and the statistic can go below zero, so it stops behaving like a share of $100\%$, which makes the units much less usable.
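To make the regularization suggestion concrete, a minimal sketch with scikit-learn's LassoCV (my assumption, not the asker's code; it standardizes first, since the $\ell_1$ penalty is scale-sensitive, and reuses the question's input/output names):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# standardize, then let cross-validated lasso shrink useless coefficients
# to exactly zero; the surviving coefficients are the "selected" variables
X = StandardScaler().fit_transform(input)
lasso = LassoCV(cv=5).fit(X, output)
selected = input.columns[np.abs(lasso.coef_) > 0]
print(selected)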
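And a quick numeric check that, on a fixed validation set, $R^2$ is just rescaled MSE (synthetic numbers, purely illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y_pred = np.array([2.5, 1.5, 3.5, 1.0, 4.5])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# the two printed values agree up to floating point, because Var(y_true)
# is a constant once the validation set is fixed
print(r2, 1 - mse / np.var(y_true))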