neural-networks – Training Set Increase and Its Impact on Accuracy in Neural Networks

adamneural networkspredictive-modelsrbf-network

I've ~16000 labeled data. I split it in ~8000 for training and ~8000 for testing of my RBF neural network, find the best hyperparameters (RMSE from 1.2 to 1.4*) and finally train the model on the whole set of ~16000 labeled data. When I now run these 16000 samples through the final model, I get way worse result (RMSE from 2.0 to 2.3*). How is this possible?

My RBF has 350 neurons in the input layer, 2 in the output one and two hidden layers with 50 and 30 neurons respectively. I use gaussian distribution for weight initialization, sigmoid as an activation functions in the hidden layers and RELU in the last one, Adam to optimize and L2 as loss function.

* depending on the random seed used for train-test split and weight init

EDIT: I use batches of 400 samples and 500 epochs. So I have 10 000 training iteration when I use 8k samples for training and 20 000 when I use the whole dataset.

Best Answer

There are some reasons that could lead to that happening yes, one of such reasons could be due to the increased variance in the data (more outliers or a distributional shift after incorporating the other half of the entire dataset), you might want to employ outlier detection to detect and remove those examples from your dataset.

Also, running the train/test once (a validation scheme usually called holdout) is often not a good measure of the actual performance of any model (it may lead to unstable models both in hyperparameters and accuracy), specially in non-deterministic training models like neural networks that are very sensitive to initialization (which may also be the culprit for your problem too).

Therefore I'd suggest you to try more robust validation schemes, as at least it will give you a good grasp of what is the actual performance of your model:

Holdout with 50% split may be not enough for your model to generalize properly. Usually when validating through holdout, it's recommended to use more data to train than to test, 80/20, 75/25 and even 60/40 splits are way more common.
Instead of holdout, use a proper k-cross-validation scheme, you can start with 10-fold cross-validation (divide data into 10 groups, then run the training from start using 1 of the 10 to test and the others to train, each run which group is the test is changed, then evaluate the average performance obtained of all runs), since i don't know the task i'm not able to really propose something tailored for your situation, you might need to do a nested or stratified cross-validation...

Once you're satisfied with your validation scheme, I'd devise an environment to test values obtained for each hyperparameters to investigate them as well.

Related Solutions

Solved – Neural net with hidden layer performing worse than without

Here are my thoughts on what could be going wrong:

Accuracy (what is being measured)

Perhaps your network is in fact doing well.

Let's consider binomial classification. If we have 50-50 distribution of labels, then 50% accuracy means the model is no better than chance (flipping a coin). If the Bernoulli distribution is 80%-20% and the accuracy is 50%, then the model is worse than chance.

No matter what I try, I'm not seeing better than 20% accuracy when I add a hidden layer.

If the accuracy is 20%, just negate the output and you have 80% accuracy, well done! (well at least for the binomial case).

Not so fast!

I believe that in your case the accuracy is misleading. This is a good read on the matter. For classification, the AUC (area under the curve) is often used. It's common to also examine the Receiver operating characteristic (ROC) and the confusion matrix.

For the multi-class case this becomes more tricky. Here is an answer that I found. Ultimately, this involves a strategy of 1-vs-rest or 1-vs-1 pairs, more on that here.

Pre-processing

Are the features scaled? Do they have the same bounds? e.g [0,1]
Have you tried standardizing the features? This renders each feature normally distributed with zero mean and unit variance.
Perhaps normalization might help? Dividing each input vector by it's norm places it on the unit circle (for L2 norm) and also bounds the features (but scaling should be performed first otherwise the larger numbers will spike).

Training

As to the learning rate and momentum, if you're not in a big hurry, I would just set a low learning rate and the algorithm will converge better (although slower). This is valid for stochastic gradient descent where examples are shown at random (are you shuffling the data?). From your code I can't figure out how this happens. Are you going one pass only through the training data? For SGD, multiple iterations are made. Perhaps try smaller batches? Have you tried weight decay as a regularization method?

Architecture

Cross-entropy as loss function: check. Softmax at outputs: check.

Might be a longshot at this point but have you tried projection to a higher dimension in the first hidden layer then collapsing to a lower space in the next one two hidden layers?

There is also the cost in your output, I wonder if it could be scaled to make more sense. I would try to plot the evolution of the cost (log loss here) and see if it fluctuates or how steep it is. Your network might be stuck in a local minima plateau. Or it might be doing very well in which case double check the metric?

Hope this helped or generated some new ideas.

EDIT:

Example of how normalization (L2) can make things worse when features are not scaled relative to the other features. Plots for one sample:

In the left image the blue line is a vector of 10 values generated randomly with a mean zero and std of 1. In the right image I added an 'outlier' or out of scale feature no.6 where I set its value to 10. Clearly out of scale. When we normalize the out of scale vector, all other features become very close to 0 as it can be seen in the orange line on the right.

Standardizing the data might be a good thing to do before anything else in this case. Try plotting some histograms of the features or box plots.

You mentioned you are normalizing the vectors to sum up to 1 and now it works better with 10. That means you are dividing by the 1-norm = sum(abs(X)) instead of the 2-norm (Euclidean) = sum(abs(X).^2)^(1/2). The L1 normalization generates sparser vectors, look at the figure below, where each axis is one feature, so this is a two dimensional space, however it can be generalized to an arbitrary number of dimensions.

Normalizing effectively places each vector on the edge of either shape. For L1 it will lie on the diamond somewhere. For L2 on the circle. When it hits the axis it is zero.

Solved – Training Autoencoder with Softmax Layer

I don't know why this was downvoted, but I figured out the answer though it may be obvious.

The training set is used to train one compression/encoder layer by learning to approximate itself using the training set.

Once this is done, the weights / layer that is responsible for the encoding part is saved and paired with a classification layer (e.g. softmax layer) to learn a supervised classifier. This is done by using the same training set as before and fitting them with labels / classes of this training set that weren't used previously.

After the classifier is trained, it can be used to make predictions or check performance using the test set.

For example, if you already had an autoencoder trained and wanted to use the encoding layer with a softmax layer, you could do the following with keras:

# For a single-input model with 10 classes (categorical classification):

model = Sequential()
model.add(autoencoder.layers[1])
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Convert labels to categorical one-hot encoding
one_hot_labels = keras.utils.to_categorical(y_train, num_classes=10)

# Train the model, iterating on the data in batches of 32 samples
model.fit(x_train, one_hot_labels, epochs=10, batch_size=32)

# Overall F1 score
f1_score(y_test, np.argmax(model.predict(x_test), axis=1), average='macro')

In the stacked autoencoder case, the procedure is the same except with more encoding layers. Discussion about this using keras here and here.

Best Answer

Related Solutions

Solved – Neural net with hidden layer performing worse than without

Solved – Training Autoencoder with Softmax Layer

Related Question