I am using an MLP neural network. My question is about splitting the data for training and testing: is there a rule that I always have to use 70% of the data for training and 30% for testing? When I did this, my accuracy was worse than when I split it into 10% for training and 90% for testing, which gave me higher accuracy… Is this valid?
Solved – how much data should we choose for training and testing a neural network
machine learning, neural networks
Related Solutions
In order to figure out whether or not more data will be helpful, you should compare the performance of your algorithm on the training data (i.e. the data used to train the neural network) to its performance on testing data (i.e. data the neural network did not "see" in training).
A good thing to check would be the error (or accuracy) on each set as a function of iteration number. There are two possibilities for the outcome of this:
1) The training error converges to a value significantly lower than the testing error. If this is the case, the performance of your algorithm will almost certainly improve with more data.
2) The training error and the testing error converge to about the same value (with the training error still probably being slightly lower than the testing error). In this case additional data by itself will not help your algorithm. If you need better performance than you are getting at this point, you should try either adding more neurons to your hidden layers, or adding more hidden layers. If enough hidden units are added, you will find your testing error will become noticeably higher than the training error, and more data will help at that point.
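The diagnostic above can be sketched in code: train an MLP one epoch at a time and record the training and testing error at each iteration, then compare the gap between the two curves. This is an illustrative sketch, not the original author's code; the dataset, network size, and epoch count are all assumptions made for the example.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# max_iter=1 runs a single epoch per .fit() call; warnings about
# non-convergence are expected with this setup, so silence them.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic data purely for demonstration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# warm_start=True makes each .fit() call continue from the previous weights
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1,
                    warm_start=True, random_state=0)

train_err, test_err = [], []
for epoch in range(50):
    mlp.fit(X_train, y_train)
    train_err.append(1 - mlp.score(X_train, y_train))
    test_err.append(1 - mlp.score(X_test, y_test))

# Case 1: training error converges well below testing error -> more data helps.
# Case 2: both curves converge close together -> add capacity, not data.
gap = test_err[-1] - train_err[-1]
print(f"final train error={train_err[-1]:.3f}, "
      f"test error={test_err[-1]:.3f}, gap={gap:.3f}")
```

Plotting `train_err` and `test_err` against the epoch number gives the learning curves described above; a large persistent gap is the signature of case 1.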
For a more thorough and helpful introduction to how to make these decisions, I highly recommend Andrew Ng's Coursera course, particularly the "Evaluating a learning algorithm" and "Bias vs. Variance" lessons.
There are multiple loss functions you can use:
- MSE, aka Mean Squared Error: take all the errors, square them, and find the mean.
- RMSE, aka Root Mean Squared Error: the square root of the MSE.
- SSE, aka Sum of Squared Errors: take all the errors, square them, and compute their sum.
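The three metrics above differ only in how the squared errors are aggregated, which a few lines of NumPy make concrete. The example values are made up for illustration:

```python
import numpy as np

# Hypothetical true values and model predictions
y_true = np.array([3.0, 2.5, 4.0, 5.1])
y_pred = np.array([2.8, 2.7, 3.9, 5.0])

errors = y_true - y_pred
sse = np.sum(errors ** 2)    # Sum of Squared Errors: total squared error
mse = np.mean(errors ** 2)   # Mean Squared Error: average squared error
rmse = np.sqrt(mse)          # Root Mean Squared Error: back on the data's scale

print(sse, mse, rmse)
```

Note that SSE and MSE are in squared units of the dependent variable, while RMSE is back on the original scale, which is why it is the easiest of the three to interpret directly.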
What your MSE value of 0.01026 is telling you is that the squared errors are, on average, 0.01026 (in squared units of your dependent variable).
What the SSE tells you instead is the sum of the squares of all your errors (a kind of 'total amount of inaccuracy').
If you find these interpretations troublesome, you can take the RMSE, which tells you how far your predictions are, on average, from the true (test) values. This is, in my opinion, better than MSE, since RMSE is a mean computed on the same scale as your dependent variable.
Whether the values are good or not is not something you can infer from these coefficients alone. The scores make more sense when you compare different models: by looking at the error coefficients, you can determine whether one model is better than another at explaining the same dependent variable.
Best Answer
I am a little bit confused: how is it possible that 10% of the data for training and 90% for testing gives higher accuracy than 70% for training and 30% for testing? From my experience with MLP ANNs and from my previous research, this is not typical. In many papers I have seen, most researchers use a 50/50 split for training and testing. I have used various combinations myself; for example, I have used a 9-fold cross-validation scheme where 2/9 of the data is used for training and 7/9 for testing. In my opinion, the train/test percentages should match natural segments of the data (for example, if you have 15 subjects, use the samples from 10 subjects for training and the samples from the other 5 for testing). There is no fixed rule for splitting the data.
I hope this helps.
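The subject-wise split suggested in the answer (train on some subjects, test entirely on others, so no subject appears in both sets) can be sketched with scikit-learn's `GroupShuffleSplit`. The 15 subjects and the data itself are hypothetical, matching the answer's example:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 15 hypothetical subjects, 4 samples each
rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(15), 4)           # subject ID for every sample
X = rng.normal(size=(len(subjects), 8))
y = rng.integers(0, 2, size=len(subjects))

# Hold out 5 of the 15 subjects entirely for testing, as in the answer
gss = GroupShuffleSplit(n_splits=1, test_size=5, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=subjects))

train_subjects = set(subjects[train_idx])
test_subjects = set(subjects[test_idx])

# No subject leaks across the two sets
assert train_subjects.isdisjoint(test_subjects)
print(len(train_subjects), len(test_subjects))   # 10 training, 5 testing subjects
```

Splitting by subject rather than by raw sample avoids the optimistic bias that appears when correlated samples from the same subject land in both the training and testing sets.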