As far as I understand, the idea of data-incremental learning is to keep the model always up to date. Suppose we trained a model for user recognition from voice: the input is a user's voice and the output is the user's label (user 1, 2, …). After some time (years, say), the input distribution of the users may have changed, so we need to adapt our base model.
This seems to me similar to stochastic gradient descent in deep learning, where we use a single data point at a time to update the model parameters.
However, my question is: in order to update the model with new test data, don't we need the labels of that data? How can this be possible in real-world scenarios?
Edit 1: An idea came to my mind; maybe the solution is the following (I am not sure at all).
Suppose we have a deep model trained on users 1, 2 and 3, and a new input arrives. Our model predicts it as user 2 (the class with the highest softmax probability). Then, in the loss used for backpropagation, we take that predicted label as the ground truth for the new data point (whose true label we do not actually have). So, suppose the softmax outputs:
0.1 for belonging in class of user 1
0.7 for belonging in class of user 2
0.2 for belonging in class of user 3
User 2 has the highest probability, so we take the one-hot "true" labels for the incoming data as:
0 for class of user 1
1 for class of user 2
0 for class of user 3
Therefore the cross-entropy loss is calculated as follows:
Loss = - (0*ln(0.1) + 1*ln(0.7) + 0*ln(0.2)) = -ln(0.7) ≈ 0.357
Then we backpropagate this error through the network.
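The pseudo-labelling step described above can be sketched as follows (a minimal NumPy illustration of the idea, not an actual training loop; the softmax probabilities are taken from the example above):

```python
import numpy as np

# Softmax output of the (hypothetical) trained model for one new, unlabelled input:
# P(user 1), P(user 2), P(user 3).
probs = np.array([0.1, 0.7, 0.2])

# Use the model's own prediction as the "ground truth": a one-hot pseudo-label
# on the argmax class.
pseudo_label = np.zeros_like(probs)
pseudo_label[np.argmax(probs)] = 1.0  # -> [0, 1, 0]

# Cross-entropy between the pseudo-label and the softmax output.
loss = -np.sum(pseudo_label * np.log(probs))  # = -ln(0.7) ≈ 0.357
print(round(loss, 3))
```

In a real deep-learning setting this loss would then be backpropagated, exactly as with a genuine label.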
I need verification: is this what is actually done in data-incremental learning in practice?
Best Answer
Say that you observed three datapoints (0, 1, 1) and your model is trivial: you are just estimating the "probability of success" of a Bernoulli distribution by maximum likelihood. Given your initial sample, the estimate is 2/3, so you "label" the next outcome as one, and the new estimate is 3/4. After gathering 100 more samples this way, the estimated probability is 103/104. It's a trivial example, but it shows how your procedure makes the model collapse onto its own predictions: it becomes more and more certain of them regardless of whether they are correct. If you want to update your model with new data, you need to observe the labels as well.
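The Bernoulli example above can be run in a few lines (a Python sketch; the numbers match the 2/3 → 3/4 → 103/104 sequence in the text):

```python
import numpy as np

# Start from the three observed datapoints; the ML estimate of P(success) is 2/3.
data = [0, 1, 1]
p = np.mean(data)

# Self-labelling loop: each new observation gets the label the model itself
# predicts (1 whenever p > 0.5), and the estimate is updated on that pseudo-label.
for _ in range(101):  # one point -> 3/4, then 100 more -> 103/104
    data.append(1 if p > 0.5 else 0)
    p = np.mean(data)

print(p)  # 103/104 ≈ 0.990: ever more certain, having learned nothing new
```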
For a less trivial example, say that your true function is $y = x^2$ (red line). You observed three points from this function and fitted a linear regression to them (the predictions with their prediction intervals are shown as dotted curves).
Next, you observe five unlabelled points at x = c(-2, -1, 0, 1, 2) (blue points) and use the model's predictions as their labels. As you can see below, the standard errors of the parameters and the prediction intervals shrink to near zero. The model's predictions are the same as before, because it "observed" data that was exactly what it would itself predict. The model didn't learn anything new; it just echoed and amplified its own predictions.