Neural-Networks – About Update Procedure in Data Incremental Learning

learning, neural networks, stochastic gradient descent

As far as I understand, the idea of data-incremental learning is to keep the model always up to date. Suppose we trained a model for user recognition using voice as input: the input is a user's voice and the output is the user's label (user 1, 2, …). After a certain time (say, years), the input distribution of the users could change, so we need to adapt our base model.

The idea seems to me to be similar to stochastic gradient learning (in deep learning), where we use only one data point at a time to update the model parameters.
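For concreteness, a single-sample update for a simple logistic model would look roughly like this in R (hypothetical numbers; note that the update still needs the label of the data point):

# One SGD step on a single data point (hypothetical values), for a logistic model.
w <- c(0.1, -0.2); b <- 0                  # current model parameters
x1 <- c(1.5, 0.3); y1 <- 1                 # one new data point *with* its label
lr <- 0.01                                 # learning rate
p <- 1 / (1 + exp(-(sum(w * x1) + b)))     # predicted probability
w <- w - lr * (p - y1) * x1                # log-loss gradient step for the weights
b <- b - lr * (p - y1)                     # log-loss gradient step for the bias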

However, my question is: in order to update the model with new test data, don't we have to have the labels of that data? How can this be possible in real-world scenarios?

Edit 1: An idea came to my mind; maybe the solution is something like the following (I am not sure at all)?

Suppose that we have a deep model trained on users 1, 2, and 3. Then a new input arrives, and our model predicts it as user 2 (the class with the highest probability, e.g. from the softmax output of a deep network). Therefore, during backpropagation, we use this predicted label as the ground truth for the newly arrived data (whose true label we do not actually have). So, suppose that the softmax outputs:

0.1 for belonging in class of user 1
0.7 for belonging in class of user 2
0.2 for belonging in class of user 3

User 2 has the highest probability, so we take the (one-hot) true labels for that incoming data to be:

0 for class of user 1
1 for class of user 2
0 for class of user 3

Therefore the cross-entropy loss is calculated as follows:

$$\text{Loss} = -\big(0\cdot\ln(0.1) + 1\cdot\ln(0.7) + 0\cdot\ln(0.2)\big) = -\ln(0.7) \approx 0.357$$

Then we backpropagate this error throughout the network.
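In R, my idea would be something like this (values hard-coded from the example above):

# Self-labelling step with the softmax values from above.
probs  <- c(0.1, 0.7, 0.2)                                  # softmax output for users 1, 2, 3
pseudo <- as.numeric(seq_along(probs) == which.max(probs))  # one-hot pseudo-label: 0 1 0
loss   <- -sum(pseudo * log(probs))                         # cross-entropy against the pseudo-label
loss                                                        # -log(0.7), about 0.357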

I need verification: is this what is actually done in data-incremental learning processes in real life?

Best Answer

Say that you observed three data points (0, 1, 1) and your model is trivial: you are just estimating the "probability of success" of a Bernoulli distribution by maximum likelihood. Given your initial sample, the probability is 2/3, so you "label" the next outcome as one, and the new probability becomes 3/4. After gathering 100 more samples this way, the estimated probability is 103/104. It's just a trivial example, but it shows how your procedure makes the model collapse onto its own predictions: it would become more and more certain of them regardless of whether they are correct or not. If you want to update your model with new data, you need to observe the labels as well.
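Here is the same toy calculation in R, just to spell it out:

# Self-labelling a Bernoulli maximum-likelihood model with its own predictions.
obs <- c(0, 1, 1)
mean(obs)                                             # ML estimate: 2/3
obs <- c(obs, as.numeric(mean(obs) > 0.5))            # "label" the next outcome with the prediction
mean(obs)                                             # now 3/4
obs <- c(obs, rep(as.numeric(mean(obs) > 0.5), 100))  # 100 more self-labelled samples
mean(obs)                                             # 103/104, no matter what the true probability is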

For a less trivial example, say that your true function is $y = x^2$ (red line), you observed three points from this function, and you fitted a linear regression to those points (the predictions and prediction intervals are shown as dotted curves).

[Figure: the true function $y = x^2$ (red line), the linear model fitted to the three points, and the predicted line with its prediction intervals (dotted).]

> x <- c(0.8, 1.3, 1.4)
> y <- x^2
> fit <- lm(y~x)
> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
        1         2         3 
 0.004839 -0.029032  0.024194 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.09903    0.10022  -10.97   0.0579 .
x            2.16774    0.08381   25.86   0.0246 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0381 on 1 degrees of freedom
Multiple R-squared:  0.9985,    Adjusted R-squared:  0.997 
F-statistic:   669 on 1 and 1 DF,  p-value: 0.0246

Next, you observe five points without labels, c(-2, -1, 0, 1, 2) (blue points), and use the model's predictions as their labels. As you can see below, the standard errors of the parameters and the prediction intervals shrank to near zero. The predictions made by the model are the same as before, because the model "observed" exactly the data it would have predicted anyway. The model didn't learn anything new; it just echoed and amplified its own predictions.

[Figure: the same plot for the augmented data; the prediction intervals have shrunk to near zero.]

> # grid and pred are assumed to have been created along these lines:
> grid <- seq(-4, 4, length.out=100)
> pred <- predict(fit, newdata=data.frame(x=grid), interval="prediction")
> 
> curve(x^2, from=-4, to=4, ylab="", xlab="", col="red", ylim=c(-5, 15))
> points(x, y)
> lines(grid, pred[,'fit'], lty=2)
> lines(grid, pred[,'lwr'], lty=3)
> lines(grid, pred[,'upr'], lty=3)
> 
> x.2 <- c(x, c(-2, -1, 0, 1, 2))
> y.2 <- c(y, predict(fit, newdata=data.frame(x=c(-2, -1, 0, 1, 2))))
> fit.2 <- lm(y~x, data=data.frame(x=x.2, y=y.2))
> summary(fit.2)

Call:
lm(formula = y ~ x, data = data.frame(x = x.2, y = y.2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.02903  0.00000  0.00000  0.00121  0.02419 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.099032   0.005820  -188.8 1.49e-12 ***
x            2.167742   0.004355   497.8 4.44e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01555 on 6 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.478e+05 on 1 and 6 DF,  p-value: 4.435e-15
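
To make the point explicit, you can check that the two fits give the same point predictions; only the reported uncertainty changed:

# Sanity check: same point predictions from both fits, only the uncertainty shrank.
new <- data.frame(x=c(-2, 0, 2))
predict(fit, new)     # predictions from the original fit
predict(fit.2, new)   # the same predictions from the "updated" fit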