Solved – Why can’t this autoencoder reach zero loss

autoencoders, keras, neural-networks, tensorflow

I'm building an autoencoder and was wondering why the loss didn't converge to zero after 500 iterations. So I created this "illustrative" autoencoder with an encoding dimension equal to the input dimension. To make sure there was nothing wrong with the data, I created a random array of shape (30000, 100) and fed it as both input and output (x = y). The NN is just supposed to learn to keep the inputs as they are. So why doesn't it reach zero loss?

import numpy
from keras.layers import Input, Dense
from keras.models import Model

EPOCHS = 500      # placeholder: the question mentions "500 iterations"
BATCH_SIZE = 256  # placeholder: not specified in the original snippet

x_rand = numpy.random.rand(30000, 100)
# this is the size of our encoded representations
encoding_dim = 100

inputs = Input(shape=x_rand.shape[1:])

encoded = Dense(100, activation='relu')(inputs)
encoded = Dense(100, activation='relu')(encoded)

encoded = Dense(encoding_dim, activation='relu')(encoded)

decoded = Dense(100, activation='relu')(encoded)
decoded = Dense(100, activation='relu')(decoded)
decoded = Dense(x_rand.shape[-1], activation='sigmoid')(decoded)

# this model maps an input to its reconstruction
autoencoder = Model(inputs, decoded)

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
history = autoencoder.fit(x_rand, x_rand, epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=2)

Best Answer

To succinctly answer the titular question: "This autoencoder can't reach 0 loss because there is a poor match between the inputs and the loss function. Training the same model on the same data with a different loss function, or training a slightly modified model on different data with the same loss function achieves zero loss very quickly."

Simplify

Whenever I find puzzling behavior, I find it's helpful to strip it down to the most basic problem and solve that problem. You've started that process with your toy model, but I believe the model can be simplified even further.

The simplest version of this problem is a network with identity activations and a single linear layer each for the encoder and the decoder; composed, this is just a linear model. The encoder is a linear transformation (weight matrix and bias vector) and the decoder is another linear transformation (weight matrix and bias vector).

However, all of these models retain the property that there is no bottleneck: the embedding dimension is as large as the input dimension.

Linear model, $x \sim \mathcal{U}(0,1)$

So this dirt-simple model looks like

$$ \hat{x} = W_\text{dec}(W_\text{enc}x + b_\text{enc})+b_\text{dec} $$

Using the following configuration, this model converges to a training loss less than $10^{-5}$ in fewer than 450 iterations (a minimal PyTorch sketch follows the list):

  • Adam optimizer with learning rate $10^{-5}$
  • 30,000 samples of two features
  • minibatch size of 128
  • MSE loss function
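
A minimal PyTorch sketch of this configuration (my own sketch, not the original experiment code; dimensions and hyperparameters follow the list above, and the two-epoch loop is an assumption that roughly matches 450 iterations at batch size 128):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N_FEATURES = 2  # "two features", per the list above

# Linear encoder and decoder, no activations:
#   x_hat = W_dec (W_enc x + b_enc) + b_dec
linear_ae = nn.Sequential(
  nn.Linear(N_FEATURES, N_FEATURES),  # encoder
  nn.Linear(N_FEATURES, N_FEATURES),  # decoder
)

x = torch.rand(30000, N_FEATURES)  # x ~ U(0, 1)
queue = DataLoader(TensorDataset(x), batch_size=128, shuffle=True)
optimizer = torch.optim.Adam(linear_ae.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

for epoch in range(2):  # roughly 470 minibatches in total
  for (x_batch,) in queue:
    optimizer.zero_grad()
    loss = loss_fn(linear_ae(x_batch), x_batch)
    loss.backward()
    optimizer.step()
  print("epoch %d MSE %.6f" % (epoch, loss.item()))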

Sigmoid, $x \sim \mathcal{U}(0,1)$

Using a sigmoid activation in the final layer and BCE loss does not seem to work as well; a sketch of the change follows the configuration list below. The sigmoid model has the form $$ \hat{x} = \sigma\left(W_\text{dec}(W_\text{enc}x + b_\text{enc})+b_\text{dec}\right) $$

  • Adam optimizer with learning rate $10^{-4}$
  • 30,000 samples of two features
  • minibatch size of 128
  • BCE loss function
  • sigmoid activation function
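
The only changes relative to the linear sketch are a sigmoid on the output and BCE in place of MSE (again my own sketch of the configuration above):

import torch.nn as nn

N_FEATURES = 2

# Same two linear layers, with a sigmoid squashing the reconstruction into (0, 1)
sigmoid_ae = nn.Sequential(
  nn.Linear(N_FEATURES, N_FEATURES),  # encoder
  nn.Linear(N_FEATURES, N_FEATURES),  # decoder
  nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # targets are the uniform inputs themselves
# training loop as in the linear sketch, with lr=1e-4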

I think this model doesn't work well with the source data because the targets are uniform on $[0,1]$ instead of being concentrated at 0 and 1. (Recall that one way to justify the use of the log-loss function is that it naturally arises from the Bernoulli likelihood.) For a target $t \in (0,1)$, the cross-entropy is minimized by predicting exactly $t$, where it equals the binary entropy of $t$; averaged over targets uniform on $[0,1]$, this floor is $0.5$ nats. I've tried many variations on learning rate and model complexity, but this model with this data does not achieve a loss below about 0.5, consistent with that floor.
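
A quick numerical check of that floor (my own sanity check, not part of the original experiments):

import torch
import torch.nn.functional as F

# Even a perfect prediction p = t leaves the binary entropy of t as loss;
# averaged over t ~ U(0, 1) this is 0.5 nats.
t = torch.rand(1_000_000).clamp(1e-6, 1 - 1e-6)
print(F.binary_cross_entropy(t, t).item())  # prints approximately 0.5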

We can test my hypothesis by attempting to estimate the same model using randomly-generated binary inputs.

Sigmoid, $x \sim \text{Bernoulli}(0.5)$

However, if we change the way the data is constructed to be random binary values, then using BCE loss with the sigmoid activation does converge.

  • Adam optimizer with learning rate $10^{-4}$
  • 30,000 samples of two features
  • minibatch size of 128
  • BCE loss function
  • sigmoid activation in the final layer

This model achieves low loss very quickly. If you want to press for extremely small loss values, my advice is to compute loss on the logit scale to avoid roundoff issues.
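
In PyTorch, one way to do this is nn.BCEWithLogitsLoss, which folds the sigmoid into the loss; in the script at the end of this answer, that corresponds to leaving the nn.Sigmoid() line commented out and using the logit-scale loss instead of nn.BCELoss. A small sketch of the difference:

import torch
import torch.nn as nn

logits = 10 * torch.randn(128, 10)                 # stand-in for the decoder's final linear output
targets = torch.randint(0, 2, (128, 10)).float()   # Bernoulli(0.5) targets

loss_prob = nn.BCELoss()(torch.sigmoid(logits), targets)  # sigmoid then log can saturate/round off
loss_logit = nn.BCEWithLogitsLoss()(logits, targets)      # fused and numerically stabler
print(loss_prob.item(), loss_logit.item())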

Complicate

Now that we have a hypothesis about how the model behaves when it is dirt-simple and cheap to estimate, we can increase its complexity and test whether the hypothesis developed from the simpler models still holds for more complex ones.

I've conducted experiments with deeper models and nonlinear activations (leaky ReLU), repeating the same experimental design used for the simple models: vary the choice of loss function and compare alternative distributions of input data. In these experiments with larger, nonlinear models, I find that it's best to match MSE to continuous-valued inputs and log-loss to binary-valued inputs. The script at the end of this answer implements this design; the commented-out lines switch between the loss functions, data distributions, and output activations.

Bottlenecks

If we desire to train a model with a bottleneck encoder/decoder structure, that is, a model where the output of the encoder has smaller dimension than the input, we must consider whether our source data is structured so as to make such compression possible. All of our experiments so far have used iid random values, which are the least compressible data because, by construction, the values of one feature carry no information about the values of any other feature.

Alternatively, suppose the input data were completely redundant, so that one example might be $[1,1,1,1]$, another $[2,2,2,2]$, and another $[-1.5, -1.5, -1.5, -1.5]$. A bottleneck network would fit this easily, since three of the four columns are entirely redundant. This kind of source data is far more amenable to a bottleneck autoencoder.
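
For instance (my own illustration), such data can be built by repeating a single random column, and the ToyAutoEncoder defined in the script below can then be trained with a genuine bottleneck:

import torch

latent = torch.rand(30000, 1)        # one true degree of freedom per example
x_redundant = latent.repeat(1, 4)    # rows like [z, z, z, z]
# e.g. ToyAutoEncoder(n_features=4, encoding_dim=1) from the script below, trained with MSE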


from __future__ import division

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class ToyAutoEncoder(nn.Module):
  def __init__(self, n_features, encoding_dim):
    super(ToyAutoEncoder, self).__init__()
    self.encoder = nn.Sequential(
      nn.Linear(n_features, n_features),
      nn.LeakyReLU(0.1),
      nn.Linear(n_features, encoding_dim),
    )

    self.decoder = nn.Sequential(
      nn.Linear(encoding_dim, n_features),
      nn.LeakyReLU(0.1),
      nn.Linear(n_features, n_features),
      # nn.Sigmoid()
    )

  def forward(self, x):
    x_enc = self.encoder(x)
    x_dec = self.decoder(x_enc)
    return x_dec


def fit(model, data_queue, optimizer, n_epoch, print_freq, tol=1e-5):
  # loss_fn = nn.BCELoss(reduction="mean")
  loss_fn = nn.MSELoss(reduction="mean")
  for epoch_ndx in range(n_epoch):
    model.train()
    loss_epoch = np.zeros(len(data_queue))  # per-batch losses for this epoch
    loss_recent = np.zeros(print_freq)
    for batch_ndx, (x_batch,) in enumerate(data_queue):
      optimizer.zero_grad()
      x_dec = model(x_batch)
      loss_tensor = loss_fn(input=x_dec, target=x_batch)
      loss_tensor.backward()
      optimizer.step()
      loss_np = loss_tensor.detach().numpy()
      loss_epoch[batch_ndx] = loss_np
      loss_recent[batch_ndx % print_freq] = loss_np
      if batch_ndx % print_freq == 0 and batch_ndx > 0:
        msg_data = (epoch_ndx, batch_ndx, loss_recent.mean(), loss_epoch[:batch_ndx].mean())
        print("Epoch %d Batch %d Recent loss %.4f Cumulative Loss %.4f" % msg_data)
    if loss_epoch.mean() < tol:
      print("Convergence achieved: %s" % loss_epoch.mean())
      break


if __name__ == "__main__":
  N_FEATURES = ENCODING_DIM = 10
  N_EPOCH = 1024
  PRINT_FREQ = 32
  autoencoder = ToyAutoEncoder(n_features=N_FEATURES, encoding_dim=ENCODING_DIM, )

  x_rand = torch.rand((30000, N_FEATURES))
  # For Bernoulli(0.5) inputs, uncomment the line below; the clamp then maps {-1, +1} to {0, 1}
  # x_rand = (torch.rand((30000, N_FEATURES)) - 0.5).sign()
  x_rand = x_rand.clamp(min=0.0)  # no-op for uniform data; only needed for the sign() variant

  optim = torch.optim.Adam(lr=1e-4, params=autoencoder.parameters(),
                           # weight_decay=1e-6,
                           )
  # optim = torch.optim.SGD(lr=1e-3, params=autoencoder.parameters(), momentum=0.9)

  queue = DataLoader(TensorDataset(x_rand), batch_size=128, shuffle=True)
  fit(model=autoencoder, data_queue=queue, optimizer=optim, n_epoch=N_EPOCH, print_freq=PRINT_FREQ)