Solved – Learning just a decoder (autoencoder without encoder)

autoencoders, latent-variable, likelihood, maximum likelihood, neural networks

I am trying to do something quite unusual: learning a latent representation of some data just by optimizing a decoder. Basically, a probabilistic model of a neural network autoencoder without the encoder.

My approach is to place a prior on the latent codes and then minimize the negative log-likelihood (plus the prior penalty) via SGD, jointly over the decoder weights and the latent codes. Ideally this will find the best weights and the best values for the latent variables.
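
To make this concrete, here is a minimal sketch of what I mean (PyTorch, with made-up dimensions and architecture): one free latent vector per training point, a standard-normal prior on the codes, and joint SGD over the codes and the decoder weights.

```python
import torch
import torch.nn as nn

n, d, d_obs = 1000, 2, 20          # number of points, latent dim, data dim (illustrative)
x = torch.randn(n, d_obs)          # placeholder for the actual data set

z = nn.Parameter(torch.randn(n, d) * 0.01)   # one free latent code per data point
decoder = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d_obs))

opt = torch.optim.SGD([z] + list(decoder.parameters()), lr=1e-2)

for epoch in range(200):
    opt.zero_grad()
    recon = decoder(z)
    nll = ((recon - x) ** 2).sum()      # Gaussian likelihood -> squared error
    prior = 0.5 * (z ** 2).sum()        # N(0, I) prior on the codes -> L2 penalty
    loss = nll + prior
    loss.backward()
    opt.step()
```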

The purpose is to learn a latent manifold in a simpler way than with an autoencoder.

I think this would work because the lower-dimensional latent representation, combined with a network with a small number of hidden nodes, will enforce "locality" (is this the correct term?), meaning that similar data points end up at nearby positions on the lower-dimensional manifold.

In my experiments the latent representations are updated mostly in the very early epochs (the first or second), while the weights of the neural network keep changing to a much larger degree even in later epochs.

The question is: do you have any suggestions on why this would or would not work? Any pointers to literature describing similar models and similar phenomena?

Best Answer

It's an interesting idea. Some miscellaneous thoughts:

If you allow arbitrary mappings from data space to representation, then it seems that your method is more flexible than autoencoders, and includes autoencoders as a special case (i.e. when the mapping exactly matches the encoder portion of an autoencoder). This means that, when an autoencoder can solve the problem, there exists a configuration of your method that can also solve the problem (but doesn't guarantee that this configuration is learnable). It might also be the case that some configurations of your method can solve problems that autoencoders cannot, with the same caveat.

If you have $n$ data points and your latent space has dimensionality $d$, then you'd have $nd$ parameters for the latent representations, plus the parameters of the decoder network. If your original data has dimensionality $d'$, an autoencoder would have $d d'$ parameters for the encoder, plus parameters for the decoder. In the case where you have many more data points than dimensions, your approach may be much more flexible than an autoencoder. This could be beneficial, but also means you might have to worry more about overfitting and possibly runtime. Unlike an autoencoder, it's not obvious that you could escape overfitting by adding more data, since the number of parameters scales with number of data points. It seems like constraints from the 'bottleneck'/decoder would have to be the thing that saves it.
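
For a concrete (made-up) illustration: with $n = 10^5$ data points, latent dimensionality $d = 10$, and data dimensionality $d' = 100$, the latent representations alone contribute $nd = 10^6$ parameters, whereas a single-layer linear encoder would contribute only $d d' = 10^3$; the gap grows linearly with the size of the data set.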

Your optimization problem will most likely be nonconvex, and SGD will converge to a local minimum. My gut feeling is that the results might be highly dependent on the initial configuration, particularly of the initial latent representations.

Compared to autoencoders, your method won't give a mapping from the data space to the latent space, so it can't directly perform out-of-sample generalization (although this may not be needed, depending on your goal). This is common to many nonlinear dimensionality reduction algorithms. If necessary, you could use some auxiliary method to learn such a mapping. It does mean you'd have to think about how to perform validation (e.g. how would you test your algorithm on a held-out data set?).
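
One possible sketch of such an auxiliary step (purely illustrative; the names and architecture are assumptions): freeze the learned decoder and optimize a single new latent code for the held-out point under the same objective.

```python
import torch

# Hypothetical sketch: embed a held-out point x_new (shape (1, d')) by
# freezing the learned decoder and optimizing only its latent code.
def embed_new_point(decoder, x_new, d=2, steps=500, lr=1e-2):
    for p in decoder.parameters():
        p.requires_grad_(False)                      # keep the decoder fixed
    z_new = torch.zeros(1, d, requires_grad=True)    # start at the prior mean
    opt = torch.optim.SGD([z_new], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon_err = ((decoder(z_new) - x_new) ** 2).sum()
        prior_pen = 0.5 * (z_new ** 2).sum()         # N(0, I) prior on the code
        (recon_err + prior_pen).backward()
        opt.step()
    return z_new.detach()
```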

There's certainly precedent for treating the latent representations of every data point as parameters, and directly optimizing them. Nonclassical multidimensional scaling (MDS) is one example. The objective function is nonconvex, and points are iteratively repositioned in the latent space to find a local minimum. It often helps to initialize with classical MDS (which is convex and can be solved in one shot using an eigendecomposition). Many nonlinear dimensionality reduction methods also treat the latent representations of every point as parameters.
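
For reference, a minimal NumPy sketch of classical MDS via an eigendecomposition, which could likewise serve as an initializer for the latent coordinates in your setting:

```python
import numpy as np

def classical_mds(X, k=2):
    """Embed the rows of X into k dimensions from their pairwise Euclidean distances."""
    n = X.shape[0]
    D2 = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # double-centered Gram matrix
    w, v = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]                # top-k eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```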
