Autoencoders – Using a Variational AutoEncoder with an Inverse Bottleneck

Tags: autoencoders, variational inference

For a problem I'm dealing with, I'm trying to understand if my approach could make sense.

I'm using a Variational AutoEncoder (VAE) with relatively low-dimensional inputs, say $x \in \mathbb{R}^n$. Since my dimension $n$ is small, I'm trying to use the VAE by essentially inverting the roles of encoder and decoder (an inverse bottleneck): I create a high-dimensional latent representation $z \in \mathbb{R}^m$ with $m > n$ and then reduce its dimension again to obtain the reconstruction $\hat{x}$.

So, by means of neural networks, I'm essentially trying to learn a decoder-like map to the high-dimensional space, $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^m$, and then an encoder-like map back to the original space, $\psi : \mathbb{R}^m \rightarrow \mathbb{R}^n$.
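To make the setup concrete, here is a minimal sketch of what I have in mind (a PyTorch illustration of my own; the dimensions and layer sizes are placeholders, not taken from my actual problem):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InverseBottleneckVAE(nn.Module):
    """Sketch of an 'inverse-bottleneck' VAE: latent dimension m > input dimension n."""

    def __init__(self, n=4, m=32, hidden=64):
        super().__init__()
        # phi: R^n -> R^m (maps the low-dimensional input up to the wide latent space)
        self.enc = nn.Sequential(nn.Linear(n, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, m)
        self.logvar = nn.Linear(hidden, m)
        # psi: R^m -> R^n (maps the wide latent code back down to the input space)
        self.dec = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, n))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.dec(z), mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    # Standard VAE objective: reconstruction error + KL divergence to N(0, I)
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


# Dummy usage on a random batch
model = InverseBottleneckVAE(n=4, m=32)
x = torch.randn(8, 4)
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
```

The loss is the usual VAE objective; the only change from the standard setup is that $m > n$.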

This usage of a VAE is not common at all (it is normally used as a dimensionality reduction approach), but I'm still trying to understand whether my approach could be fundamentally wrong from a mathematical perspective.

I mean, in general it's reasonable to project features into a higher-dimensional space to enhance separation (as in kernel methods), so here we are just learning the best way to project the inputs into such a space.

Best Answer

You did not say what your actual final goal is. But in any case, what a VAE will learn in this setting, contrary to when a proper bottleneck is used, is simply (a noisy version of) the identity. And the (noisy) values $z \in \mathbb{R}^m$ might not be of much help either, since they can be essentially anything: there is no restriction on them. Think of any arbitrary mapping from $x \in \mathbb{R}^n$ to (a distribution in) $\mathbb{R}^m$; this mapping could, in principle, be obtained by this VAE. So I don't see that the latent values would give useful representations of your data.

However, adding some constraints might be worthwhile. E.g. people have, IIRC, tried to get better models via "anti-bottlenecks" by penalizing the denseness of the $z$-vectors (i.e. favoring sparsity), as sketched below. Or you could try to severely restrict the architecture of your network.
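For instance, a rough sketch of such a sparsity penalty (assuming a PyTorch-style setup like the one in the question, and that the sampled latent code $z$ is available; `sparsity_weight` is just a placeholder hyperparameter):

```python
import torch
import torch.nn.functional as F


def sparse_vae_loss(x, x_hat, mu, logvar, z, sparsity_weight=1e-3):
    # Standard VAE terms: reconstruction error + KL divergence to N(0, I)
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Extra L1 penalty on the latent code to discourage dense z-vectors
    sparsity = z.abs().sum()
    return recon + kl + sparsity_weight * sparsity
```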
