Solved – How to achieve variational autoencoder (VAE) with unrestricted input

autoencoders, kullback-leibler, log-loss, loss-functions

For a normal VAE, the input and the reconstruction are expected to have values in the range $[0, 1]$.
This is necessary because the log loss only makes sense in that range.
If the input is not within $[0, 1]$, it is common to normalize it using data statistics (for example, mean-std normalization followed by a logistic function).
That is not what I want. I actually want the VAE to have an unrestricted input and reconstruction.

How should the reconstruction loss and the Kullback-Leibler (KL) divergence be formulated in this unrestricted setting?
The code $z$ is sampled from a normal distribution parameterized by the mean $m$ and standard deviation $s$ produced by a dense neural network layer.
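In practice this sampling step is usually implemented with the reparameterization trick, so that gradients can flow through $m$ and $s$. A minimal TensorFlow 1.x sketch, assuming `mean` and `std` are the (illustrative) encoder outputs:

# Reparameterization trick: z = mean + std * eps with eps ~ N(0, I).
# `mean` and `std` are assumed encoder outputs of shape [batch, code_size].
eps = tf.random_normal(tf.shape(std))
z = mean + std * eps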


My current thoughts:

The KL divergence is supposed to push the mean $m$ to $0$ and the standard deviation $s$ to $1$:

$m^2 + s^2 - \log(s^2) - 1$
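For reference, this expression is the closed-form KL divergence between $\mathcal{N}(m, s^2)$ and $\mathcal{N}(0, 1)$, up to the conventional factor of $\frac{1}{2}$ and a sum over the code dimensions. A minimal NumPy sketch (illustrative only):

import numpy as np

def kl_to_standard_normal(m, s):
    # KL( N(m, s^2) || N(0, 1) ), summed over the code dimensions
    return 0.5 * np.sum(m ** 2 + s ** 2 - np.log(s ** 2) - 1.0)

print(kl_to_standard_normal(np.array([0.3, -1.2]), np.array([0.8, 1.5])))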


But that doesn't seem to make much sense if, as a result, the mean $m$ and std $s$ of the VAE encoding are not in the range $[0, 1]$.
In my current setup the range is more like $[-9, 8]$ for the mean and $[-4, -2]$ for the std.


So how should I handle the KL divergence?

Will I need to restrict the mean and std values? If so, what are the effects and implications of that, and how much would it actually restrict the VAE?
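Purely to illustrate what such a restriction could look like (the answer below argues it is usually unnecessary), one could bound the mean with a tanh and force the scale to be positive with a softplus; a hedged TensorFlow 1.x sketch, reusing `net`, `code_size` and `tfd` from the model below:

# Hypothetical restriction of the posterior parameters:
# tanh keeps the mean in (-1, 1), softplus keeps the scale strictly positive.
bounded_mean = tf.tanh(net[:, :code_size])
positive_scale = tf.nn.softplus(net[:, code_size:])
posterior = tfd.MultivariateNormalDiag(bounded_mean, positive_scale)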


The reconstruction loss in a normal VAE punishes differences between the input $x$ and the reconstruction $y$, particularly towards the lower and upper ends of the $[0, 1]$ range:

$-\left(x \log(y) + (1 - x) \log(1 - y)\right)$
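Written out directly, this log loss (binary cross-entropy) assumes targets and reconstructions in $[0, 1]$; a minimal NumPy sketch that clips $y$ away from $0$ and $1$ for numerical stability (illustrative only):

import numpy as np

def log_loss(x, y, eps=1e-7):
    # Binary cross-entropy between target x and reconstruction y, both in [0, 1]
    y = np.clip(y, eps, 1.0 - eps)
    return -np.mean(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))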


For an unrestricted reconstruction the squared difference could be used as the loss, but how should it be combined with, or weighted against, the KL divergence term?

Or are there other, more useful reconstruction loss formulations? Are there loss formulations that pursue a similar goal as the log loss but for unrestricted values (centered around $0$), and would that even make sense?
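One candidate in that spirit is a Gaussian negative log-likelihood: with a fixed unit variance it reduces to the squared difference up to constants, and with a learned variance it additionally weights the error per dimension. A minimal NumPy sketch with a fixed scale (illustrative only; `sigma` is an assumed hyperparameter):

import numpy as np

def gaussian_nll(x, y, sigma=1.0):
    # Negative log-likelihood of x under N(y, sigma^2);
    # with sigma = 1 this is 0.5 * squared error plus a constant.
    return np.mean(0.5 * ((x - y) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))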


For clarification, here is a simple VAE model that I'm using:

import tensorflow as tf

tfd = tf.contrib.distributions

# input_tensor is assumed to be a [batch, values] float tensor defined elsewhere
code_size = 5
input_size = input_tensor.shape[-1].value  # input_tensor shape: [batch, values]
net = input_tensor

# Encoder: the second layer outputs the mean and the (pre-softplus) scale of the code
net = tf.layers.dense(net, units=input_size, activation=tf.nn.leaky_relu, name='Encoding_Layer_1')
net = tf.layers.dense(net, units=code_size * 2, activation=None, name='Encoding_Layer_2')

posterior = tfd.MultivariateNormalDiagWithSoftplusScale(net[:, :code_size], net[:, code_size:])

code = posterior.mean()    # deterministic code, e.g. for inspection
net = posterior.sample()   # reparameterized sample fed to the decoder

# Decoder: maps the sampled code back to the input dimensionality
net = tf.layers.dense(net, units=input_size, activation=tf.nn.leaky_relu, name='Decoding_Layer_1')
net = tf.layers.dense(net, units=input_size, activation=None, name='Decoding_Layer_2')

reconstruction = net

# Per-example squared-error reconstruction loss
image_loss = tf.reduce_mean(tf.square(input_tensor - reconstruction), 1)

# KL divergence between the approximate posterior and a standard normal prior
prior = tfd.MultivariateNormalDiag(tf.zeros(code_size), tf.ones(code_size))
divergence = tfd.kl_divergence(posterior, prior)

loss = image_loss + 1e-4 * divergence

Best Answer

If you care about modeling unrestricted $x$, you shouldn't worry about $z$'s distribution much, so the KL term is not a problem here.

Instead, just replace your Bernoulli decoder $p(x|z)$ with a Gaussian one:

$$ p(x|z) = \mathcal{N}(\mu(z), \sigma(z)) $$

That is, a multivariate normal distribution whose mean and scale are generated by a neural network from $z$.
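A minimal sketch of how the decoder in the model above could be adapted to this, reusing `posterior`, `input_size`, `input_tensor`, `divergence` and `tfd` from the question's code (layer names and the KL weight are illustrative): the decoder outputs a mean and a softplus scale, and the reconstruction loss becomes the negative log-likelihood of the unrestricted input.

net = posterior.sample()  # sampled code, as in the question's model

# Decoder now outputs 2 * input_size values: mean and (pre-softplus) scale
net = tf.layers.dense(net, units=input_size, activation=tf.nn.leaky_relu, name='Gaussian_Decoding_Layer_1')
net = tf.layers.dense(net, units=input_size * 2, activation=None, name='Gaussian_Decoding_Layer_2')

decoder_mean = net[:, :input_size]
decoder_scale = tf.nn.softplus(net[:, input_size:])

likelihood = tfd.MultivariateNormalDiag(decoder_mean, decoder_scale)

# Reconstruction loss: negative log-likelihood of the (unrestricted) input
reconstruction_loss = -likelihood.log_prob(input_tensor)

loss = tf.reduce_mean(reconstruction_loss + 1e-4 * divergence)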
