Variational Autoencoders


Amaires@May 2024
A Variational AutoEncoder (VAE) is an approach to generative modeling. In addition to its capability to generate new samples within the same population as existing ones, it provides a probabilistic way of describing samples in a latent space.

1 K-L Divergence

Generative modeling relies heavily on metrics of similarity between two distributions, among which the most commonly used is the K-L divergence, short for Kullback–Leibler divergence. It is defined below for two distributions with probability density functions p1(x) and p2(x):

\[
KL(p_1(x), p_2(x)) = \int p_1(x)\log\frac{p_1(x)}{p_2(x)}\,dx \tag{1}
\]

K-L divergence has two important properties.

  1. It is obvious that K-L divergence is not symmetric in terms of p1(x) and p2(x).
  2. It is always non-negative, and it is 0 iff p1(x) and p2(x) are the same everywhere. To see why, we can break the K-L divergence into two parts:

\[
KL(p_1(x), p_2(x)) = \int p_1(x)\log\frac{p_1(x)}{p_2(x)}\,dx = \int p_1(x)\log p_1(x)\,dx - \int p_1(x)\log p_2(x)\,dx = -\int p_1(x)\log p_2(x)\,dx - \left(-\int p_1(x)\log p_1(x)\,dx\right) \tag{2}
\]

The second term in (2), with the negative sign, is p1’s information-theoretic entropy. The first term, also with the negative sign, is the cross entropy between p1 and p2. The first term is always no smaller than the second term per Gibbs’ inequality.
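As a quick numerical illustration of both properties (a minimal sketch using two arbitrary discrete distributions, so the integral becomes a sum):

```python
import numpy as np

def kl(p1, p2):
    """Discrete K-L divergence: sum_i p1_i * log(p1_i / p2_i)."""
    return np.sum(p1 * np.log(p1 / p2))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.2, 0.5, 0.3])

print(kl(p1, p2), kl(p2, p1))  # both positive, but different: K-L divergence is not symmetric
print(kl(p1, p1))              # exactly 0 when the two distributions match everywhere
```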

2 Intuition

The concept of the autoencoder predates the VAE. An autoencoder, shown in Figure 1, consists of an encoder Eϕ and a decoder D𝜃. Eϕ, a deep neural network parameterized by ϕ, takes a sample x from the population 𝕏 and maps it to z = Eϕ(x) in the latent space. D𝜃, another deep neural network parameterized by 𝜃, aims to reconstruct x: it takes z as input and maps it to x~ = D𝜃(z) = D𝜃(Eϕ(x)). The latent space is usually of a lower dimension than 𝕏, and thus Eϕ is considered to possess some compression capability and unsupervised feature extraction capability.

Figure 1: Autoencoder

The training of an autoencoder minimizes the reconstruction loss: the expected L2 distance between x and x~:

\[
\min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - \tilde{x}_i\|^2 = \min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - D_\theta(E_\phi(x_i))\|^2
\]
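For concreteness, here is a minimal PyTorch sketch of an autoencoder trained with this loss; the layer sizes, activation, and optimizer are illustrative assumptions rather than anything prescribed above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # E_phi: maps x to a lower-dimensional code z
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        # D_theta: maps z back to a reconstruction x~ of x
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                          # a stand-in batch of samples
loss = ((x - model(x)) ** 2).sum(dim=1).mean()   # (1/n) * sum_i ||x_i - x~_i||^2
opt.zero_grad()
loss.backward()
opt.step()
```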

Once trained, the decoder D𝜃 is, to some extent, already a generative model in that it can create samples in 𝕏 given a sample z. The distribution of z, or even its range, however, is unknown, which prevents effective sampling. Ideally, we’d like z to follow some simple distribution, such as N(0,I), which is easy to sample from. As summarized in Figure 2, the VAE makes a few changes to the autoencoder architecture so that D𝜃 can take samples from N(0,I) as input and map them to 𝕏.

Figure 2: Variational autoencoder

As Figure 2 shows, the encoder Eϕ no longer outputs a single code z; instead it outputs μϕ(x) and σϕ²(x), the mean and diagonal variance of a normal distribution N(μϕ(x),σϕ²(x)) from which z is sampled, and these outputs are penalized for deviating from N(0,I). How exactly are μϕ(x) and σϕ²(x) penalized? Compute the K-L divergence between N(μϕ(x),σϕ²(x)) and N(0,I) as below:

\[
KL(N(\mu_\phi,\sigma_\phi^2), N(0,I)) = \int N(z;\mu_\phi,\sigma_\phi^2)\log\frac{N(z;\mu_\phi,\sigma_\phi^2)}{N(z;0,I)}\,dz = \frac{1}{2}\sum_{k=1}^{d}\left(\mu_{\phi,k}^2 + \sigma_{\phi,k}^2 - \log\sigma_{\phi,k}^2 - 1\right) \tag{3}
\]

In (3), d is the dimension of the latent space. Removing the constants from (3) and estimating it with samples, our final K-L divergence loss is

\[
\min_{\phi}\;\frac{1}{n}\sum_i \sum_{k=1}^{d}\left(\mu_{\phi,k}^2(x_i) + \sigma_{\phi,k}^2(x_i) - \log\sigma_{\phi,k}^2(x_i)\right) \tag{4}
\]
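A sketch of this K-L loss term, assuming the encoder Eϕ produces tensors mu and sigma_sq of shape (n, d):

```python
import torch

def kl_loss(mu, sigma_sq):
    """Equation (4): batch average of sum_k (mu_k^2 + sigma_k^2 - log sigma_k^2)."""
    return (mu ** 2 + sigma_sq - torch.log(sigma_sq)).sum(dim=1).mean()
```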

The reconstruction loss for the VAE is also slightly different from that of a regular autoencoder. It can be estimated by the following equation, given a function S(μ,σ²) that returns a sample from N(μ,σ²).

\[
\min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - \tilde{x}_i\|^2 = \min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - D_\theta(S(\mu_\phi(x_i), \sigma_\phi^2(x_i)))\|^2
\]

This formulation has one big problem: S(·,·) is not differentiable, which makes the reconstruction loss not amenable to back-propagation based optimization. Luckily, by leveraging the reparameterization trick for normal distributions, it is easy to rewrite S(μϕ(xi),σϕ²(xi)) as μϕ(xi) + S(0,I) ⊙ σϕ(xi), where ⊙ is the element-wise product and σϕ(xi) is the vector of standard deviations, i.e. the element-wise square root of the diagonal of σϕ²(xi). The final formulation for the reconstruction loss therefore is

\[
\min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - \tilde{x}_i\|^2 = \min_{\theta,\phi}\;\frac{1}{n}\sum_i \|x_i - D_\theta(\mu_\phi(x_i) + S(0,I)\odot\sigma_\phi(x_i))\|^2 \tag{5}
\]
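The reparameterized sampling step itself is a one-liner; a sketch, assuming mu and sigma are the encoder's mean and standard-deviation vectors:

```python
import torch

def reparameterize(mu, sigma):
    """Sample z ~ N(mu, sigma^2) as mu + S(0, I) * sigma, element-wise.

    All randomness lives in eps, so gradients flow through mu and sigma.
    """
    eps = torch.randn_like(sigma)  # S(0, I)
    return mu + eps * sigma
```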

The total loss combines the K-L divergence loss in (4) and the reconstruction loss in (5) with a weight hyperparameter λ:

\[
\min_{\theta,\phi}\;\left(\frac{1}{n}\sum_i \|x_i - D_\theta(\mu_\phi(x_i) + S(0,I)\odot\sigma_\phi(x_i))\|^2\right) + \lambda\,\frac{1}{n}\sum_i \sum_{k=1}^{d}\left(\mu_{\phi,k}^2(x_i) + \sigma_{\phi,k}^2(x_i) - \log\sigma_{\phi,k}^2(x_i)\right) \tag{6}
\]

λ controls the relative importance between reconstructing the original samples and making sure z follows N(0,I). It is likely that different data sets require different λ.

In practice, instead of outputting σϕ²(xi), Eϕ outputs log σϕ²(xi), but that is only a minor engineering detail.
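Putting (4) and (5) together as in (6), here is a minimal end-to-end training-step sketch; the architecture, layer sizes, and λ = 1 are illustrative assumptions, and the encoder outputs the log-variance as just described:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)       # mu_phi(x)
        self.logvar = nn.Linear(256, z_dim)   # log sigma_phi^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        sigma = torch.exp(0.5 * logvar)           # standard deviation
        z = mu + torch.randn_like(sigma) * sigma  # reparameterization trick
        return self.dec(z), mu, logvar

model, lam = VAE(), 1.0
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                                    # stand-in batch
x_tilde, mu, logvar = model(x)
recon = ((x - x_tilde) ** 2).sum(dim=1).mean()             # reconstruction term of (6)
kl = (mu ** 2 + logvar.exp() - logvar).sum(dim=1).mean()   # K-L term (4)
loss = recon + lam * kl
opt.zero_grad()
loss.backward()
opt.step()
```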

3 Bayesian View

This section derives the total loss objective function through a Bayesian view.

The maximum likelihood method is often used to optimize a neural network that takes samples xi as input and produces p𝜃(xi). Assuming the xi are i.i.d. samples, the likelihood of observing all of them is p(x1,x2,x3,…,xn) = ∏i p𝜃(xi). The training objective is to maximize ∏i p𝜃(xi), which is equivalent to minimizing the expected negative log-likelihood:

\[
\min_{\theta}\; -\frac{1}{n}\sum_i \log p_\theta(x_i)
\]

Consider for now only the decoder D𝜃 part of the VAE. It maps a sample z to x~, but it can also be viewed as outputting parameters for p𝜃(x|z). More specifically, it outputs μ𝜃(z), the mean of N(x;μ𝜃(z),I). If the maximum likelihood method is to be used for finding the optimal 𝜃, p𝜃(x) is needed, and it can be calculated this way:
\[
p_\theta(x) = \int p_\theta(x,z)\,dz = \int p(z)p_\theta(x|z)\,dz = E_{z\sim p(z)}\,p_\theta(x|z)
\]
Estimating p𝜃(x) this way, however, is intractable: when z is high-dimensional, very few samples drawn from p(z) land where p𝜃(x|z) is non-negligible.

Assuming there is an effective way of sampling z that follows a distribution pϕ(z|x), which may or may not be equal to p𝜃(z|x), p𝜃(x) can be calculated the following way:

\[
p_\theta(x) = \int p_\theta(x,z)\,dz = \int p_\phi(z|x)\frac{p_\theta(x,z)}{p_\phi(z|x)}\,dz = E_{z\sim p_\phi(z|x)}\frac{p_\theta(x,z)}{p_\phi(z|x)}
\]

Note that in the derivation above, the only requirement on pϕ(z|x) is that it be a valid probability density function (whose support covers that of p𝜃(x,z)). Is there such a pϕ(z|x) that is easy to sample from? Yes; that is exactly the responsibility of the VAE’s encoder Eϕ, which takes in x and outputs the parameters of the normal distribution pϕ(z|x): μϕ(x) and σϕ²(x).

With p𝜃(x) estimated this way, log p𝜃(x) becomes:

\[
\log p_\theta(x) = \log E_{z\sim p_\phi(z|x)}\frac{p_\theta(x,z)}{p_\phi(z|x)}
\]
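As an aside, this expectation can be approximated by averaging the importance weights over several samples z ∼ pϕ(z|x). A sketch, where log_p_xz and log_q_zx are assumed to hold log p𝜃(x, z_k) and log pϕ(z_k|x) for K such samples (log-space averaging is used only for numerical stability):

```python
import math
import torch

def log_px_estimate(log_p_xz, log_q_zx):
    """Approximate log p_theta(x) = log E_{z~p_phi(z|x)}[p_theta(x,z) / p_phi(z|x)].

    log_p_xz, log_q_zx: shape (K,) tensors for K samples z_k ~ p_phi(z|x).
    """
    log_w = log_p_xz - log_q_zx  # log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(log_w.shape[0])
```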

3.1 Change of Optimization Objective

Even with all these derivation steps, it is still not clear how to calculate log p𝜃(x) precisely.

Since log(·) is a concave function, Jensen’s inequality states that log E[X] ≥ E[log X]. We thus have:

\[
\log p_\theta(x) = \log E_{z\sim p_\phi(z|x)}\frac{p_\theta(x,z)}{p_\phi(z|x)} \ge E_{z\sim p_\phi(z|x)}\log\frac{p_\theta(x,z)}{p_\phi(z|x)}
\]

It is now possible to estimate the right hand side of the inequality, broadly known as the Evidence Lower BOund (ELBO), which can be rewritten further:

\[
\text{ELBO} = E_{z\sim p_\phi(z|x)}\log\frac{p_\theta(x,z)}{p_\phi(z|x)} = E_{z\sim p_\phi(z|x)}\log\frac{p_\theta(x|z)p(z)}{p_\phi(z|x)} = E_{z\sim p_\phi(z|x)}\left[\log p_\theta(x|z) + \log p(z) - \log p_\phi(z|x)\right] = E_{z\sim p_\phi(z|x)}\log p_\theta(x|z) - KL(p_\phi(z|x), p(z)) \tag{7}
\]

Since p(z) = N(0,I), the second term in (7), as already calculated by (3) in Section 2, is:

\[
KL(p_\phi(z|x), p(z)) = \frac{1}{2}\sum_{k=1}^{d}\left(\mu_{\phi,k}^2 + \sigma_{\phi,k}^2 - \log\sigma_{\phi,k}^2 - 1\right)
\]

At the beginning of this section, p𝜃(x|z) was already required to take the form N(μ𝜃(z),I), which means:

\[
\log p_\theta(x_i|z) = \log\left[(2\pi)^{-\frac{d}{2}}\exp\left(-\frac{1}{2}\|x_i - \mu_\theta(z)\|^2\right)\right] = \text{const} - \frac{1}{2}\|x_i - \mu_\theta(z)\|^2
\]

If, in the process of estimating Ez∼pϕ(z|x) log p𝜃(x|z), only a single sample z is drawn, equal to μϕ(xi) + S(0,I) ⊙ σϕ(xi), the first term in (7) gives

\[
E_{z\sim p_\phi(z|x)}\log p_\theta(x|z) \approx \text{const} - \frac{1}{2}\|x_i - \mu_\theta(\mu_\phi(x_i) + S(0,I)\odot\sigma_\phi(x_i))\|^2
\]

Putting the maximum likelihood objective and the two terms of the ELBO together, we arrive at:

\[
\min_{\theta}\; -\frac{1}{n}\sum_i \log p_\theta(x_i) \le \min_{\theta,\phi}\; -\frac{1}{n}\sum_i \text{ELBO} = \min_{\theta,\phi}\; -\frac{1}{n}\sum_i\left(\text{const} - \frac{1}{2}\|x_i - \mu_\theta(\mu_\phi(x_i)+S(0,I)\odot\sigma_\phi(x_i))\|^2 + \text{const} - \frac{1}{2}\sum_{k=1}^{d}\left(\mu_{\phi,k}^2 + \sigma_{\phi,k}^2 - \log\sigma_{\phi,k}^2\right)\right) = \text{const} + \frac{1}{2}\cdot\frac{1}{n}\sum_i\left(\|x_i - \mu_\theta(\mu_\phi(x_i)+S(0,I)\odot\sigma_\phi(x_i))\|^2 + \sum_{k=1}^{d}\left(\mu_{\phi,k}^2 + \sigma_{\phi,k}^2 - \log\sigma_{\phi,k}^2\right)\right)
\]

Removing the constants and the factor of 1/2, our final optimization objective shown above is identical to (6) in Section 2, keeping in mind that μ𝜃 here plays the role of the decoder D𝜃 and that this derivation fixes the weight λ at 1.

3.2 The ELBO gap

Since the optimization objective is changed from the log-likelihood to the ELBO, it is helpful to understand the gap between the two.

\[
\log p_\theta(x) - E_{z\sim p_\phi(z|x)}\log\frac{p_\theta(x,z)}{p_\phi(z|x)} = E_{z\sim p_\phi(z|x)}\left(\log p_\theta(x) - \log\frac{p_\theta(x,z)}{p_\phi(z|x)}\right) = E_{z\sim p_\phi(z|x)}\log\frac{p_\theta(x)\,p_\phi(z|x)}{p_\theta(x,z)} = E_{z\sim p_\phi(z|x)}\log\frac{p_\phi(z|x)}{p_\theta(z|x)} = KL(p_\phi(z|x), p_\theta(z|x)) \ge 0
\]

The gap is 0 when pϕ(z|x) and p𝜃(z|x) are identical.

4 Joint Distribution View

Section 3’s derivation starts from the maximum likelihood objective, and then switches to maximizing the ELBO. This section provides a simpler joint distribution approach to derive the ELBO objective directly, inspired by Jianlin Su at http://kexue.fm.

In a VAE, once trained, the decoder D𝜃 can be used as an independent generative model, without the encoder. The encoder can also be used without the decoder as a discriminative model. The training process is what links the two components together, so it is reasonable to require them to agree on the same joint distribution over x and z. That is, our objective is to minimize the K-L divergence between pϕ(x,z) = p(x)pϕ(z|x) and p𝜃(x,z) = p(z)p𝜃(x|z):

\begin{align*}
\min_{\phi,\theta} KL(p_\phi(x,z), p_\theta(x,z))
&= \min_{\phi,\theta} \iint p_\phi(x,z)\log\frac{p_\phi(x,z)}{p_\theta(x,z)}\,dz\,dx \\
&= \min_{\phi,\theta} \int\left[\int p(x)p_\phi(z|x)\log\frac{p(x)p_\phi(z|x)}{p_\theta(x,z)}\,dz\right]dx \\
&= \min_{\phi,\theta} \int p(x)\left[\int p_\phi(z|x)\log\frac{p(x)p_\phi(z|x)}{p_\theta(x,z)}\,dz\right]dx \\
&= \min_{\phi,\theta} E_{x\sim p(x)}\left[\int p_\phi(z|x)\log\frac{p(x)p_\phi(z|x)}{p_\theta(x,z)}\,dz\right] \\
&= \min_{\phi,\theta}\left[E_{x\sim p(x)}\int p_\phi(z|x)\log p(x)\,dz + E_{x\sim p(x)}\int p_\phi(z|x)\log\frac{p_\phi(z|x)}{p_\theta(x,z)}\,dz\right] \\
&= \min_{\phi,\theta}\left[E_{x\sim p(x)}\log p(x) + E_{x\sim p(x)}\int p_\phi(z|x)\log\frac{p_\phi(z|x)}{p_\theta(x,z)}\,dz\right] \\
&= \text{const} + \min_{\phi,\theta} E_{x\sim p(x)}\int p_\phi(z|x)\log\frac{p_\phi(z|x)}{p_\theta(x,z)}\,dz \\
&= \text{const} + \min_{\phi,\theta} E_{x\sim p(x)}\int p_\phi(z|x)\log\frac{p_\phi(z|x)}{p_\theta(x|z)p(z)}\,dz \\
&= \text{const} + \min_{\phi,\theta} E_{x\sim p(x)}\left[-E_{z\sim p_\phi(z|x)}\log p_\theta(x|z) + E_{z\sim p_\phi(z|x)}\log\frac{p_\phi(z|x)}{p(z)}\right] \\
&= \text{const} + \min_{\phi,\theta} E_{x\sim p(x)}\left[-E_{z\sim p_\phi(z|x)}\log p_\theta(x|z) + KL(p_\phi(z|x), p(z))\right] \\
&= \text{const} + \min_{\phi,\theta} E_{x\sim p(x)}\left[-\text{ELBO}\right]
\end{align*}

The definition of the ELBO in (7) from Section 3 can be used to verify the last step.

5 Latent Space

In the VAE’s training process, p(x) and p(z) = N(0,I) are given, and the VAE learns pϕ(z|x) and p𝜃(x|z) simultaneously. Note, however, that p𝜃(x) is never directly optimized to match p(x). This could be one major reason why VAEs are not known to generate very realistic images.

The VAE’s encoder, on the other hand, is a very reasonable feature extraction tool. Suppose there is a collection of human face pictures, each labelled with whether the person has large eyes. Denote these samples by (x,y), where x is the image, and y = 1 if the person has large eyes and 0 otherwise. A vector e in the latent space calculated the following way probably captures the latent representation of large eyes:

\[
e = E_{x\sim p(x|y=1)}\,\mu_\phi(x) - E_{x\sim p(x|y=0)}\,\mu_\phi(x)
\]

Given any human face picture x, μ𝜃(μϕ(x) + λe) should generate a variation of x that has bigger or smaller eyes as λ varies.
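A sketch of this latent manipulation, reusing the hypothetical VAE model from the earlier training-step sketch (model.enc, model.mu, model.dec) and assuming faces_big_eyes, faces_small_eyes, and x are preprocessed image tensors:

```python
import torch

with torch.no_grad():
    # e = E[mu_phi(x) | y=1] - E[mu_phi(x) | y=0]
    mu_big = model.mu(model.enc(faces_big_eyes)).mean(dim=0)
    mu_small = model.mu(model.enc(faces_small_eyes)).mean(dim=0)
    e = mu_big - mu_small

    # Decode variations of a single face x along the "large eyes" direction.
    z = model.mu(model.enc(x.unsqueeze(0)))
    variations = [model.dec(z + lam * e) for lam in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```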