Generative modeling relies heavily on measuring the similarity between two distributions, and the most commonly used metric is the K-L divergence, short for Kullback–Leibler divergence. It is defined below for two distributions with probability density functions $p$ and $q$:

$$D_{KL}(p \,\|\, q) = \int p(x)\, \log \frac{p(x)}{q(x)}\, dx \tag{1}$$
K-L divergence has two important properties: it is always non-negative, equaling zero only when $p$ and $q$ are identical, and it is not symmetric in $p$ and $q$. The non-negativity can be seen by rewriting (1) as

$$D_{KL}(p \,\|\, q) = \left(-\int p(x)\, \log q(x)\, dx\right) - \left(-\int p(x)\, \log p(x)\, dx\right) \tag{2}$$

The second term in (2), with the negative sign, is $p$'s information theoretic entropy. The first term, also with the negative sign, is the cross entropy between $p$ and $q$. The first term is always no smaller than the second term per Gibbs' inequality, which gives $D_{KL}(p \,\|\, q) \ge 0$.
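As a quick sanity check, the following sketch (plain numpy, with an arbitrarily chosen pair of discrete distributions) evaluates (1) directly and via the cross-entropy-minus-entropy form in (2); both give the same non-negative number.

```python
import numpy as np

# Two arbitrary discrete distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))           # definition (1)
cross_entropy = -np.sum(p * np.log(q))   # first term of (2)
entropy = -np.sum(p * np.log(p))         # second term of (2)

print(kl)                       # ≈ 0.0253
print(cross_entropy - entropy)  # same value as above
print(kl >= 0)                  # True, per Gibbs' inequality
```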
The concept of autoencoders predates the VAE. An autoencoder, shown in Figure 1, consists of an encoder $E_\phi$ and a decoder $D_\theta$. $E_\phi$, a deep neural network parameterized by $\phi$, takes a sample $x$ from the population $X$ and maps it to $z$ in a latent space $Z$. $D_\theta$, another deep neural network parameterized by $\theta$, aiming to reconstruct $x$, takes $z$ as input and maps it to $\hat{x}$. $z$ is usually of a lower dimension than $x$, and thus $E_\phi$ is considered to possess some compression capability and unsupervised feature extraction capability.
The training of an autoencoder minimizes the reconstruction loss, the expected distance between $x$ and $\hat{x}$:

$$L_{recon} = \mathbb{E}_{x}\left[\,\big\|x - D_\theta(E_\phi(x))\big\|^2\,\right]$$
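A minimal PyTorch sketch of such an autoencoder is below; the layer sizes, input dimension, and the use of squared error as the distance are illustrative assumptions, not the only possible choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # E_phi: maps x to a lower-dimensional z
        self.encoder = nn.Sequential(
            nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim)
        )
        # D_theta: maps z back to a reconstruction of x
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.rand(16, 784)                               # a stand-in batch of flattened samples
x_hat = model(x)
recon_loss = (x - x_hat).pow(2).sum(dim=1).mean()     # squared distance, averaged over the batch
recon_loss.backward()
```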
Once trained, the decoder $D_\theta$, to some extent, is already a generative model in that it can create samples in $X$ given a sample $z \in Z$. The distribution of $z$, or even the range of $z$, however, is unknown, which prevents its effective sampling. Ideally, we'd like $z$ to follow some simple distribution, such as $\mathcal{N}(0, I)$, as it is easy to sample from. As summarized in Figure 2, VAE makes a few changes to the autoencoder architecture to make $D_\theta$ able to take samples from $\mathcal{N}(0, I)$ as input and map them to $X$: the encoder outputs the parameters $\mu$ and $\sigma$ of a normal distribution $\mathcal{N}(\mu, \Sigma)$ with $\Sigma = \operatorname{diag}(\sigma^2)$, $z$ is drawn from that distribution before being fed to $D_\theta$, and $\mu$ and $\sigma$ are penalized for deviating from $\mathcal{N}(0, I)$.
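The payoff of these changes is that generation reduces to the short sketch below (PyTorch, with an untrained stand-in decoder only so that the snippet runs on its own):

```python
import torch
import torch.nn as nn

# Stand-in for a trained D_theta mapping a 32-dim z to a 784-dim sample.
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

# Once z is known to follow N(0, I), generating new samples is just:
z = torch.randn(16, 32)          # 16 latent vectors drawn from N(0, I)
with torch.no_grad():
    new_samples = decoder(z)     # mapped into the data space X
print(new_samples.shape)         # torch.Size([16, 784])
```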
How exactly are $\mu$ and $\sigma$ penalized? Compute the K-L divergence between $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(0, I)$ as below:

$$D_{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\Big(\operatorname{tr}(\Sigma) + \mu^\top\mu - d - \log\det\Sigma\Big) \tag{3}$$
In (3), $d$ is the dimension of $z$. Removing the constant $-d$ from (3) and estimating it with $N$ samples, with the encoder's outputs for sample $x_i$ denoted $\mu_i$ and $\sigma_i$, our final K-L divergence loss is
$$L_{KL} = \frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{d}\Big(\sigma_{i,j}^2 + \mu_{i,j}^2 - \log\sigma_{i,j}^2\Big) \tag{4}$$
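The loss in (4) translates almost literally into code. A minimal PyTorch sketch, assuming the encoder produces `mu` and `sigma` tensors of shape `(N, d)`:

```python
import torch

def kl_loss(mu, sigma):
    # Equation (4): average over samples of
    # 0.5 * sum_j (sigma^2 + mu^2 - log(sigma^2))
    return 0.5 * (sigma**2 + mu**2 - torch.log(sigma**2)).sum(dim=1).mean()

mu = torch.zeros(16, 32)      # encoder outputs matching N(0, I) exactly...
sigma = torch.ones(16, 32)
print(kl_loss(mu, sigma))     # ...give the minimum of (4), d/2 = 16.0 here
                              # (the dropped constant -d accounts for the offset from 0)
```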
The reconstruction loss for VAE is also slightly different from that for a regular autoencoder. It can be estimated by the following equation, given a function $s(\mu, \Sigma)$ that returns a sample from $\mathcal{N}(\mu, \Sigma)$:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N}\big\|x_i - D_\theta\big(s(\mu_i, \Sigma_i)\big)\big\|^2$$
This formulation has one big problem: $s(\mu, \Sigma)$ is not differentiable with respect to $\mu$ and $\Sigma$, which makes the reconstruction loss not amenable to back-propagation based optimization. Luckily, it is easy to rewrite $s(\mu, \Sigma)$ as $\mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\odot$ is the element-wise product and $\sigma$ is the element-wise square root of $\Sigma$'s diagonal arranged in vector form, by leveraging the reparameterization trick for normal distributions. The final formulation for the reconstruction loss therefore is
$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N}\big\|x_i - D_\theta(\mu_i + \sigma_i \odot \epsilon_i)\big\|^2, \qquad \epsilon_i \sim \mathcal{N}(0, I) \tag{5}$$
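In code, the reparameterization trick behind (5) looks roughly like the sketch below (PyTorch; the `decoder`, `mu`, and `sigma` tensors are untrained stand-ins rather than outputs of a real VAE):

```python
import torch
import torch.nn as nn

decoder = nn.Linear(32, 784)                          # stand-in for a trained D_theta

x = torch.rand(16, 784)                               # a batch of samples
mu = torch.randn(16, 32, requires_grad=True)          # pretend encoder outputs
sigma = torch.full((16, 32), 0.5, requires_grad=True)

eps = torch.randn_like(mu)            # eps ~ N(0, I); no gradient flows through it
z = mu + sigma * eps                  # differentiable with respect to mu and sigma

recon_loss = (x - decoder(z)).pow(2).sum(dim=1).mean()   # equation (5)
recon_loss.backward()                 # gradients now reach mu and sigma
print(mu.grad.shape, sigma.grad.shape)
```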
The total loss combines the K-L divergence loss in (4) and the reconstruction loss in (5) with a weight hyperparameter $\lambda$:

$$L = L_{recon} + \lambda\, L_{KL} \tag{6}$$
$\lambda$ controls the relative importance between reconstructing the original samples and making sure $z$ follows $\mathcal{N}(0, I)$. It is likely that different data sets require different values of $\lambda$.
In practice, instead of outputting $\sigma$, $E_\phi$ outputs the log variance $\log \sigma^2$, but that is only a minor engineering detail.
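Putting (4), (5), and (6) together, a compact PyTorch sketch of a VAE and one training step might look as follows; the layer sizes, the value of `lam` (the $\lambda$ in (6)), and the log-variance parameterization just mentioned are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.to_mu = nn.Linear(h_dim, z_dim)        # E_phi outputs mu...
        self.to_logvar = nn.Linear(h_dim, z_dim)    # ...and log(sigma^2)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, lam=1.0):
    recon = (x - x_hat).pow(2).sum(dim=1).mean()                      # (5)
    kl = 0.5 * (logvar.exp() + mu.pow(2) - logvar).sum(dim=1).mean()  # (4), sigma^2 = exp(logvar)
    return recon + lam * kl                                           # (6)

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                 # stand-in batch
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar, lam=1.0)
opt.zero_grad()
loss.backward()
opt.step()
```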
This section derives the total loss objective function through a Bayesian view.
The maximum likelihood method is often used to optimize a neural network that takes samples $x_1, x_2, \ldots, x_N$ as input and produces their likelihoods $p_\theta(x_i)$. Assuming these are i.i.d. samples, the likelihood of observing all of them is $\prod_{i=1}^{N} p_\theta(x_i)$. The training objective is to maximize this likelihood, which is equivalent to minimizing the average negative log likelihood:

$$-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)$$
Consider for now only the decoder part of VAE. It maps a sample $z$ to $\hat{x}$, but it can also be viewed as producing the parameters of the conditional distribution $p_\theta(x \mid z)$; more specifically, it produces $D_\theta(z)$, the mean in $p_\theta(x \mid z) = \mathcal{N}(D_\theta(z), cI)$ for some fixed variance $c$. If the maximum likelihood method is to be used for finding the optimal $\theta$, $p_\theta(x)$ is needed, which can be calculated this way: $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$. Estimating $p_\theta(x)$ this way, however, is intractable due to the number of dimensions $z$ potentially has.
Assuming there is an effective way of sampling $z$ that follows a distribution $q(z \mid x)$, which may or may not be equal to $p(z)$, $p_\theta(x)$ can be calculated the following way:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz = \int q(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q(z \mid x)}\, dz = \mathbb{E}_{z \sim q(z \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q(z \mid x)}\right]$$
Note that in the derivation above, the only requirement on $q(z \mid x)$ is to be a valid probability density function. Is there such a $q(z \mid x)$ that is easy to sample from? Yes, that is exactly the responsibility of VAE's encoder $E_\phi$, which takes in $x$ and spits out the parameters of the normal distribution $q_\phi(z \mid x) = \mathcal{N}(\mu, \Sigma)$: $\mu$ and $\Sigma$ (or rather $\sigma$).
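To see why the choice of $q(z \mid x)$ matters in practice, here is a toy numpy sketch. It assumes a one-dimensional model $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x \mid z) = \mathcal{N}(z, 1)$, chosen so that the exact value $p_\theta(x) = \mathcal{N}(x;\, 0, 2)$ is known, and uses a $q$ equal to the exact posterior $\mathcal{N}(x/2,\, 1/2)$ purely for illustration: the $q$-based estimate is far more accurate than drawing $z$ from the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 3.0                                   # the observed sample
exact = normal_pdf(x, 0.0, 2.0)           # p(x) = N(x; 0, 2) for this toy model

n = 10_000
# Naive: sample z from the prior p(z) = N(0, 1); most z contribute a tiny p(x|z).
z_prior = rng.normal(0.0, 1.0, n)
est_prior = np.mean(normal_pdf(x, z_prior, 1.0))

# Sampling z from q(z|x) = N(x/2, 1/2) and reweighting, as in the derivation above.
z_q = rng.normal(x / 2, np.sqrt(0.5), n)
weights = normal_pdf(x, z_q, 1.0) * normal_pdf(z_q, 0.0, 1.0) / normal_pdf(z_q, x / 2, 0.5)
est_q = np.mean(weights)

print(exact, est_prior, est_q)  # the q-based estimate matches `exact` almost perfectly;
                                # the prior-based one is noticeably noisier
```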
With $p_\theta(x)$ estimated this way, the negative log likelihood objective becomes:

$$-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i) = -\frac{1}{N}\sum_{i=1}^{N} \log \mathbb{E}_{z \sim q_\phi(z \mid x_i)}\left[\frac{p_\theta(x_i \mid z)\, p(z)}{q_\phi(z \mid x_i)}\right]$$
With all the derivation steps, it is still not clear how to calculate $\log p_\theta(x)$ precisely: the logarithm of an expectation does not simplify easily.
Given that $\log$ is a concave function, Jensen's inequality states that $\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]$. We thus have:

$$\log p_\theta(x) = \log \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$
It is now possible to estimate the right hand side of the inequality, broadly known as the Evidence Lower BOund (ELBO), which can be rewritten further:

$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \tag{7}$$
Since $p(z) = \mathcal{N}(0, I)$ and $q_\phi(z \mid x) = \mathcal{N}(\mu, \Sigma)$, the second term in (7), as already calculated by (3) in Section 2, is:

$$D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2}\Big(\operatorname{tr}(\Sigma) + \mu^\top\mu - d - \log\det\Sigma\Big)$$
At the beginning of this section, $p_\theta(x \mid z)$ is already required to take the form of $\mathcal{N}(D_\theta(z), cI)$, which means:

$$\log p_\theta(x \mid z) = -\frac{1}{2c}\,\big\|x - D_\theta(z)\big\|^2 + \text{const}$$
If, in the process of estimating $\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$, only one single sample $z$ is drawn, which is equal to $\mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the first term in (7) gives

$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \approx -\frac{1}{2c}\,\big\|x - D_\theta(\mu + \sigma \odot \epsilon)\big\|^2 + \text{const}$$
Putting the maximum likelihood objective and the two terms of the ELBO together, we've arrived at the following estimate of an upper bound on the average negative log likelihood, which is what VAE minimizes:

$$\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2c}\big\|x_i - D_\theta(\mu_i + \sigma_i \odot \epsilon_i)\big\|^2 + \frac{1}{2}\sum_{j=1}^{d}\Big(\sigma_{i,j}^2 + \mu_{i,j}^2 - \log\sigma_{i,j}^2 - 1\Big)\right] + \text{const}$$
Removing the constant terms and scaling by $2c$, our final optimization objective shown above is identical to (6) in Section 2, keeping in mind that $\lambda = 2c$.
Since the optimization objective is changed from the log likelihood to the ELBO, it is helpful to understand the gap between the two:

$$\log p_\theta(x) - \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] = D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$

The gap is 0 when $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$ are identical.
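The same one-dimensional toy model used earlier ($p(z) = \mathcal{N}(0, 1)$, $p_\theta(x \mid z) = \mathcal{N}(z, 1)$, so the exact posterior is $\mathcal{N}(x/2,\, 1/2)$ and $\log p_\theta(x) = \log \mathcal{N}(x;\, 0, 2)$) can illustrate this numerically: with $q$ equal to the exact posterior the ELBO matches $\log p_\theta(x)$, while a mismatched $q$ leaves a gap.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

def elbo(x, q_mean, q_var, n=200_000):
    # Monte Carlo estimate of E_q[log(p(x|z) p(z) / q(z|x))]
    z = rng.normal(q_mean, np.sqrt(q_var), n)
    log_ratio = (log_normal_pdf(x, z, 1.0)            # log p(x|z)
                 + log_normal_pdf(z, 0.0, 1.0)        # log p(z)
                 - log_normal_pdf(z, q_mean, q_var))  # - log q(z|x)
    return np.mean(log_ratio)

x = 1.5
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact log p(x)

print(log_px)
print(elbo(x, x / 2, 0.5))   # q = exact posterior N(x/2, 1/2): no gap
print(elbo(x, 0.0, 1.0))     # q = prior N(0, 1): strictly smaller; the gap is KL(q || posterior)
```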
Section 3’s derivation starts from the maximum likelihood objective, and then switches to maximizing the ELBO. This section provides a simpler joint distribution approach to derive the ELBO objective directly, inspired by Jianlin Su at http://kexue.fm.
In VAE, once trained, the decoder can be used as an independent generative model, without the encoder. The encoder can also be used without the decoder as a discriminative model. The training process is what links both components together. It is reasonable to require them to agree on the same joint distribution over both $x$ and $z$: the encoder side defines $q(x, z) = \tilde{p}(x)\, q_\phi(z \mid x)$, where $\tilde{p}(x)$ is the data distribution, while the decoder side defines $p(x, z) = p(z)\, p_\theta(x \mid z)$. That is, our objective is to minimize the K-L divergence between $q(x, z)$ and $p(x, z)$:

$$D_{KL}\big(q(x, z)\,\|\,p(x, z)\big) = \iint \tilde{p}(x)\, q_\phi(z \mid x)\, \log \frac{\tilde{p}(x)\, q_\phi(z \mid x)}{p(z)\, p_\theta(x \mid z)}\, dz\, dx$$

Minimizing this joint K-L divergence turns out to be equivalent to maximizing the expected ELBO, and hence, per Section 3, to minimizing the total loss (6).
The definition of the ELBO in Section 3 can be used to verify this.
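One way to carry out that verification, spelled out here with the notation of Section 3, is to split the logarithm in the integrand:

$$
\begin{aligned}
D_{KL}\big(q(x,z)\,\|\,p(x,z)\big)
&= \mathbb{E}_{x \sim \tilde{p}(x)}\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \tilde{p}(x) + \log \frac{q_\phi(z \mid x)}{p(z)\, p_\theta(x \mid z)}\right] \\
&= \mathbb{E}_{x \sim \tilde{p}(x)}\big[\log \tilde{p}(x)\big] - \mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]\right]
\end{aligned}
$$

The first term is a constant determined by the data alone, and the second term is the ELBO of Section 3 averaged over the data, so minimizing the joint K-L divergence is the same as maximizing the expected ELBO.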
In VAE's training process, $\tilde{p}(x)$ (through its samples) and $p(z) = \mathcal{N}(0, I)$ are given, and VAE learns $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ simultaneously. Note, however, that the marginal $p_\theta(x)$ is never directly optimized to match $\tilde{p}(x)$; it is matched only indirectly through the joint distributions. This could be one major reason why VAE is not known to generate very realistic images.
VAE's encoder, on the other hand, is a very reasonable feature extraction tool. Suppose there are a bunch of sample human face pictures labelled with whether the person has large eyes or not. Denote these samples by $(x_i, y_i)$, where $x_i$ is the image, and $y_i = 1$ if the person has large eyes and $y_i = 0$ otherwise. A vector in $Z$ calculated the following way probably captures the latent representation of large eyes:

$$z_{\text{eyes}} = \frac{\sum_i y_i\, \mu_i}{\sum_i y_i} - \frac{\sum_i (1 - y_i)\, \mu_i}{\sum_i (1 - y_i)}$$

where $\mu_i$ is the mean vector output by $E_\phi$ for $x_i$.
Given any human face picture $x$ with encoder output $\mu$, $D_\theta(\mu + \alpha\, z_{\text{eyes}})$ should generate a variation of $x$ that has bigger or smaller eyes as $\alpha$ varies.
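A sketch of this attribute-vector idea in PyTorch is below; the `encoder` and `decoder` are untrained stand-ins (and the "labels" are random) purely so the snippet runs on its own, with a real VAE's trained components substituted in practice.

```python
import torch
import torch.nn as nn

# Stand-ins for a trained VAE encoder (returning mu, logvar) and decoder.
class Encoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(x_dim, z_dim)
        self.to_logvar = nn.Linear(x_dim, z_dim)
    def forward(self, x):
        return self.to_mu(x), self.to_logvar(x)

encoder = Encoder()
decoder = nn.Linear(32, 784)

images = torch.rand(100, 784)                  # labelled face images x_i (flattened)
labels = torch.randint(0, 2, (100,)).float()   # y_i = 1 for large eyes, 0 otherwise

with torch.no_grad():
    mu, _ = encoder(images)
    # Difference between the average latent codes of the two groups.
    z_eyes = mu[labels == 1].mean(dim=0) - mu[labels == 0].mean(dim=0)

    x = torch.rand(1, 784)                     # any face picture
    mu_x, _ = encoder(x)
    for alpha in (-2.0, 0.0, 2.0):             # sliding alpha changes eye size in the output
        variation = decoder(mu_x + alpha * z_eyes)
```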