Generative Adversarial Networks

Amaires@June 2024

Generative Adversarial Networks (GANs) are an approach to generative artificial intelligence. They were the first known models to produce new photorealistic images automatically.

1 Original GAN

With data samples that follow a certain unknown distribution p(x), the idea is to have a neural network generator G𝜃, parameterized by 𝜃, which transforms samples z ∼ N(0, I) into samples x̃ that follow p(x). Figure 1 depicts the architecture.

Figure 1: Generator G𝜃

In order to train G𝜃, some optimization objective is needed to give G𝜃 feedback about whether x̃ really looks like it is drawn from p(x). A binary discriminator/classifier Dϕ that returns the probability that x̃ follows p(x) serves the purpose. Figure 2 shows the architecture we have so far.

Figure 2: Generator G𝜃 and Discriminator Dϕ

The introduction of Dϕ merely defers the responsibility of guiding G𝜃: how do we train Dϕ? We could feed it both real samples (x, y) = (x, 1) and artificial samples (x, y) = (x̃, 0) created by G𝜃. Figure 3 shows the complete GAN architecture.

Figure 3: The Complete GAN Architecture

The loss function of Dϕ is the common binary cross entropy function, shown below (see amaires.github.io/OptimizationObjective/ for a refresher):

\[ L = \min_\phi\, -E_{x,y \sim p(x,y)}\big(y \log D_\phi(x) + (1 - y)\log(1 - D_\phi(x))\big) \tag{1} \]

Remove the negation in (1) and rewrite it in conditional expectation form:

\[ L = \max_\phi E_{y \sim p(y)} E_{x \sim p(x|y)}\big(y \log D_\phi(x) + (1 - y)\log(1 - D_\phi(x))\big) = \max_\phi \big[\Pr(y=1)\, E_{x \sim p(x|y=1)} \log D_\phi(x) + \Pr(y=0)\, E_{x \sim p(x|y=0)} \log(1 - D_\phi(x))\big] \]

If the same number of real samples x and artificial samples x̃ are fed to Dϕ in each batch/mini-batch, then Pr(y = 1) = Pr(y = 0) = 1/2. Also note that x ∼ p(x|y = 1) is the same as x ∼ p(x), and x ∼ p(x|y = 0) is the same as x̃ ∼ p𝜃(x̃|z), so L can be further written:

\[ L = \max_\phi \Big[\tfrac{1}{2} E_{x \sim p(x)} \log D_\phi(x) + \tfrac{1}{2} E_{\tilde{x} \sim p_\theta(\tilde{x})} \log\big(1 - D_\phi(\tilde{x})\big)\Big] = \max_\phi \Big[\tfrac{1}{2} E_{x \sim p(x)} \log D_\phi(x) + \tfrac{1}{2} E_{z \sim \mathcal{N}(0,I)} E_{\tilde{x} \sim p_\theta(\tilde{x}|z)} \log\big(1 - D_\phi(\tilde{x})\big)\Big] \]

After removing the constant 1/2, and noting that for every sample z only one sample x̃ is drawn (which is exactly what G𝜃 does), L becomes

\[ L = \max_\phi \Big[E_{x \sim p(x)} \log D_\phi(x) + E_{z \sim \mathcal{N}(0,I)} \log\big(1 - D_\phi(G_\theta(z))\big)\Big] \]

Note that the objective is parameterized by both 𝜃 and ϕ, but it is maximized only in terms of ϕ, the parameters of the discriminator. It is also worth noting that Dϕ gives the probability that a sample follows p(x), not p(x) itself. Similarly, G𝜃 generates a sample x̃; it does not compute p𝜃(x̃).

The process so far optimizes ϕ to make Dϕ better at telling real samples apart from samples created by a fixed G𝜃. For each batch/mini-batch of samples, ϕ takes a gradient ascent step. This process does not seem to improve G𝜃 whatsoever. If 𝜃 also took a gradient ascent step using ∇𝜃L, it too would make Dϕ better; this amounts to making samples created by G𝜃 more obviously fake, the opposite of what we want. The solution to this last piece of the GAN puzzle is to update 𝜃 with a gradient descent step. More formally, our objective function is

\[ L = \min_\theta \max_\phi \Big[E_{x \sim p(x)} \log D_\phi(x) + E_{z \sim \mathcal{N}(0,I)} \log\big(1 - D_\phi(G_\theta(z))\big)\Big] \]
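To make the alternating updates concrete, here is a minimal PyTorch sketch of this minimax training loop. It is an illustration rather than reference code: the toy data distribution, network sizes, and learning rates are all assumptions of mine.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 16, 2
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def real_batch(n):
        # hypothetical stand-in for samples drawn from the unknown p(x)
        return torch.randn(n, data_dim) + 3.0

    for step in range(1000):
        x, z = real_batch(128), torch.randn(128, latent_dim)
        x_fake = G(z)

        # D_phi: a gradient ascent step on the objective, i.e. a descent step
        # on binary cross entropy with y = 1 for real and y = 0 for artificial.
        d_loss = bce(D(x), torch.ones(128, 1)) + bce(D(x_fake.detach()), torch.zeros(128, 1))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # G_theta: a gradient descent step on E_z log(1 - D(G(z))).
        g_loss = torch.log(1.0 - D(G(z)) + 1e-8).mean()
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()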

Intuitively, Dϕ tries to tell real samples apart from artificial samples, and G𝜃 tries to create artificial samples that are hard to distinguish from real ones, hence the name generative adversarial network.

In GAN, p(x) and p𝜃(x) are never explicitly modeled. G𝜃 can produce good samples following p𝜃(x), but it does not know the form of p𝜃(x). In other words, G𝜃 defines a sampling process without knowing its distribution.

1.1 Jensen-Shannon Divergence

For a given G𝜃, Dϕ’s objective is to maximize

\[ L_{\theta,\phi} = E_{x \sim p(x)} \log D_\phi(x) + E_{x \sim p_\theta(x)} \log\big(1 - D_\phi(x)\big) = \int \big[p(x)\log D_\phi(x) + p_\theta(x)\log(1 - D_\phi(x))\big]\,dx \tag{2} \]

Let l𝜃,ϕ(x) be the function under the integral in (2). Assuming Dϕ is flexible and powerful enough, when L𝜃,ϕ is maximized over ϕ, l𝜃,ϕ(x) is also maximized pointwise. Taking l𝜃,ϕ(x)'s derivative with respect to Dϕ(x), we have

\[ \frac{dl_{\theta,\phi}(x)}{dD_\phi(x)} = \frac{p(x)}{D_\phi(x)} - \frac{p_\theta(x)}{1 - D_\phi(x)} \]

Setting the expression to 0, we derive the optimal Dϕ(x):

\[ D_\phi(x) = \frac{p(x)}{p(x) + p_\theta(x)} \tag{3} \]

Substituting (3) into (2), we have:

\[
\begin{aligned}
L_\theta = L_{\theta,\phi} &= \int \big[p(x)\log D_\phi(x) + p_\theta(x)\log(1 - D_\phi(x))\big]\,dx \\
&= \int \Big[p(x)\log \frac{p(x)}{p(x) + p_\theta(x)} + p_\theta(x)\log \frac{p_\theta(x)}{p(x) + p_\theta(x)}\Big]\,dx \\
&= \int \Big[p(x)\log \frac{\frac{1}{2}p(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)} + p_\theta(x)\log \frac{\frac{1}{2}p_\theta(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\Big]\,dx \\
&= \int p(x)\log \frac{p(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\,dx + \int p_\theta(x)\log \frac{p_\theta(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\,dx - \log 2 \int p(x)\,dx - \log 2 \int p_\theta(x)\,dx \\
&= D_{KL}\Big[p(x), \frac{p(x) + p_\theta(x)}{2}\Big] + D_{KL}\Big[p_\theta(x), \frac{p(x) + p_\theta(x)}{2}\Big] - \log 4
\end{aligned}
\]

The first two terms are actually twice the Jensen-Shannon Divergence (JSD), defined as:

\[ JS(p_1, p_2) = \frac{1}{2}\Big[D_{KL}\Big(p_1, \frac{p_1 + p_2}{2}\Big) + D_{KL}\Big(p_2, \frac{p_1 + p_2}{2}\Big)\Big] \]

Using JSD, L𝜃 is further simplified as:

\[ L_\theta = 2\,JS\big(p(x), p_\theta(x)\big) - \log 4 \]

Now it becomes clear that, given the optimal Dϕ, the generator G𝜃 is really minimizing the J-S divergence between p𝜃(x) and p(x).
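As a quick numeric sanity check of the algebra above, the snippet below plugs the optimal discriminator (3) into (2) for two made-up discrete distributions and confirms the result equals 2 JSD − log 4:

    import numpy as np

    p  = np.array([0.5, 0.3, 0.2])   # made-up stand-in for p(x)
    pt = np.array([0.2, 0.2, 0.6])   # made-up stand-in for p_theta(x)
    m  = 0.5 * (p + pt)

    kl = lambda a, b: np.sum(a * np.log(a / b))
    jsd = 0.5 * (kl(p, m) + kl(pt, m))

    d_star = p / (p + pt)  # the optimal discriminator from (3)
    L = np.sum(p * np.log(d_star) + pt * np.log(1.0 - d_star))
    print(np.isclose(L, 2.0 * jsd - np.log(4.0)))  # True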

The Jensen-Shannon divergence has a few nice properties as well: it is symmetric in its two arguments, always non-negative, bounded above by log 2, and equal to 0 exactly when the two distributions are identical.

1.2 Contrast Maximization

When Dϕ is trained reasonably well, meaning Dϕ(x) is close to 1 for real samples and close to 0 for artificial samples, L𝜃,ϕ can be shown to maximize the contrast of Dϕ(x) between real and artificial samples, as shown below.

\[ L_{\theta,\phi} = E_{x \sim p(x)} \log D_\phi(x) + E_{x \sim p_\theta(x)} \log\big(1 - D_\phi(x)\big) \approx E_{x \sim p(x)}\big(D_\phi(x) - 1\big) - E_{x \sim p_\theta(x)} D_\phi(x) \tag{4} \]
\[ = E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x) - 1 \]

(4) uses the first-order approximation of log d around d = 1 (log d ≈ d − 1) and of log(1 − d) around d = 0 (log(1 − d) ≈ −d). Of course, when Dϕ is not a very good discriminator yet, the approximation does not hold.
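A quick numeric look at the two approximations (the probed values of Dϕ(x) are arbitrary):

    import numpy as np

    # log d ~= d - 1 near d = 1 (real samples) ...
    for d in [0.999, 0.99, 0.9]:
        print(d, np.log(d), d - 1.0)
    # ... and log(1 - d) ~= -d near d = 0 (artificial samples)
    for d in [0.001, 0.01, 0.1]:
        print(d, np.log(1.0 - d), -d)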

2 f-GAN

too much math and too little practical impact to write about... will pick up later

3 WGAN

3.1 Problems with GAN

GAN is known to generate very impressive photorealistic images, but it also has a few well-documented drawbacks.

The first is a problem known as mode collapse. The discriminator Dϕ only cares about distinguishing real samples from artificial samples; it does not care about whether those artificial samples have broad coverage. For example, suppose the real samples include images of different animals such as cats, dogs, and horses. G𝜃 is happy to generate pictures of only dogs as long as these pictures become more real each time 𝜃 is updated. Dϕ is also perfectly happy with G𝜃's behavior as long as, each time ϕ is updated, Dϕ can tell real animal pictures from these artificial dog pictures a little better.

The second notorious problem with GAN is that it is difficult to train. Unlike other deep learning problems, whose loss functions decrease gradually until convergence during training, GAN's minimax objective offers no such guarantee. In practice, GAN's objective keeps oscillating during training, and deciding when to stop is often a manual process. Another obstacle to effective training has to do with Dϕ's final activation function, typically a sigmoid. When real samples and artificial samples are far apart, it is easy for Dϕ to distinguish them. In this case, the sigmoid's outputs are very close to 1 for real samples and 0 for artificial samples, and its derivative is very close to 0, a problem known as vanishing gradient. These near-zero gradients cannot provide effective back-propagation for G𝜃 to improve its sample-generating process.
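The snippet below illustrates the saturation: for a confidently classified sample, the sigmoid's output is essentially 1 while its gradient is essentially 0, so almost nothing propagates back to G𝜃 (the logit value is arbitrary).

    import torch

    logit = torch.tensor([10.0], requires_grad=True)  # a very confident D_phi
    prob = torch.sigmoid(logit)
    prob.backward()
    print(prob.item())        # ~0.99995
    print(logit.grad.item())  # ~4.5e-05: almost no gradient signal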

3.2 Intuition of WGAN

The inspiration for Wasserstein GAN (WGAN) comes from the contrast maximization described in Section 1.2 and sigmoid's vanishing-gradient problem described in Section 3.1. WGAN's objective is

\[ \min_\theta \max_\phi L_{\theta,\phi} = \min_\theta \max_\phi \big[E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x)\big] \]

Basically, Dϕ tries to maximize its output for real samples and minimize its output for artificial samples. G𝜃, on the other hand, tries to generate artificial samples that also get large outputs.

This objective looks just like the contrast maximization loss in Section 1.2, but there are two main differences: it is no longer an approximation valid only in a narrow range, and no sigmoid is used to compress Dϕ's output to between 0 and 1. Completely removing the constraint on Dϕ's output may, however, pose other problems: L𝜃,ϕ may grow rapidly out of bound, and Dϕ may still take the shape of a sigmoid-like function and stifle back-propagation. Ideally, we'd like Dϕ to behave roughly like a linear function of x. Given that deep neural networks are differentiable for almost all inputs, we could force the norm of Dϕ's gradient to be close to 1 everywhere:

\[ \big\|\nabla_x D_\phi(x)\big\| \approx 1 \]

This constraint can be added to L𝜃,ϕ as a penalty term:

\[ \big(\|\nabla_x D_\phi(x)\| - 1\big)^2 \]

This penalty term needs to be numerically computable. Averaging it over all possible values of x is out of the question, but one possibility is to average it over both the real samples and the artificial samples, as shown below:

\[ L_{\theta,\phi} = E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x) - \lambda E_{x \sim p(x)}\big(\|\nabla_x D_\phi(x)\| - 1\big)^2 - \lambda E_{x \sim p_\theta(x)}\big(\|\nabla_x D_\phi(x)\| - 1\big)^2 \]

where λ is the knob adjusting the relative importance between maximizing contrast and making Dϕ roughly a linear function.

A variant of the above formulation creates samples by randomly and linearly interpolating between real and artificial samples, and averages the penalty term over these samples instead. The final objective function, expressed in numerical computation form, is the following:

\[ L_{\theta,\phi} = \frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i) - \frac{\lambda}{N}\sum_i \Big(\big\|\nabla_x D_\phi\big[\epsilon_i x_i + (1 - \epsilon_i)\tilde{x}_i\big]\big\| - 1\Big)^2 \]

where 𝜀i ∼ U(0,1). GAN with this gradient-norm penalty is called Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP).

Both Dϕ and G𝜃 affect the penalty term. Though it is not mentioned in the original WGAN-GP paper, I don't think it is desirable to add the penalty term to G𝜃's loss function. WGAN-GP's objective should instead be

\[ \max_\phi \Big[\frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i) - \frac{\lambda}{N}\sum_i \Big(\big\|\nabla_x D_\phi\big[\epsilon_i x_i + (1 - \epsilon_i)\tilde{x}_i\big]\big\| - 1\Big)^2\Big] \qquad \min_\theta \Big[\frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i)\Big] \]
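To make this split concrete, below is a minimal PyTorch sketch of my reading of the objective, with illustrative toy networks (sizes and λ = 10 are assumptions); lam stands for λ, eps for 𝜀i, and the penalty enters only Dϕ's loss:

    import torch
    import torch.nn as nn

    data_dim, latent_dim, N = 2, 16, 64
    D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # no sigmoid
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

    def gradient_penalty(D, x_real, x_fake, lam=10.0):
        eps = torch.rand(x_real.size(0), 1)          # eps_i ~ U(0, 1)
        x_hat = eps * x_real + (1.0 - eps) * x_fake  # random interpolation
        x_hat.requires_grad_(True)
        grads, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
        return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

    x_real = torch.randn(N, data_dim) + 3.0          # stand-in for real samples
    x_fake = G(torch.randn(N, latent_dim))

    # D_phi: maximize the contrast, i.e. minimize its negation; the penalty
    # appears here and only here.
    d_loss = -(D(x_real).mean() - D(x_fake.detach()).mean()) \
             + gradient_penalty(D, x_real, x_fake.detach())
    # G_theta: minimize the contrast term, with no penalty attached.
    g_loss = -D(x_fake).mean()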

There are a couple more pieces of the underlying math worth noting; they are covered next.

3.3 Math of WGAN

3.3.1 Infimum and supremum

Most people are familiar with the concepts of maximum and minimum. Explicitly, if X is a (partially) ordered set and S a subset, then s̄ is the maximum of S iff s̄ ∈ S and s ≤ s̄ for all s ∈ S. Similarly, s̲ is the minimum of S iff s̲ ∈ S and s ≥ s̲ for all s ∈ S.

The supremum (sup) of S can be defined like this. Let T = {t ∈ X | s ≤ t ∀ s ∈ S}, the set of elements greater than or equal to every member of S. If T is empty, S's supremum does not exist; otherwise the supremum is the minimum of T. If S has a maximum, it must be the same as S's supremum, but even if S does not have a maximum, it may still have a supremum. Below are three examples comparing maximum and supremum.

  1. S = {x | x ≤ 2}: S's maximum is 2, and its supremum is 2 as well.
  2. S = {x | x < 2}: S does not have a maximum, but its supremum is 2.
  3. S = {x | x > 2}: S has neither a maximum nor a supremum.

Similar comparisons can be made between minimum and infimum. Informally, if one uses supremum and maximum interchangeably, little is lost. The same goes for infimum and minimum.

3.3.2 Wasserstein Distance

K-L divergence and J-S divergence are often used to measure the closeness of two distributions. In fact, VAE (amaires.github.io/VAE) uses K-L divergence for optimization, and GAN's G𝜃 minimizes the J-S divergence given an optimal Dϕ. Unfortunately, both measures are discontinuous when the two distributions have disjoint supports (the support of a function is the subset of the function domain not mapped to 0). For example, consider the two distributions defined below:

\[ p(x) = \begin{cases} 1 & x = 0 \\ 0 & x \ne 0 \end{cases} \qquad\text{and}\qquad p_\theta(x) = \begin{cases} 1 & x = \theta \\ 0 & x \ne \theta \end{cases} \]

It is not hard to figure out their K-L and J-S divergences:

\[ KL(p, p_\theta) = \begin{cases} 0 & \theta = 0 \\ +\infty & \theta \ne 0 \end{cases} \qquad\text{and}\qquad JS(p, p_\theta) = \begin{cases} 0 & \theta = 0 \\ \log 2 & \theta \ne 0 \end{cases} \]

Ideally, we'd like a measure that is smoother. The Wasserstein distance is exactly such a function, defined as:

\[ WS(p_1, p_2) = \inf_{\gamma \in \Pi(p_1, p_2)} E_{(x,y) \sim \gamma}\, \|x - y\|_1 \]

where Π(p1,p2) contains all joint distributions γ(x,y) such that p1(x) = ∫γ(x,y)dy and p2(y) = ∫γ(x,y)dx. Wasserstein distance is also called earth-mover distance: it informally captures the minimal amount of mass/dirt that needs to be moved to turn the shape of p1 into that of p2. Using this definition, the Wasserstein distance between p(x) and p𝜃(x) above can be calculated to be |𝜃|, which is a continuous function of 𝜃.
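The snippet below checks these claims numerically, assuming SciPy is available. It represents the two point masses as two-bin discrete distributions for the K-L and J-S computations, and uses scipy.stats.wasserstein_distance (the 1-D earth-mover distance between empirical samples) for WS:

    import numpy as np
    from scipy.stats import wasserstein_distance

    def kl(a, b):
        with np.errstate(divide='ignore', invalid='ignore'):
            return np.nansum(a * np.log(a / b))

    p, p_theta = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # masses at 0 and theta
    m = 0.5 * (p + p_theta)
    print(kl(p, p_theta))                      # inf, for any theta != 0
    print(0.5 * (kl(p, m) + kl(p_theta, m)))   # log 2 ~ 0.693, for any theta != 0
    for theta in [0.5, 1.0, 2.0]:
        print(wasserstein_distance([0.0], [theta]))  # |theta|: 0.5, 1.0, 2.0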

In general, however, Wasserstein distance is intractable to calculate. Fortunately, it has another definition, based on the Kantorovich-Rubinstein duality, that is easier to handle:

\[ WS(p_1, p_2) = \sup_{\|f\|_L \le 1} E_{x \sim p_1} f(x) - E_{x \sim p_2} f(x) \tag{5} \]

where f(⋅) is any real-valued function and ‖f‖L ≤ 1 means f's Lipschitz constant is at most 1. Technically, it means

\[ |f(x) - f(y)| \le \|x - y\|_1 \quad \forall x, y \]

Intuitively, it is equivalent to saying f(x)’s value should not change too much as x changes.

Wasserstein distance in the form of (5) captures the contrast maximization idea of Section 1.2 well. The Lipschitz continuity constraint can be approximated by the gradient penalty term introduced in Section 3.2.

3.3.3 GAN and WGAN

GAN's Dϕ is trained by maximizing the likelihood of correctly classifying the real samples x and artificial samples x̃. Given an optimal Dϕ, G𝜃 minimizes the J-S divergence between p𝜃(x) and p(x).

In WGAN, Dϕ maximizes the Wasserstein distance between the distributions of x and x̃, while G𝜃 tries to reduce it. Wasserstein distance is a smoother and more effective measure of closeness between two distributions, resulting in more stable training and less mode collapse in WGAN than in GAN.

4 Latent Representation and BiGAN

The training of Variational Autoencoders produces both a generator and an encoder; the latter is capable of extracting features, or latent representations, of data. GAN only has a generator G𝜃. It is conceivable that the pre-final layers of the discriminator Dϕ may be used for feature representations; the intuition is that Dϕ would have learned useful high-level representations of both real and artificial samples.

Bidirectional Generative Adversarial Networks (BiGAN), depicted in Figure 4, take a much more direct approach to latent representation. BiGAN introduces an encoder Eγ that maps real samples x to z̃ in the latent space. Dϕ works on the joint distribution of (x, z) and tries to maximize the contrast between real pairs (x, z̃) and artificial pairs (x̃, z). The loss function introduced in WGAN-GP can be used here. After training is done, new samples can be generated by G𝜃, and latent representations can be inferred via Eγ.

Figure 4: BiGAN architecture
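A minimal sketch of the joint discriminator idea (flat vectors and layer sizes are illustrative assumptions of mine, not details from the BiGAN paper): Dϕ simply scores concatenated (x, z) pairs.

    import torch
    import torch.nn as nn

    data_dim, latent_dim = 2, 16
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
    E = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    D = nn.Sequential(nn.Linear(data_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    x = torch.randn(8, data_dim)       # stand-in for real samples
    z = torch.randn(8, latent_dim)
    real_score = D(torch.cat([x, E(x)], dim=1))  # real pairs (x, z~)
    fake_score = D(torch.cat([G(z), z], dim=1))  # artificial pairs (x~, z)
    # real_score/fake_score can then be fed to, e.g., the WGAN-GP loss above.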

5 Image-to-image translation and CycleGAN

A GAN's generator maps samples drawn from N(0,I) to meaningful images. Can GAN map images in one domain to images in another domain, for example from summer pictures to winter pictures of the same place, from horse pictures to zebra pictures, or from photos to Van Gogh's paintings?

It is not hard to design a GAN for that purpose. For example, suppose our goal is to add black/white stripes to a horse to make it look like a zebra: we could replace z ∼ N(0,I) with a collection of horse pictures, while the real samples are drawn from zebra pictures. Unfortunately, this architecture does not ensure that the generated zebra picture looks much like the input horse picture. CycleGAN introduced two innovations to address this problem:

  1. Create two GANs. The first GAN translates horse pictures to zebra pictures, and the second GAN translates zebra pictures to horse pictures.
  2. Each generated zebra picture is then fed to the second GAN to be translated back into a horse picture, and this reconstructed horse picture should look similar to the original. A similar process applies in the reverse translation direction.

For simplicity, we’ll use the following notation.



Notation                    Meaning
x                           real samples from domain X
y                           real samples from domain Y
Dx                          discriminator for samples in X
Dy                          discriminator for samples in Y
Gxy                         generator that maps samples in X to samples in Y
Gyx                         generator that maps samples in Y to samples in X
LGAN(X,Y,Gxy,Dy)            GAN's loss function that involves Gxy and Dy
LGAN(Y,X,Gyx,Dx)            GAN's loss function that involves Gyx and Dx
CycleGAN’s optimization objective is

\[ L_{GAN}(X, Y, G_{xy}, D_y) + L_{GAN}(Y, X, G_{yx}, D_x) + \lambda E_x \big\|G_{yx}(G_{xy}(x)) - x\big\|_1 + \lambda E_y \big\|G_{xy}(G_{yx}(y)) - y\big\|_1 \]
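Here is a hedged sketch of the cycle-consistency part of this objective in PyTorch, assuming Gxy and Gyx are callables on image batches; averaging the per-element absolute error is a common way to realize the expected L1 terms:

    import torch

    def cycle_loss(Gxy, Gyx, x, y, lam=10.0):
        # lam * (E_x ||Gyx(Gxy(x)) - x||_1 + E_y ||Gxy(Gyx(y)) - y||_1)
        forward  = (Gyx(Gxy(x)) - x).abs().mean()
        backward = (Gxy(Gyx(y)) - y).abs().mean()
        return lam * (forward + backward)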