Generative Adversarial Networks

Amaires@June 2024

Generative Adversarial Networks (GANs) are an approach to generative artificial intelligence. They were the first known models to produce new photorealistic images automatically.

1 Original GAN

With data samples that follow a certain unknown distribution p(x), the idea is to have a neural network generator G𝜃, parameterized by 𝜃, which transforms samples z ∼ N(0, I) into samples x̃ that follow p(x). Figure 1 depicts the architecture.

Figure 1: Generator G𝜃

In order to train G𝜃, some optimization objective is needed to give G𝜃 feedback about whether x̃ really looks like it is drawn from p(x). A binary discriminator/classifier Dϕ that returns the probability that x̃ follows p(x) serves the purpose. Figure 2 shows the architecture we have so far.

Figure 2: Generator G𝜃 and Discriminator Dϕ

The introduction of Dϕ merely defers the responsibility of guiding G𝜃: how do we train Dϕ? We could feed it both real samples (x, y) = (x, 1) and artificial samples (x, y) = (x̃, 0) created by G𝜃. Figure 3 shows the complete GAN architecture.

Figure 3: The Complete GAN Architecture

The loss function of Dϕ is the common binary cross entropy function, shown below (see amaires.github.io/OptimizationObjective/ for a refresher):

\[ L = \min_\phi\, -E_{x,y \sim p(x,y)}\big(y \log D_\phi(x) + (1 - y)\log(1 - D_\phi(x))\big) \tag{1} \]

Remove the negation in (1) and rewrite it in conditional expectation form:

\[ L = \max_\phi E_{y \sim p(y)} E_{x \sim p(x|y)}\big(y \log D_\phi(x) + (1 - y)\log(1 - D_\phi(x))\big) = \max_\phi \big[\Pr(y=1)\, E_{x \sim p(x|y=1)} \log D_\phi(x) + \Pr(y=0)\, E_{x \sim p(x|y=0)} \log(1 - D_\phi(x))\big] \]

If the same number of real samples x and artificial samples x̃ are fed to Dϕ in each batch/mini-batch, then Pr(y = 1) = Pr(y = 0) = 1/2. Also note that x ∼ p(x|y = 1) is the same as x ∼ p(x), and x ∼ p(x|y = 0) is the same as x̃ ∼ p𝜃(x̃|z), so L can be further written:

\[ L = \max_\phi \Big[\tfrac{1}{2} E_{x \sim p(x)} \log D_\phi(x) + \tfrac{1}{2} E_{\tilde{x} \sim p_\theta(\tilde{x})} \log\big(1 - D_\phi(\tilde{x})\big)\Big] = \max_\phi \Big[\tfrac{1}{2} E_{x \sim p(x)} \log D_\phi(x) + \tfrac{1}{2} E_{z \sim \mathcal{N}(0,I)} E_{\tilde{x} \sim p_\theta(\tilde{x}|z)} \log\big(1 - D_\phi(\tilde{x})\big)\Big] \]

After removing the constant 1/2, and noting that for every sample z only one sample x̃ is drawn (which is exactly what G𝜃 does), L becomes

\[ L = \max_\phi \Big[E_{x \sim p(x)} \log D_\phi(x) + E_{z \sim \mathcal{N}(0,I)} \log\big(1 - D_\phi(G_\theta(z))\big)\Big] \]

Note that the objective is parameterized by both 𝜃 and ϕ, but it is maximized only in terms of ϕ, the parameters of the discriminator. It is also worth noting that Dϕ gives the probability that a sample follows p(x), not p(x) itself. Similarly, G𝜃 generates a sample x̃; it does not compute p𝜃(x̃).

The process so far optimizes ϕ to make Dϕ better at telling real samples apart from samples created by a fixed G𝜃. For each batch/mini-batch of samples, ϕ takes a gradient ascent step. This process does not seem to improve G𝜃 whatsoever. If 𝜃 also took a gradient ascent step using ∇𝜃L, it too would make Dϕ better; this amounts to making samples created by G𝜃 more obviously fake, the opposite of what we want. The solution to this last piece of the GAN puzzle is to update 𝜃 with a gradient descent step. More formally, our objective function is

\[ L = \min_\theta \max_\phi \Big[E_{x \sim p(x)} \log D_\phi(x) + E_{z \sim \mathcal{N}(0,I)} \log\big(1 - D_\phi(G_\theta(z))\big)\Big] \]
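To make the alternating updates concrete, here is a minimal PyTorch sketch of this minimax training loop. It is an illustration rather than reference code: the toy data distribution, network sizes, and learning rates are all assumptions of mine.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 16, 2
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def real_batch(n):
        # hypothetical stand-in for samples drawn from the unknown p(x)
        return torch.randn(n, data_dim) + 3.0

    for step in range(1000):
        x, z = real_batch(128), torch.randn(128, latent_dim)
        x_fake = G(z)

        # D_phi: a gradient ascent step on the objective, i.e. a descent step
        # on binary cross entropy with y = 1 for real and y = 0 for artificial.
        d_loss = bce(D(x), torch.ones(128, 1)) + bce(D(x_fake.detach()), torch.zeros(128, 1))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # G_theta: a gradient descent step on E_z log(1 - D(G(z))).
        g_loss = torch.log(1.0 - D(G(z)) + 1e-8).mean()
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()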

Intuitively, Dϕ tries to tell real samples apart from artificial samples, and G𝜃 tries to create artificial samples that are hard to distinguish from real ones, hence the name generative adversarial network.

In GAN, p(x) and p𝜃(x) are never explicitly modeled. G𝜃 can produce good samples following p𝜃(x), but it does not know the form of p𝜃(x). In other words, G𝜃 defines a sampling process without knowing its distribution.

1.1 Jensen-Shannon Divergence

For a given G𝜃, Dϕ’s objective is to maximize

\[ L_{\theta,\phi} = E_{x \sim p(x)} \log D_\phi(x) + E_{x \sim p_\theta(x)} \log\big(1 - D_\phi(x)\big) = \int \big[p(x)\log D_\phi(x) + p_\theta(x)\log(1 - D_\phi(x))\big]\,dx \tag{2} \]

Let l𝜃,ϕ(x) be the function under the integral in (2). Assuming Dϕ is flexible and powerful enough, when L𝜃,ϕ is maximized over ϕ, l𝜃,ϕ(x) is also maximized pointwise. Taking l𝜃,ϕ(x)'s derivative with respect to Dϕ(x), we have

\[ \frac{dl_{\theta,\phi}(x)}{dD_\phi(x)} = \frac{p(x)}{D_\phi(x)} - \frac{p_\theta(x)}{1 - D_\phi(x)} \]

Setting the expression to 0, we derive the optimal Dϕ(x):

\[ D_\phi(x) = \frac{p(x)}{p(x) + p_\theta(x)} \tag{3} \]

Substituting (3) into (2), we have:

\[
\begin{aligned}
L_\theta = L_{\theta,\phi} &= \int \big[p(x)\log D_\phi(x) + p_\theta(x)\log(1 - D_\phi(x))\big]\,dx \\
&= \int \Big[p(x)\log \frac{p(x)}{p(x) + p_\theta(x)} + p_\theta(x)\log \frac{p_\theta(x)}{p(x) + p_\theta(x)}\Big]\,dx \\
&= \int \Big[p(x)\log \frac{\frac{1}{2}p(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)} + p_\theta(x)\log \frac{\frac{1}{2}p_\theta(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\Big]\,dx \\
&= \int p(x)\log \frac{p(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\,dx + \int p_\theta(x)\log \frac{p_\theta(x)}{\frac{1}{2}\big(p(x) + p_\theta(x)\big)}\,dx - \log 2 \int p(x)\,dx - \log 2 \int p_\theta(x)\,dx \\
&= D_{KL}\Big[p(x), \frac{p(x) + p_\theta(x)}{2}\Big] + D_{KL}\Big[p_\theta(x), \frac{p(x) + p_\theta(x)}{2}\Big] - \log 4
\end{aligned}
\]

The first two terms are actually twice the Jensen-Shannon Divergence (JSD), defined as:

\[ JS(p_1, p_2) = \frac{1}{2}\Big[D_{KL}\Big(p_1, \frac{p_1 + p_2}{2}\Big) + D_{KL}\Big(p_2, \frac{p_1 + p_2}{2}\Big)\Big] \]

Using JSD, L𝜃 is further simplified as:

\[ L_\theta = 2\,JS\big(p(x), p_\theta(x)\big) - \log 4 \]

Now it becomes clear that, given the optimal Dϕ, the generator G𝜃 is really minimizing the J-S divergence between p𝜃(x) and p(x).
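As a quick numeric sanity check of the algebra above, the snippet below plugs the optimal discriminator (3) into (2) for two made-up discrete distributions and confirms the result equals 2 JSD − log 4:

    import numpy as np

    p  = np.array([0.5, 0.3, 0.2])   # made-up stand-in for p(x)
    pt = np.array([0.2, 0.2, 0.6])   # made-up stand-in for p_theta(x)
    m  = 0.5 * (p + pt)

    kl = lambda a, b: np.sum(a * np.log(a / b))
    jsd = 0.5 * (kl(p, m) + kl(pt, m))

    d_star = p / (p + pt)  # the optimal discriminator from (3)
    L = np.sum(p * np.log(d_star) + pt * np.log(1.0 - d_star))
    print(np.isclose(L, 2.0 * jsd - np.log(4.0)))  # True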

The Jensen-Shannon divergence has a few nice properties as well: it is symmetric in its two arguments, always non-negative, bounded above by log 2, and equal to 0 exactly when the two distributions are identical.

1.2 Contrast Maximization

When Dϕ is trained reasonably well, meaning Dϕ(x) is close to 1 for real samples and close to 0 for artificial samples, L𝜃,ϕ can be shown to maximize the contrast of Dϕ(x) between real and artificial samples, as shown below.

\[ L_{\theta,\phi} = E_{x \sim p(x)} \log D_\phi(x) + E_{x \sim p_\theta(x)} \log\big(1 - D_\phi(x)\big) \approx E_{x \sim p(x)}\big(D_\phi(x) - 1\big) - E_{x \sim p_\theta(x)} D_\phi(x) \tag{4} \]
\[ = E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x) - 1 \]

(4) uses the first-order approximation of log d around d = 1 (log d ≈ d − 1) and of log(1 − d) around d = 0 (log(1 − d) ≈ −d). Of course, when Dϕ is not a very good discriminator yet, the approximation does not hold.
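A quick numeric look at the two approximations (the probed values of Dϕ(x) are arbitrary):

    import numpy as np

    # log d ~= d - 1 near d = 1 (real samples) ...
    for d in [0.999, 0.99, 0.9]:
        print(d, np.log(d), d - 1.0)
    # ... and log(1 - d) ~= -d near d = 0 (artificial samples)
    for d in [0.001, 0.01, 0.1]:
        print(d, np.log(1.0 - d), -d)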

2 f-GAN

too much math and too little practical impact to write about... will pick up later

3 WGAN

3.1 Problems with GAN

GAN is known to generate very impressive photorealistic images, but it also has a few well-documented drawbacks.

The first is a problem known as mode collapse. The discriminator Dϕ only cares about distinguishing real samples from artificial samples; it does not care about whether those artificial samples have broad coverage. For example, suppose the real samples include images of different animals such as cats, dogs, and horses. G𝜃 is happy to generate pictures of only dogs as long as these pictures become more real each time 𝜃 is updated. Dϕ is also perfectly happy with G𝜃's behavior as long as, each time ϕ is updated, Dϕ can tell real animal pictures from these artificial dog pictures a little better.

The second notorious problem with GAN is that it is difficult to train. Unlike other deep learning problems, whose loss functions decrease gradually until convergence during training, GAN's minimax objective offers no such guarantee. In practice, GAN's objective keeps oscillating during training, and deciding when to stop is often a manual process. Another obstacle to effective training has to do with Dϕ's final activation function, typically a sigmoid. When real samples and artificial samples are far apart, it is easy for Dϕ to distinguish them. In this case, the sigmoid's outputs are very close to 1 for real samples and 0 for artificial samples, and its derivative is very close to 0, a problem known as vanishing gradient. These near-zero gradients cannot provide effective back-propagation for G𝜃 to improve its sample-generating process.
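The snippet below illustrates the saturation: for a confidently classified sample, the sigmoid's output is essentially 1 while its gradient is essentially 0, so almost nothing propagates back to G𝜃 (the logit value is arbitrary).

    import torch

    logit = torch.tensor([10.0], requires_grad=True)  # a very confident D_phi
    prob = torch.sigmoid(logit)
    prob.backward()
    print(prob.item())        # ~0.99995
    print(logit.grad.item())  # ~4.5e-05: almost no gradient signal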

3.2 Intuition of WGAN

The inspiration for Wasserstein GAN (WGAN) comes from the contrast maximization described in Section 1.2 and sigmoid's vanishing-gradient problem described in Section 3.1. WGAN's objective is

\[ \min_\theta \max_\phi L_{\theta,\phi} = \min_\theta \max_\phi \big[E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x)\big] \]

Basically, Dϕ tries to maximize its output for real samples and minimize its output for artificial samples. G𝜃, on the other hand, tries to generate artificial samples that also get large outputs.

This objective looks just like the contrast maximization loss in Section 1.2, but there are two main differences: it is no longer an approximation valid only in a narrow range, and no sigmoid is used to compress Dϕ's output to between 0 and 1. Completely removing the constraint on Dϕ's output may, however, pose other problems: L𝜃,ϕ may grow rapidly out of bound, and Dϕ may still take the shape of a sigmoid-like function and stifle back-propagation. Ideally, we'd like Dϕ to behave roughly like a linear function of x. Given that deep neural networks are differentiable for almost all inputs, we could force the norm of Dϕ's gradient to be close to 1 everywhere:

\[ \big\|\nabla_x D_\phi(x)\big\| \approx 1 \]

This constraint can be added to L𝜃,ϕ as a penalty term:

\[ \big(\|\nabla_x D_\phi(x)\| - 1\big)^2 \]

This penalty term needs to be numerically computable. Averaging it over all possible values of x is out of the question, but one possibility is to average it over both the real samples and the artificial samples, as shown below:

\[ L_{\theta,\phi} = E_{x \sim p(x)} D_\phi(x) - E_{x \sim p_\theta(x)} D_\phi(x) - \lambda E_{x \sim p(x)}\big(\|\nabla_x D_\phi(x)\| - 1\big)^2 - \lambda E_{x \sim p_\theta(x)}\big(\|\nabla_x D_\phi(x)\| - 1\big)^2 \]

where λ is the knob adjusting the relative importance between maximizing contrast and making Dϕ roughly a linear function.

A variant of the above formulation creates samples by randomly and linearly interpolating between real and artificial samples, and averages the penalty term over these samples instead. The final objective function, expressed in numerical computation form, is the following:

\[ L_{\theta,\phi} = \frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i) - \frac{\lambda}{N}\sum_i \Big(\big\|\nabla_x D_\phi\big[\epsilon_i x_i + (1 - \epsilon_i)\tilde{x}_i\big]\big\| - 1\Big)^2 \]

where 𝜀i ∼ U(0,1). GAN with this gradient-norm penalty is called Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP).

Both Dϕ and G𝜃 affect the penalty term. Though it is not mentioned in the original WGAN-GP paper, I don't think it is desirable to add the penalty term to G𝜃's loss function. WGAN-GP's objective should instead be

\[ \max_\phi \Big[\frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i) - \frac{\lambda}{N}\sum_i \Big(\big\|\nabla_x D_\phi\big[\epsilon_i x_i + (1 - \epsilon_i)\tilde{x}_i\big]\big\| - 1\Big)^2\Big] \qquad \min_\theta \Big[\frac{1}{N}\sum_i D_\phi(x_i) - \frac{1}{N}\sum_i D_\phi(\tilde{x}_i)\Big] \]
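To make this split concrete, below is a minimal PyTorch sketch of my reading of the objective, with illustrative toy networks (sizes and λ = 10 are assumptions); lam stands for λ, eps for 𝜀i, and the penalty enters only Dϕ's loss:

    import torch
    import torch.nn as nn

    data_dim, latent_dim, N = 2, 16, 64
    D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # no sigmoid
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

    def gradient_penalty(D, x_real, x_fake, lam=10.0):
        eps = torch.rand(x_real.size(0), 1)          # eps_i ~ U(0, 1)
        x_hat = eps * x_real + (1.0 - eps) * x_fake  # random interpolation
        x_hat.requires_grad_(True)
        grads, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
        return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

    x_real = torch.randn(N, data_dim) + 3.0          # stand-in for real samples
    x_fake = G(torch.randn(N, latent_dim))

    # D_phi: maximize the contrast, i.e. minimize its negation; the penalty
    # appears here and only here.
    d_loss = -(D(x_real).mean() - D(x_fake.detach()).mean()) \
             + gradient_penalty(D, x_real, x_fake.detach())
    # G_theta: minimize the contrast term, with no penalty attached.
    g_loss = -D(x_fake).mean()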

There are a couple more pieces of the underlying math worth noting; they are covered next.

3.3 Math of WGAN

3.3.1 Infimum and supremum

Most people are familiar with the concepts of maximum and minimum. Explicitly, if X is a (partially) ordered set and S a subset, then s̄ is the maximum of S iff s̄ ∈ S and s ≤ s̄ for all s ∈ S. Similarly, s̲ is the minimum of S iff s̲ ∈ S and s ≥ s̲ for all s ∈ S.

The supremum (sup) of S can be defined like this. Let T = {t ∈ X | s ≤ t ∀ s ∈ S}, the set of elements greater than or equal to every member of S. If T is empty, S's supremum does not exist; otherwise the supremum is the minimum of T. If S has a maximum, it must be the same as S's supremum, but even if S does not have a maximum, it may still have a supremum. Below are three examples comparing maximum and supremum.

  1. S = {x | x ≤ 2}: S's maximum is 2, and its supremum is 2 as well.
  2. S = {x | x < 2}: S does not have a maximum, but its supremum is 2.
  3. S = {x | x > 2}: S has neither a maximum nor a supremum.

Similar comparisons can be made between minimum and infimum. Informally, if one uses supremum and maximum interchangeably, little is lost. The same goes for infimum and minimum.

3.3.2 Wasserstein Distance

K-L divergence and J-S divergence are often used to measure the closeness of two distributions. In fact, VAE (amaires.github.io/VAE) uses K-L divergence for optimization, and GAN's G𝜃 minimizes the J-S divergence given an optimal Dϕ. Unfortunately, both measures are discontinuous when the two distributions have disjoint supports (the support of a function is the subset of the function domain not mapped to 0). For example, consider the two distributions defined below:

\[ p(x) = \begin{cases} 1 & x = 0 \\ 0 & x \ne 0 \end{cases} \qquad\text{and}\qquad p_\theta(x) = \begin{cases} 1 & x = \theta \\ 0 & x \ne \theta \end{cases} \]

It is not hard to figure out their K-L and J-S divergences:

\[ KL(p, p_\theta) = \begin{cases} 0 & \theta = 0 \\ +\infty & \theta \ne 0 \end{cases} \qquad\text{and}\qquad JS(p, p_\theta) = \begin{cases} 0 & \theta = 0 \\ \log 2 & \theta \ne 0 \end{cases} \]

Ideally, we'd like a measure that is smoother. The Wasserstein distance is exactly such a function, defined as:

\[ WS(p_1, p_2) = \inf_{\gamma \in \Pi(p_1, p_2)} E_{(x,y) \sim \gamma}\, \|x - y\|_1 \]

where Π(p1,p2) contains all joint distributions γ(x,y) such that p1(x) = ∫γ(x,y)dy and p2(y) = ∫γ(x,y)dx. Wasserstein distance is also called earth-mover distance: it informally captures the minimal amount of mass/dirt that needs to be moved to turn the shape of p1 into that of p2. Using this definition, the Wasserstein distance between p(x) and p𝜃(x) above can be calculated to be |𝜃|, which is a continuous function of 𝜃.
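The snippet below checks these claims numerically, assuming SciPy is available. It represents the two point masses as two-bin discrete distributions for the K-L and J-S computations, and uses scipy.stats.wasserstein_distance (the 1-D earth-mover distance between empirical samples) for WS:

    import numpy as np
    from scipy.stats import wasserstein_distance

    def kl(a, b):
        with np.errstate(divide='ignore', invalid='ignore'):
            return np.nansum(a * np.log(a / b))

    p, p_theta = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # masses at 0 and theta
    m = 0.5 * (p + p_theta)
    print(kl(p, p_theta))                      # inf, for any theta != 0
    print(0.5 * (kl(p, m) + kl(p_theta, m)))   # log 2 ~ 0.693, for any theta != 0
    for theta in [0.5, 1.0, 2.0]:
        print(wasserstein_distance([0.0], [theta]))  # |theta|: 0.5, 1.0, 2.0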

In general, however, Wasserstein distance is intractable to calculate. Fortunately, it has another definition, based on the Kantorovich-Rubinstein duality, that is easier to handle:

\[ WS(p_1, p_2) = \sup_{\|f\|_L \le 1} E_{x \sim p_1} f(x) - E_{x \sim p_2} f(x) \tag{5} \]

where f(⋅) is any real-valued function and ‖f‖L ≤ 1 means f's Lipschitz constant is at most 1. Technically, it means

\[ |f(x) - f(y)| \le \|x - y\|_1 \quad \forall x, y \]

Intuitively, it is equivalent to saying f(x)’s value should not change too much as x changes.

Wasserstein distance in the form of (5) captures the contrast maximization idea of Section 1.2 well. The Lipschitz continuity constraint can be approximated by the gradient penalty term introduced in Section 3.2.

3.3.3 GAN and WGAN

GAN's Dϕ is trained by maximizing the likelihood of correctly classifying the real samples x and artificial samples x̃. Given an optimal Dϕ, G𝜃 minimizes the J-S divergence between p𝜃(x) and p(x).

In WGAN, Dϕ maximizes the Wasserstein distance between the distributions of x and x̃, while G𝜃 tries to reduce it. Wasserstein distance is a smoother and more effective measure of closeness between two distributions, resulting in more stable training and less mode collapse in WGAN than in GAN.

4 Latent Representation and BiGAN

The training of Variational Autoencoders produces both a generator and an encoder; the latter is capable of extracting features, or latent representations, of data. GAN only has a generator G𝜃. It is conceivable that the pre-final layers of the discriminator Dϕ may be used for feature representations; the intuition is that Dϕ would have learned useful high-level representations of both real and artificial samples.

Bidirectional Generative Adversarial Networks (BiGAN), depicted in Figure 4, take a much more direct approach to latent representation. BiGAN introduces an encoder Eγ that maps real samples x to z̃ in the latent space. Dϕ works on the joint distribution of (x, z) and tries to maximize the contrast between real pairs (x, z̃) and artificial pairs (x̃, z). The loss function introduced in WGAN-GP can be used here. After training is done, new samples can be generated by G𝜃, and latent representations can be inferred via Eγ.

Figure 4: BiGAN architecture
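A minimal sketch of the joint discriminator idea (flat vectors and layer sizes are illustrative assumptions of mine, not details from the BiGAN paper): Dϕ simply scores concatenated (x, z) pairs.

    import torch
    import torch.nn as nn

    data_dim, latent_dim = 2, 16
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
    E = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    D = nn.Sequential(nn.Linear(data_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    x = torch.randn(8, data_dim)       # stand-in for real samples
    z = torch.randn(8, latent_dim)
    real_score = D(torch.cat([x, E(x)], dim=1))  # real pairs (x, z~)
    fake_score = D(torch.cat([G(z), z], dim=1))  # artificial pairs (x~, z)
    # real_score/fake_score can then be fed to, e.g., the WGAN-GP loss above.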

5 Image-to-image translation and CycleGAN

A GAN's generator maps samples drawn from N(0,I) to meaningful images. Can GAN map images in one domain to images in another domain, for example from summer pictures to winter pictures of the same place, from horse pictures to zebra pictures, or from photos to Van Gogh's paintings?

It is not hard to design a GAN for that purpose. For example, suppose our goal is to add black/white stripes to a horse to make it look like a zebra: we could replace z ∼ N(0,I) with a collection of horse pictures, while the real samples are drawn from zebra pictures. Unfortunately, this architecture does not ensure that the generated zebra picture looks much like the input horse picture. CycleGAN introduced two innovations to address this problem:

  1. Create two GANs. The first GAN translates horse pictures to zebra pictures, and the second GAN translates zebra pictures to horse pictures.
  2. Each generated zebra picture is then fed to the second GAN to be translated back into a horse picture, and this reconstructed horse picture should look similar to the original. A similar process applies in the reverse translation direction.

For simplicity, we’ll use the following notation.



Notation                    Meaning
x                           real samples from domain X
y                           real samples from domain Y
Dx                          discriminator for samples in X
Dy                          discriminator for samples in Y
Gxy                         generator that maps samples in X to samples in Y
Gyx                         generator that maps samples in Y to samples in X
LGAN(X,Y,Gxy,Dy)            GAN's loss function that involves Gxy and Dy
LGAN(Y,X,Gyx,Dx)            GAN's loss function that involves Gyx and Dx
CycleGAN’s optimization objective is

\[ L_{GAN}(X, Y, G_{xy}, D_y) + L_{GAN}(Y, X, G_{yx}, D_x) + \lambda E_x \big\|G_{yx}(G_{xy}(x)) - x\big\|_1 + \lambda E_y \big\|G_{xy}(G_{yx}(y)) - y\big\|_1 \]
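Here is a hedged sketch of the cycle-consistency part of this objective in PyTorch, assuming Gxy and Gyx are callables on image batches; averaging the per-element absolute error is a common way to realize the expected L1 terms:

    import torch

    def cycle_loss(Gxy, Gyx, x, y, lam=10.0):
        # lam * (E_x ||Gyx(Gxy(x)) - x||_1 + E_y ||Gxy(Gyx(y)) - y||_1)
        forward  = (Gyx(Gxy(x)) - x).abs().mean()
        backward = (Gxy(Gyx(y)) - y).abs().mean()
        return lam * (forward + backward)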