With data samples $x$ that follow a certain unknown distribution $p_{data}$, the idea is to have a neural network generator $G$, parameterized by $\theta_g$, which transforms samples $z \sim p_z$ from a simple known distribution into $G(z)$ that follows $p_{data}$. Figure 1 depicts the architecture.
In order to train $G$, some optimization objective is needed to give feedback about whether $G(z)$ really looks like it is drawn from $p_{data}$. A binary discriminator/classifier $D$ that returns the probability that a sample follows $p_{data}$ would serve the purpose. Figure 2 shows the architecture we have so far.
The introduction of $D$ merely defers the responsibility of guiding $G$. How do we train $D$? We could feed it both real samples $x \sim p_{data}$ and artificial samples $G(z)$ created by $G$. Figure 3 shows the complete GAN architecture.
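To make the two boxes concrete, here is a minimal sketch of $G$ and $D$ as fully-connected PyTorch modules. The latent dimension, layer sizes, and flattened 28×28 sample shape are illustrative assumptions, not taken from the figures:

```python
import torch.nn as nn

LATENT_DIM = 100    # dimension of z ~ p_z (assumed)
DATA_DIM = 28 * 28  # dimension of a flattened sample x (assumed)

class Generator(nn.Module):
    """Maps a latent sample z to an artificial sample G(z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, DATA_DIM), nn.Tanh(),  # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Returns the probability that a sample x follows p_data."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),  # probability in (0, 1)
        )

    def forward(self, x):
        return self.net(x)
```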
The loss function of $D$ is the common binary cross entropy function, shown below (see amaires.github.io/OptimizationObjective/ for a refresher):
$$L_D = -\frac{1}{N}\sum_{i=1}^{N}\Big(y_i \log D(x_i) + (1 - y_i)\log\big(1 - D(x_i)\big)\Big) \tag{1}$$

where $y_i = 1$ labels a real sample and $y_i = 0$ an artificial one.
Remove the negation in (1) and rewrite it in conditional expectation form:

$$P(y=1)\,\mathbb{E}\big[\log D(x) \mid y=1\big] + P(y=0)\,\mathbb{E}\big[\log(1 - D(x)) \mid y=0\big]$$
If the same number of real samples and artificial samples are fed to $D$ in each batch/mini-batch, $P(y=1) = P(y=0) = \frac{1}{2}$. Also note that $\mathbb{E}[\log D(x) \mid y=1]$ is the same as $\mathbb{E}_{x \sim p_{data}}[\log D(x)]$, and $\mathbb{E}[\log(1 - D(x)) \mid y=0]$ is the same as $\mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$, so the objective can be further written as:

$$\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \frac{1}{2}\,\mathbb{E}_{x \sim p_g}\big[\log(1 - D(x))\big]$$
After removing the constant $\frac{1}{2}$, if for every sample $x \sim p_g$, only one sample $z$ is drawn from $p_z$, which is what $G$ does, $\mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$ becomes $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$, and the objective becomes

$$V(D, G) = \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$
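Continuing the sketch above, a quick hypothetical check confirms that binary cross entropy over a balanced batch is exactly the negated, halved sampled version of the two expectation terms:

```python
import torch
import torch.nn.functional as F

G, D = Generator(), Discriminator()
x_real = torch.rand(64, DATA_DIM) * 2 - 1         # stand-in real samples
x_fake = G(torch.randn(64, LATENT_DIM)).detach()  # artificial samples G(z)

d_real, d_fake = D(x_real), D(x_fake)
bce = F.binary_cross_entropy(
    torch.cat([d_real, d_fake]),
    torch.cat([torch.ones_like(d_real), torch.zeros_like(d_fake)]),
)
# the conditional-expectation form, sampled over the same balanced batch
v = 0.5 * torch.log(d_real).mean() + 0.5 * torch.log(1 - d_fake).mean()
assert torch.allclose(bce, -v, atol=1e-5)
```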
Note that the objective $V(D, G)$ is parameterized by both $\theta_d$ and $\theta_g$, but it is maximized only in terms of $\theta_d$, the parameters of the discriminator. It is also worth noting that $D(x)$ gives the probability that a sample $x$ follows $p_{data}$, but not $p_{data}(x)$ itself. Also, $G$ generates a sample $G(z)$, not $p_g$.
The process so far optimizes $\theta_d$ to make $D$ better at telling real samples apart from samples created by a fixed $G$. For each batch/mini-batch of samples, $\theta_d$ takes a gradient ascent step. This process does not seem to improve $G$ whatsoever. If $\theta_g$ also takes a gradient ascent step on $V$, it also makes $D$ better; it is probably equivalent to making samples created by $G$ more obviously fake, the opposite of what we want. The solution to this last piece of the GAN puzzle is to update $\theta_g$ with a gradient descent step. More formally, our objective function is

$$\min_{G}\,\max_{D}\; V(D, G) = \min_{G}\,\max_{D}\; \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$
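A minimal sketch of one alternating update, reusing the hypothetical modules above. The ascent step on $\theta_d$ is implemented, as is conventional, as a descent step on the negated objective; the small epsilon inside the logs is a numerical-stability assumption:

```python
import torch

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
EPS = 1e-8  # keeps log() finite when D saturates

def gan_step(x_real):
    # gradient ascent for theta_d: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(x_real.size(0), LATENT_DIM)
    x_fake = G(z).detach()  # freeze G while updating D
    d_loss = -(torch.log(D(x_real) + EPS).mean()
               + torch.log(1 - D(x_fake) + EPS).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # gradient descent for theta_g: minimize log(1 - D(G(z)))
    z = torch.randn(x_real.size(0), LATENT_DIM)
    g_loss = torch.log(1 - D(G(z)) + EPS).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```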
Intuitively, $D$ tries to tell real samples apart from artificial samples, and $G$ tries to create artificial samples that are hard to distinguish from real samples, hence the name generative adversarial network.
In GAN, $p_{data}$ and $p_g$ are never explicitly modeled. $G$ can produce good samples following $p_{data}$, but it does not know the form of $p_{data}$. In other words, $G$ defines a sampling process without knowing its distribution.
For a given $G$, $D$'s objective is to maximize

$$V(D) = \int_x \Big(p_{data}(x)\log D(x) + p_g(x)\log\big(1 - D(x)\big)\Big)\,dx \tag{2}$$
Let $f\big(D(x)\big) = p_{data}(x)\log D(x) + p_g(x)\log\big(1 - D(x)\big)$ be the function under the integral in (2). Assuming $D$ is flexible and powerful enough, when $V(D)$ is maximized with respect to $\theta_d$, $f$ is also maximized everywhere. Taking $f$'s derivative against $D(x)$, we have

$$\frac{\partial f}{\partial D(x)} = \frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)}$$
Setting the expression to 0, you derive the optimal discriminator $D^*$:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{3}$$
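One can sanity-check (3) numerically: at a single point $x$, the maximizer of $f(t) = p_{data}(x)\log t + p_g(x)\log(1-t)$ over $t \in (0,1)$ should equal $p_{data}(x)/(p_{data}(x)+p_g(x))$. A quick grid search with illustrative density values:

```python
import numpy as np

p_data, p_g = 0.7, 0.2  # illustrative density values at one point x
t = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
f = p_data * np.log(t) + p_g * np.log(1 - t)
print(t[f.argmax()])            # ~0.77778, found by brute force
print(p_data / (p_data + p_g))  # 0.77777..., predicted by equation (3)
```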
Substituting $D(x)$ with (3) in (2), we have:

$$V(D^*) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right] - \log 4$$
The first two terms are actually twice the Jensen-Shannon Divergence (JSD), defined as:

$$\mathrm{JSD}(p \,\|\, q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p + q}{2}\right)$$
Using JSD, $V(D^*)$ is further simplified as:

$$V(D^*) = 2\,\mathrm{JSD}\big(p_{data} \,\|\, p_g\big) - \log 4$$
Now it becomes clear that the generator $G$ is really minimizing the J-S divergence between $p_{data}$ and $p_g$ given the optimal $D^*$.
The Jensen-Shannon Divergence has a few useful properties as well: it is symmetric, always non-negative, and equals 0 if and only if the two distributions are identical.
When $D$ is trained reasonably well, which means $D(x)$ is close to 1 for real samples and close to 0 for artificial samples, $V$ can be shown to maximize the contrast of $D$'s outputs between real and artificial samples, as shown below:

$$V \approx \mathbb{E}_{x \sim p_{data}}\big[D(x)\big] - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big] - 1 \tag{4}$$

(4) uses the first derivative of $\log t$ at $t = 1$ for the approximation $\log t \approx t - 1$: for real samples $D(x) \approx 1$, so $\log D(x) \approx D(x) - 1$; for artificial samples $1 - D(G(z)) \approx 1$, so $\log(1 - D(G(z))) \approx -D(G(z))$. Of course, when $D$ is not a very good discriminator yet, the above does not hold.
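The quality of the approximation $\log t \approx t - 1$ is easy to eyeball; it is tight near $t = 1$ and degrades as $t$ moves away:

```python
import numpy as np

for t in [0.999, 0.99, 0.9, 0.5]:
    print(f"t={t}: log(t)={np.log(t):+.4f}  t-1={t - 1:+.4f}")
# the two columns agree closely near t = 1 and drift apart by t = 0.5
```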
GAN is known to generate very impressive photorealistic images, but it also has a few well-documented drawbacks.
The first is a problem known as mode collapse. The discriminator only cares about distinguishing real samples from artificial samples; it does not care whether those artificial samples have broad coverage. For example, suppose the real samples include images of different animals such as cats, dogs, and horses. $G$ is happy to generate pictures of only dogs as long as these pictures become more realistic each time $\theta_g$ is updated. $D$ is also perfectly happy with $G$'s behavior as long as each time $\theta_d$ is updated, $D$ can tell real animal pictures from these artificial dog pictures a little better.
The second notorious problem with GAN is its difficulty to train. Unlike other deep learning problems, which see their loss function decrease gradually until convergence, GAN's minimax objective does not offer any such guarantee. In practice, GAN's objective keeps oscillating during training, and deciding when to stop is often a manual process. Another obstacle to effective training has to do with $D$'s final activation function, typically a sigmoid. When real samples and artificial samples are far apart, it is easy for $D$ to distinguish them. In this case, the sigmoid's outputs are very close to 1 for real samples and 0 for artificial samples, and its derivative is very close to 0, a problem known as vanishing gradient. These near-zero gradients cannot provide effective back-propagation for $G$ to improve its sample generating process.
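The saturation is easy to demonstrate. When the discriminator confidently rejects an artificial sample (a very negative pre-sigmoid logit $h$), the gradient flowing back to $G$ through $\log(1 - \sigma(h))$ is $-\sigma(h) \approx 0$:

```python
import torch

for logit in [0.0, -5.0, -15.0]:
    h = torch.tensor(logit, requires_grad=True)  # D's pre-sigmoid output
    loss = torch.log(1 - torch.sigmoid(h))       # G's term for a fake sample
    loss.backward()
    print(f"logit={logit:6.1f}  dloss/dh={h.grad.item():+.2e}")
# -5.0e-01, -6.7e-03, -3.1e-07: the more confident D is, the less G learns
```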
The inspiration for Wasserstein GAN (WGAN) comes from the contrast maximization described in Section 1.2 and sigmoid's vanishing gradient problem described in Section 3.1. WGAN's objective is

$$\min_{G}\,\max_{D}\; \mathbb{E}_{x \sim p_{data}}\big[D(x)\big] - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$
Basically, $D$ tries to maximize its output for real samples and minimize its output for artificial samples. $G$, on the other hand, tries to generate artificial samples that also get large outputs.
This objective looks just like the contrast maximization loss in Section 1.2, but there are two main differences: it is no longer an approximation valid only in a narrow range, and sigmoid is not used to compress $D$'s output to between 0 and 1. Completely removing the constraint on $D$'s output value may, however, pose another problem: $D$'s output may grow very rapidly out of bound, or $D$ may still effectively take the form of a sigmoid function and stifle back-propagation. Ideally, we'd like $D(x)$ to behave roughly like a linear function of $x$. Given that deep neural networks are differentiable for almost all inputs, we could force the norm of $D$'s gradient to be close to 1 everywhere:

$$\big\|\nabla_x D(x)\big\| \approx 1 \quad \text{for all } x$$
This constraint can be added to the objective as a penalty term:

$$\mathbb{E}_{x \sim p_{data}}\big[D(x)\big] - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big] - \lambda\,\mathbb{E}_{x}\Big[\big(\|\nabla_x D(x)\| - 1\big)^2\Big]$$
This penalty term needs to be numerically computable. Averaging over all possible values of $x$ is out of the question, but one possibility is to average it over both the real samples and the artificial samples, as shown below:

$$\mathbb{E}_{x \sim p_{data}}\big[D(x)\big] - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big] - \lambda\Big(\mathbb{E}_{x \sim p_{data}}\big[(\|\nabla_x D(x)\| - 1)^2\big] + \mathbb{E}_{\tilde{x} \sim p_g}\big[(\|\nabla_{\tilde{x}} D(\tilde{x})\| - 1)^2\big]\Big)$$
where $\lambda$ is the knob adjusting the relative importance between maximizing contrast and keeping $D$ roughly a linear function.
A variant of the above formulation creates samples $\hat{x}_i$ by randomly and linearly interpolating between real and artificial samples, and averages the penalty term over these samples instead. The final objective function, expressed in numerical computation form, is the following:

$$\frac{1}{N}\sum_{i=1}^{N} D(x_i) - \frac{1}{N}\sum_{i=1}^{N} D\big(G(z_i)\big) - \lambda\,\frac{1}{N}\sum_{i=1}^{N}\Big(\big\|\nabla_{\hat{x}_i} D(\hat{x}_i)\big\| - 1\Big)^2$$

where $\hat{x}_i = \epsilon_i x_i + (1 - \epsilon_i)\,G(z_i)$ with $\epsilon_i$ drawn uniformly from $[0, 1]$. GAN with this gradient norm penalty is called Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP).
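A sketch of the interpolated gradient penalty in PyTorch, assuming a critic `D` whose final layer has no sigmoid, flattened inputs, and the conventional default `lam = 10.0` (an assumption, not from the text):

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """Penalize deviation of ||grad_x D(x)|| from 1 at random interpolates."""
    eps = torch.rand(x_real.size(0), 1)  # one mixing weight per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grad, = torch.autograd.grad(
        outputs=D(x_hat).sum(), inputs=x_hat, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(D, G, x_real, z):
    x_fake = G(z).detach()
    contrast = D(x_real).mean() - D(x_fake).mean()
    return -contrast + gradient_penalty(D, x_real, x_fake)  # descent maximizes contrast
```

`create_graph=True` is the key detail: it lets the penalty, which is itself a function of gradients, be back-propagated into $\theta_d$.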
Both $\theta_d$ and $\theta_g$ affect the penalty term. Though it is not mentioned in the original WGAN-GP paper, I don't think it is desirable to add the penalty term to $G$'s loss function. WGAN-GP's objective should instead be: $D$ maximizes

$$\frac{1}{N}\sum_{i=1}^{N} D(x_i) - \frac{1}{N}\sum_{i=1}^{N} D\big(G(z_i)\big) - \lambda\,\frac{1}{N}\sum_{i=1}^{N}\Big(\big\|\nabla_{\hat{x}_i} D(\hat{x}_i)\big\| - 1\Big)^2$$

while $G$ minimizes only the contrast term

$$\frac{1}{N}\sum_{i=1}^{N} D(x_i) - \frac{1}{N}\sum_{i=1}^{N} D\big(G(z_i)\big)$$
There are a couple more things worth noting.
Most people are familiar with the concepts of maximum and minimum. Explicitly, if $(P, \le)$ is a (partially) ordered set and $S \subseteq P$ a subset, then $m$ is the maximum of $S$ iff $m \in S$ and $s \le m$ for all $s \in S$. Similarly, $m$ is the minimum of $S$ iff $m \in S$ and $m \le s$ for all $s \in S$.
The supremum (sup) of $S$ can be defined like this. Let $U = \{u \in P : s \le u \text{ for all } s \in S\}$, which is the set of elements greater than or equal to all members of $S$. If $U$ is empty, $S$'s supremum does not exist; otherwise it is the minimum of $U$. If $S$ has a maximum, it must be the same as $S$'s supremum. Even if $S$ does not have a maximum, it may still have a supremum. Below are three examples comparing maximum and supremum, over subsets of $\mathbb{R}$:

- $S = [0, 1]$: the maximum is 1, and the supremum is also 1.
- $S = (0, 1)$: there is no maximum (no element of $S$ is largest), but the supremum is 1.
- $S = \mathbb{R}$: there is neither a maximum nor a supremum, since $S$ is unbounded above.
Similar comparisons can be made between minimum and infimum. Informally, if one uses supremum and maximum interchangeably, little is lost. The same goes for infimum and minimum.
K-L divergence and J-S divergence are often used to measure the closeness between two distributions. In fact, VAE (amaires.github.io/VAE) uses K-L divergence for optimization, and GAN's $G$ minimizes the J-S divergence given an optimal $D^*$. Unfortunately, these two measures have discontinuities when two distributions have disjoint supports (the support of a function is the subset of the function domain not mapped to 0). For example, given two distributions $P$ and $Q$, each supported on a vertical line segment in the plane:

$$P: (0, y),\; y \sim U[0, 1] \qquad Q: (\theta, y),\; y \sim U[0, 1]$$

It is not hard to figure out their K-L and J-S divergence:

$$D_{KL}(P \,\|\, Q) = \begin{cases} 0 & \theta = 0 \\ +\infty & \theta \neq 0 \end{cases} \qquad \mathrm{JSD}(P \,\|\, Q) = \begin{cases} 0 & \theta = 0 \\ \log 2 & \theta \neq 0 \end{cases}$$

Neither is continuous in $\theta$ at 0.
Ideally, we'd like a measure that is smoother. Wasserstein distance is exactly such a function, defined as:

$$W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$$
where $\Pi(P, Q)$ contains all joint distributions $\gamma$ of $(x, y)$ such that $\gamma$'s marginal over $x$ is $P$ and its marginal over $y$ is $Q$. Wasserstein distance is also called earth-mover distance: it informally captures the minimal amount of mass/dirt that needs to be moved to turn the shape of $P$ into that of $Q$. Using this definition, the Wasserstein distance between $P$ and $Q$ above can be calculated to be $|\theta|$, which is a continuous function of $\theta$.
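The continuity claim can be verified with empirical samples; `scipy.stats.wasserstein_distance` computes the one-dimensional earth-mover distance. Since $P$ and $Q$ share the same $y$-marginal, only the $x$-coordinates matter here:

```python
import numpy as np
from scipy.stats import wasserstein_distance

n = 10_000
for theta in [1.0, 0.1, 0.01]:
    px = np.zeros(n)        # x-coordinates of samples from P
    qx = np.full(n, theta)  # x-coordinates of samples from Q
    print(theta, wasserstein_distance(px, qx))
# prints theta each time: W(P, Q) = |theta| shrinks continuously, while the
# supports stay disjoint for every theta != 0 (KL = +inf, JSD = log 2)
```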
In general, however, Wasserstein distance is intractable to calculate. Fortunately, Wasserstein distance has another definition, based on the Kantorovich-Rubinstein duality, which is easier to handle:
$$W(P, Q) = \sup_{\|f\|_L \le 1}\; \mathbb{E}_{x \sim P}\big[f(x)\big] - \mathbb{E}_{x \sim Q}\big[f(x)\big] \tag{5}$$
where $f$ is any real-valued function and $\|f\|_L \le 1$ means $f$'s Lipschitz constant is at most 1. Technically, it means

$$\big|f(x_1) - f(x_2)\big| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2$$
Intuitively, it is equivalent to saying $f$'s value should not change too quickly as $x$ changes.
Wasserstein distance (5) really captures the contrast maximization idea in Section 1.2 well. The Lipschitz continuity constraint can be approximated by the gradient penalty term introduced in Section 3.2.
GAN's $D$ maximizes the likelihood of observing real samples $x$ and artificial samples $G(z)$ with their correct labels. Given an optimal $D^*$, $G$ minimizes the J-S divergence between $p_{data}$ and $p_g$.
In WGAN, $D$ maximizes an estimate of the Wasserstein distance between $p_{data}$ and $p_g$, while $G$ tries to reduce it. Wasserstein distance is a smoother and more effective measure of closeness between two distributions, resulting in more stable training and less mode collapse in WGAN than in GAN.
The training of Variational Autoencoders produces both a generator and an encoder; the latter is capable of extracting features, or latent representations, of data. GAN only has a generator $G$. It is conceivable that the pre-final layers of the discriminator $D$ may be used for feature representations: the intuition is that $D$ would have learned useful high-level representations of both real and artificial samples.
Bidirectional Generative Adversarial Networks (BiGAN), depicted in Figure 4, takes a much more direct approach to latent representation. It introduces an encoder $E$ that maps real samples $x$ to $E(x)$ in the latent space. $D$ works on the joint distribution of samples and latent codes, and tries to maximize the contrast between real pairs $(x, E(x))$ and artificial pairs $(G(z), z)$. The loss function introduced in WGAN-GP can be used here. After training is done, new samples can be generated by $G(z)$, and latent representations can be inferred via $E(x)$.
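A sketch of BiGAN's joint discriminator, reusing the illustrative `DATA_DIM`/`LATENT_DIM` from earlier; the essential point is that it scores concatenated $(x, z)$ pairs, with an unbounded critic output so the WGAN-GP loss applies:

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Scores (x, z) pairs: real pairs are (x, E(x)), artificial are (G(z), z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(DATA_DIM + LATENT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # unbounded critic output, WGAN-GP style
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1))
```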
A GAN's generator maps samples drawn from $p_z$ to meaningful images. Can GAN map images in one domain to ones in another domain, for example from summer pictures to winter pictures of the same place, from horse pictures to zebra pictures, or from photos to Van Gogh's paintings?
It is not hard to design a GAN for that purpose. For example, suppose our goal is to add black/white stripes to a horse and make it look like a zebra. We could replace $z \sim p_z$ with a bunch of horse pictures, and the real samples would be drawn from zebra pictures. Unfortunately, this architecture does not ensure that the generated zebra picture will look much like the input horse picture. CycleGAN introduced two innovations to address this problem: a second generator that maps in the reverse direction, and a cycle consistency loss requiring that a sample translated to the other domain and back lands close to where it started.
For simplicity, we’ll use the following notation.
| Notation | Meaning |
| --- | --- |
| $A$ | real samples from domain $A$ |
| $B$ | real samples from domain $B$ |
| $D_A$ | discriminator for samples in $A$ |
| $D_B$ | discriminator for samples in $B$ |
| $G_{AB}$ | generator that maps samples in $A$ to samples in $B$ |
| $G_{BA}$ | generator that maps samples in $B$ to samples in $A$ |
| $L_{GAN}(G_{AB}, D_B)$ | GAN's loss function that involves $G_{AB}$ and $D_B$ |
| $L_{GAN}(G_{BA}, D_A)$ | GAN's loss function that involves $G_{BA}$ and $D_A$ |
CycleGAN's optimization objective is

$$\min_{G_{AB},\,G_{BA}}\,\max_{D_A,\,D_B}\; L_{GAN}(G_{AB}, D_B) + L_{GAN}(G_{BA}, D_A) + \lambda\,L_{cyc}(G_{AB}, G_{BA})$$

where the cycle consistency loss is

$$L_{cyc}(G_{AB}, G_{BA}) = \mathbb{E}_{a \sim A}\big[\|G_{BA}(G_{AB}(a)) - a\|_1\big] + \mathbb{E}_{b \sim B}\big[\|G_{AB}(G_{BA}(b)) - b\|_1\big]$$
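A sketch of the cycle consistency term under the notation above. The L1 norm and the weight `lam_cyc = 10.0` follow the common CycleGAN formulation and are assumptions here, not taken from the text:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_AB, G_BA, a, b, lam_cyc=10.0):
    """Encourage G_BA(G_AB(a)) ~= a and G_AB(G_BA(b)) ~= b."""
    loss_a = F.l1_loss(G_BA(G_AB(a)), a)  # A -> B -> A round trip
    loss_b = F.l1_loss(G_AB(G_BA(b)), b)  # B -> A -> B round trip
    return lam_cyc * (loss_a + loss_b)
```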