Essential to generative learning is the modeling of the probability density function (PDF) of given data. In theory, a deep neural network $f_\theta(x)$ is capable of approximating any function. In general, however, $f_\theta(x)$ is not a valid PDF, which has two fundamental requirements: non-negativity ($p(x) \ge 0$ for all $x$) and normalization ($\int p(x)\,dx = 1$).
The non-negativity requirement is not hard to satisfy with simple transformations applied to $f_\theta(x)$. For example, $e^{f_\theta(x)}$ and $f_\theta(x)^2$ are both non-negative.
The normalization requirement, however, is much harder to satisfy. There are a few approaches to this problem.
Energy Based Models (EBM) take a different approach. An EBM only models a non-normalized function $f_\theta(x)$, with the expectation that the actual PDF will be

$$p_\theta(x) = \frac{e^{f_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{f_\theta(x)}\,dx$$

$Z_\theta$, the normalizing denominator and a function of $\theta$ but not $x$, is also called the partition function. EBM has some of its roots in statistical physics, hence the name Energy Based Models. $-f_\theta(x)$, or $E_\theta(x)$ in the literature, is called the energy function. Without the normalization requirement, and unlike autoregressive models and normalizing flow models, EBMs enjoy more flexibility, which potentially makes them more powerful.
Since EBMs only explicitly model $f_\theta(x)$, but not $Z_\theta$ or $p_\theta(x)$, any task that strictly requires $p_\theta(x)$ is out of the question. $f_\theta(x)$, however, is sufficient for comparing $p_\theta(x_1)$ and $p_\theta(x_2)$, since
$$\frac{p_\theta(x_1)}{p_\theta(x_2)} = \frac{e^{f_\theta(x_1)}/Z_\theta}{e^{f_\theta(x_2)}/Z_\theta} = e^{f_\theta(x_1) - f_\theta(x_2)} \tag{1}$$
This property is enough to enable many practical deep learning applications such as object recognition, painting restoration, and sequence labeling.
Since EBMs do not explicitly model $p_\theta(x)$, how are samples drawn given only $f_\theta(x)$?
The Metropolis-Hastings Markov Chain Monte Carlo (M-H MCMC) method described in Algorithm 1 is a relatively simple solution. The [*] step ensures that sufficient space of $x$ is sampled and that the algorithm does not get stuck at a local maximum. The M-H MCMC method works in theory, but can take a very long time to converge.
x := norm_random()
until convergence:
    x' = x + norm_random()
    if f_θ(x') ≥ f_θ(x):
        x = x'
    else:
        with probability exp(f_θ(x') − f_θ(x)):
            x = x'    [*]
return x
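For concreteness, here is a minimal runnable sketch of Algorithm 1 in PyTorch, assuming `f` is a function computing $f_\theta(x)$ for a tensor $x$; the names `f`, `n_steps`, and `step_size` are illustrative choices, not from the text.

```python
import torch

def metropolis_hastings(f, dim, n_steps=10_000, step_size=0.1):
    x = torch.randn(dim)                               # x := norm_random()
    for _ in range(n_steps):                           # "until convergence", approximated by a fixed budget
        x_new = x + step_size * torch.randn(dim)       # propose x' = x + norm_random()
        log_ratio = f(x_new) - f(x)                    # log p(x')/p(x); Z_theta cancels out
        if log_ratio >= 0:
            x = x_new                                  # always accept a more probable x'
        elif torch.rand(()) < torch.exp(log_ratio):    # otherwise accept with probability exp(f(x') - f(x))  [*]
            x = x_new
    return x
```

For example, with `f = lambda x: -0.5 * (x ** 2).sum()` the chain targets a standard normal distribution.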
One obvious way to speed up the M-H MCMC method is to take advantage of the gradient of $p_\theta(x)$ with respect to $x$ and use it to find $x$ with higher probability. That gradient still depends on $Z_\theta$, however, as

$$\nabla_x p_\theta(x) = \frac{\nabla_x e^{f_\theta(x)}}{Z_\theta} = \frac{e^{f_\theta(x)}\,\nabla_x f_\theta(x)}{Z_\theta}$$
Fortunately the gradient of $\log p_\theta(x)$, also called the score function, only depends on $f_\theta(x)$, because

$$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x) - \nabla_x \log Z_\theta = \nabla_x f_\theta(x)$$

The last step of the derivation works because $Z_\theta$ does not depend on $x$ and hence $\nabla_x \log Z_\theta = 0$. The Langevin MCMC method, described in Algorithm 2, works exactly by leveraging $\nabla_x f_\theta(x)$. Again, the randomization in the [*] step helps the algorithm get out of local maxima and sample more of the space of $x$.
x := norm_random()
until convergence:
    x = x + (ε/2)·∇_x f_θ(x) + √ε·norm_random()    [*]
return x
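A corresponding sketch of Algorithm 2, assuming `f` returns a scalar and is differentiable by PyTorch's autograd; `step_size` plays the role of ε above, and all names are illustrative.

```python
import torch

def langevin_mcmc(f, dim, n_steps=1_000, step_size=0.01):
    x = torch.randn(dim, requires_grad=True)           # x := norm_random()
    for _ in range(n_steps):
        score = torch.autograd.grad(f(x), x)[0]        # grad_x f_theta(x) = grad_x log p_theta(x)
        noise = torch.randn(dim)                       # the randomization of the [*] step
        x = (x + 0.5 * step_size * score
               + step_size ** 0.5 * noise).detach().requires_grad_(True)
    return x.detach()
```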
There are multiple different ways of training EBMs. Some require sampling from the model being trained, and others do not.
Surprisingly, it is possible to conduct maximum likelihood optimization for an EBM without modeling the PDF. Let's start with a little math:

$$\log p_\theta(x) = f_\theta(x) - \log Z_\theta, \qquad \nabla_\theta \log Z_\theta = \frac{\int e^{f_\theta(x')}\,\nabla_\theta f_\theta(x')\,dx'}{Z_\theta} = \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta f_\theta(x')\right]$$

The likelihood gradient for updating $\theta$ is

$$\nabla_\theta \log p_\theta(x) = \nabla_\theta f_\theta(x) - \nabla_\theta \log Z_\theta = \nabla_\theta f_\theta(x) - \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta f_\theta(x')\right]$$
Note that the first term involves the real data sample $x$, the second term involves samples $x'$ drawn from the model $p_\theta$, and neither depends on $Z_\theta$. Also note that here the likelihood gradient is with respect to $\theta$, the parameters, not $x$. Don't confuse it with the score function, which is a gradient with respect to $x$. The likelihood gradient points in the direction where the energy function's gradient differs the most between real samples and model samples. This is probably where the name Contrastive Divergence comes from.
The big picture here is that even though the partition function is not modeled, it is still possible to estimate the likelihood function's gradient and conduct maximum likelihood training. Each training step, though, requires drawing samples from the model being trained, which can be expensive, as described in Section 2.
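As a sketch under stated assumptions: `f_theta` is a PyTorch module mapping a batch to per-sample values of $f_\theta(x)$, and `sample_from_model` is any sampler for $p_\theta$ (for example the Langevin sketch above); the names are illustrative, not from the text.

```python
import torch

def max_likelihood_step(f_theta, optimizer, real_x, sample_from_model):
    fake_x = sample_from_model(f_theta).detach()   # the expensive part: samples from p_theta
    # Ascending E_data[grad_theta f] - E_model[grad_theta f] equals descending this loss.
    loss = f_theta(fake_x).mean() - f_theta(real_x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```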
If $\nabla_x \log p_\theta(x)$ and $\nabla_x \log p_{data}(x)$ are equal everywhere, then $\log p_\theta(x) - \log p_{data}(x)$ is a constant. Therefore, if the score functions of $p_\theta$ and $p_{data}$ are equal everywhere, then $\frac{p_\theta(x)}{p_{data}(x)}$ is a constant. That constant difference can be removed since both $p_\theta(x)$ and $p_{data}(x)$ have to integrate to 1.
That is the key idea behind score matching, a method that matches the scores, or score functions, of two distributions everywhere as an alternative to maximum likelihood based training. The objective of score matching is to minimize the Fisher Divergence between $p_{data}(x)$ and $p_\theta(x)$:

$$D_F\!\left(p_{data} \,\|\, p_\theta\right) = \mathbb{E}_{x \sim p_{data}}\!\left[\frac{1}{2}\left\|\nabla_x \log p_{data}(x) - \nabla_x f_\theta(x)\right\|^{2}\right]$$
We'll show how this objective can be manipulated to not depend on the unknown $p_{data}(x)$ in the univariate case. Expanding the square gives

$$\frac{1}{2}\int p_{data}(x)\left(\frac{d \log p_{data}(x)}{dx}\right)^{2} dx + \frac{1}{2}\int p_{data}(x)\left(\frac{d f_\theta(x)}{dx}\right)^{2} dx - \int p_{data}(x)\,\frac{d \log p_{data}(x)}{dx}\,\frac{d f_\theta(x)}{dx}\,dx \tag{2}$$
The first term does not depend on $\theta$, and can therefore be left out. The third term still has the unknown $\frac{d \log p_{data}(x)}{dx}$ in it. Recall that the integration by parts formula states that

$$\int_a^b u(x)\,v'(x)\,dx = \Big[u(x)\,v(x)\Big]_a^b - \int_a^b u'(x)\,v(x)\,dx$$

Since $p_{data}(x)\,\frac{d \log p_{data}(x)}{dx} = \frac{d\,p_{data}(x)}{dx}$, it can be used to rewrite the third term:

$$-\int \frac{d\,p_{data}(x)}{dx}\,\frac{d f_\theta(x)}{dx}\,dx = -\Big[p_{data}(x)\,\frac{d f_\theta(x)}{dx}\Big]_{-\infty}^{+\infty} + \int p_{data}(x)\,\frac{d^{2} f_\theta(x)}{dx^{2}}\,dx = \int p_{data}(x)\,\frac{d^{2} f_\theta(x)}{dx^{2}}\,dx \tag{3}$$

or, in expectation form,

$$\mathbb{E}_{x \sim p_{data}}\!\left[\frac{d^{2} f_\theta(x)}{dx^{2}}\right] \tag{4}$$
The derivation of (3) makes the very reasonable assumption that $p_{data}(x)\,\frac{d f_\theta(x)}{dx} \to 0$ as $x \to \pm\infty$, so the boundary term vanishes.
Eliminating the first term in (2), rewriting the second term in the expectation form, and substituting the third term with (4), we have

$$\mathbb{E}_{x \sim p_{data}}\!\left[\frac{1}{2}\left(\frac{d f_\theta(x)}{dx}\right)^{2} + \frac{d^{2} f_\theta(x)}{dx^{2}}\right]$$
The multivariate version of the objective can be shown to be

$$\mathbb{E}_{x \sim p_{data}}\!\left[\frac{1}{2}\left\|\nabla_x f_\theta(x)\right\|^{2} + \mathrm{tr}\!\left(\nabla_x^{2} f_\theta(x)\right)\right]$$
where the second term is the trace of the Hessian matrix of $f_\theta(x)$. Loosely speaking, the first term tries to find $\theta$ such that the samples sit at local maxima or minima of $f_\theta$ (with gradients as close to 0 as possible), and the second term tries to make sure they are actually local maxima (with second-order gradients as negative as possible).
The score matching training method avoids the very expensive procedure of drawing samples from the model being trained. Its main expensive operation is the computation of the trace of the Hessian matrix. There is more research in this space that will be explored in future tutorials.
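A minimal sketch of the multivariate objective, assuming `f_theta` maps a batch of shape `(batch, dim)` to per-sample scalars; the Hessian trace is accumulated one dimension at a time with autograd, which is the expensive operation mentioned above. The names are illustrative.

```python
import torch

def score_matching_loss(f_theta, x):
    x = x.detach().requires_grad_(True)
    # score = grad_x f_theta(x), one gradient vector per sample
    score = torch.autograd.grad(f_theta(x).sum(), x, create_graph=True)[0]
    hessian_trace = 0.0
    for i in range(x.shape[1]):                    # tr(grad_x^2 f_theta(x)), dimension by dimension
        grad_i = torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0]
        hessian_trace = hessian_trace + grad_i[:, i]
    return (0.5 * (score ** 2).sum(dim=1) + hessian_trace).mean()
```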
Noise Contrastive Estimation (NCE) is another training method for EBMs that does not require drawing samples from the model being trained. Recall that in Generative Adversarial Networks (GAN) (amaires.github.io/GAN), given a fixed Generator $G$, the optimal Discriminator $D^*$'s output is

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

The result holds if $G$ and $p_G(x)$ are replaced with any static known noise distribution $p_n(x)$. That is

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_n(x)}$$
Note that the subscript $n$ here is not a parameter; it just means noise.
If $D$'s neural network is explicitly constructed as

$$D_\theta(x) = \frac{p_\theta(x)}{p_\theta(x) + p_n(x)}$$

then setting it equal to the optimal Discriminator gives

$$\frac{p_\theta(x)}{p_\theta(x) + p_n(x)} = \frac{p_{data}(x)}{p_{data}(x) + p_n(x)}$$

Solving it basically shows that $p_\theta(x) = p_{data}(x)$, which also means $p_\theta(x)$ is automatically normalized if all stars are aligned. Now if $p_\theta(x)$ is replaced with an energy function based PDF

$$\frac{e^{f_\theta(x)}}{Z}$$

where $Z$ is an additional parameter, which is not guaranteed to be equal to $f_\theta$'s partition function $Z_\theta$, we have

$$D_{\theta,Z}(x) = \frac{e^{f_\theta(x)}/Z}{e^{f_\theta(x)}/Z + p_n(x)} \tag{5}$$

and, at the Discriminator's optimum,

$$\frac{e^{f_\theta(x)}}{Z} = p_{data}(x)$$

where $\frac{e^{f_\theta(x)}}{Z}$ would be our trained energy model.
With $D_{\theta,Z}$ constructed as in (5), NCE's optimization objective becomes

$$\max_{\theta,\,Z}\;\mathbb{E}_{x \sim p_{data}}\!\left[\log D_{\theta,Z}(x)\right] + \mathbb{E}_{x \sim p_n}\!\left[\log\!\left(1 - D_{\theta,Z}(x)\right)\right]$$
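A sketch of this objective in PyTorch, under the assumption that `f_theta` returns per-sample $f_\theta(x)$, `log_Z` is the extra learnable parameter ($\log Z$ above), and `noise` is a fixed `torch.distributions`-style object with `sample` and `log_prob`; the names are illustrative.

```python
import torch
import torch.nn.functional as F

def nce_loss(f_theta, log_Z, noise, real_x):
    noise_x = noise.sample((real_x.shape[0],))        # one noise sample per data sample
    # The logit of D(x) = (e^f/Z) / (e^f/Z + p_n(x)) is log(e^f/Z) - log p_n(x).
    logit_real = f_theta(real_x) - log_Z - noise.log_prob(real_x)
    logit_noise = f_theta(noise_x) - log_Z - noise.log_prob(noise_x)
    # Maximizing E_data[log D] + E_noise[log(1 - D)] is the same as minimizing this
    # binary cross-entropy with label 1 for data and label 0 for noise.
    return (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
            + F.binary_cross_entropy_with_logits(logit_noise, torch.zeros_like(logit_noise)))
```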
In theory, there are no requirements on the static noise distribution $p_n(x)$ for NCE. In practice, the closer $p_n(x)$ is to $p_{data}(x)$ (without being identical), the more effective NCE is. Flow Contrastive Estimation parameterizes $p_n(x)$ as $q_\alpha(x)$ with a normalizing flow model, because normalizing flow models are easy to sample from and give a tractable PDF. The discriminator is now modeled as

$$D_{\theta,Z,\alpha}(x) = \frac{e^{f_\theta(x)}/Z}{e^{f_\theta(x)}/Z + q_\alpha(x)}$$
and the objective function is

$$\max_{\theta,\,Z}\;\min_{\alpha}\;\mathbb{E}_{x \sim p_{data}}\!\left[\log D_{\theta,Z,\alpha}(x)\right] + \mathbb{E}_{x \sim q_\alpha}\!\left[\log\!\left(1 - D_{\theta,Z,\alpha}(x)\right)\right]$$

where the EBM parameters maximize the objective while the flow parameters $\alpha$ minimize it, as in a GAN.
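A sketch of the corresponding value function, assuming `flow` is a normalizing flow object exposing `sample` and `log_prob` and that its samples are reparameterized so gradients reach $\alpha$; these names are illustrative. The EBM parameters ascend this value while the flow parameters descend it.

```python
import torch
import torch.nn.functional as F

def fce_value(f_theta, log_Z, flow, real_x):
    flow_x = flow.sample((real_x.shape[0],))           # samples from q_alpha
    # logit of D(x) = (e^f/Z) / (e^f/Z + q_alpha(x)) is log(e^f/Z) - log q_alpha(x)
    logit_real = f_theta(real_x) - log_Z - flow.log_prob(real_x)
    logit_flow = f_theta(flow_x) - log_Z - flow.log_prob(flow_x)
    # E_data[log D] + E_flow[log(1 - D)]; note log(1 - sigmoid(t)) = logsigmoid(-t)
    return F.logsigmoid(logit_real).mean() + F.logsigmoid(-logit_flow).mean()
```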