Energy Based Models

Amaires@June 2024

1 Motivation

Essential to generative learning is the modeling of the probability density function (PDF) of given data. In theory, a deep neural network fπœƒ is capable of approximating any function. In general, however, fπœƒ is not a valid PDF, which must satisfy two fundamental requirements:

Non-negativity:
fπœƒ(x) β‰₯ 0
Normalization:
∫ fπœƒ(x) dx = 1

The non-negativity requirement is not hard to satisfy with simple transformations applied to fπœƒ. For example, exp(fπœƒ(x)) and fπœƒ(x)Β² are both non-negative.

The normalization requirement, however, is much harder to satisfy. There are a few approaches to this problem.

  1. Generative Adversarial Networks (GAN) do not model the PDF or rely on it for training. Instead, they only learn a model that samples can be drawn from.
  2. Autoregressive models break the PDF into the product of a series of conditional PDFs.
  3. Normalizing flow models use a sequence of bijective mappings to transform relatively simple distributions to the desired PDF.
  4. Variational AutoEncoders (VAE) optimize a lower bound of the likelihood (the evidence lower bound). Like GAN, they do not produce a true PDF at the end either.

Energy Based Models (EBM) take a different approach. An EBM only models an unnormalized function Eπœƒ(x), with the expectation that the actual PDF will be

pπœƒ = exp (Eπœƒ(x)) Zπœƒ ,whereZπœƒ = ∫ exp (Eπœƒ(x))

Zπœƒ, the normalization numerator and a function of πœƒ but not x , is also called the partition function. EBM has some of its roots in statistical physics and hence the name Energy Based Models. Eπœƒ(x), or in literature βˆ’ Eπœƒ(x), is called the energy function. Without the normalization requirement, and unlike autoregressive models and normalizing flow models, EBM can give Eπœƒ more flexibity and potentially make it more powerful.

Since EBMs only explicitly model Eπœƒ, but not Zπœƒ or pπœƒ, any task that strictly requires pπœƒ is out of the question. Eπœƒ, however, is sufficient for comparing pπœƒ(x1) and pπœƒ(x2), since

pπœƒ(x1) > pπœƒ(x2)⇔exp (Eπœƒ(x1)) > exp (Eπœƒ(x2))⇔Eπœƒ(x1) > Eπœƒ(x2) (1)

This property is enough to enable many practical deep learning applications, such as object recognition, painting restoration, and sequence labeling.
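To make this concrete, here is a minimal sketch of an energy network and of property (1), assuming PyTorch; the EnergyNet class, its MLP architecture, and all hyperparameters are illustrative choices, not something prescribed by the text.

import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    # A small MLP playing the role of E_theta: R^d -> R (illustrative architecture).
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                 # x: (batch, dim)
        return self.net(x).squeeze(-1)    # one scalar energy per sample

energy = EnergyNet(dim=2)
x1, x2 = torch.randn(1, 2), torch.randn(1, 2)
# Property (1): comparing energies compares unnormalized densities,
# because Z_theta cancels out of the comparison of p_theta(x1) and p_theta(x2).
print(energy(x1) > energy(x2))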

2 Sampling

Since EBMs do not explicitly model pπœƒ, how are samples drawn given Eπœƒ?

The Metropolis-Hastings Markov Chain Monte Carlo (M-H MCMC) method described in Algorithm 1 is a relatively simple solution. The [*] step ensures that enough of the space of x is sampled and that the algorithm does not get stuck at a local maximum. The M-H MCMC method works in theory, but can take a very long time to converge.

x := norm_random()
until convergence:
    y := x + πœ– Β· norm_random()
    if Eπœƒ(y) > Eπœƒ(x):
        x := y
    else:
        with probability exp(Eπœƒ(y) βˆ’ Eπœƒ(x)):
            x := y        [*]
return x

Algorithm 1: Metropolis-Hastings Markov Chain Monte Carlo method
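A hedged Python sketch of Algorithm 1 follows, assuming PyTorch and an energy callable (such as the EnergyNet sketch above) that maps a (1, dim) tensor to a scalar energy; the fixed step count and the step size eps stand in for "until convergence" and πœ– and are illustrative.

import math
import torch

def mh_sample(energy, dim, steps=10_000, eps=0.1):
    x = torch.randn(1, dim)                         # x := norm_random()
    for _ in range(steps):                          # stand-in for "until convergence"
        y = x + eps * torch.randn_like(x)           # propose a nearby point
        log_ratio = (energy(y) - energy(x)).item()  # E(y) - E(x); Z_theta cancels
        # Accept if the energy increases; otherwise accept with
        # probability exp(E(y) - E(x)), the [*] step.
        if log_ratio > 0 or torch.rand(()).item() < math.exp(log_ratio):
            x = y
    return x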

One obvious way to speed up the M-H MCMC method is to take advantage of the gradient of pπœƒ with respect to x and use it to find x with higher probability. That gradient, however, still depends on Zπœƒ:

βˆ‡xpπœƒ(x) = 1 Zπœƒ exp (Eπœƒ(x))βˆ‡xEπœƒ(x).

Fortunately, the gradient of log pπœƒ(x), also called the score function sπœƒ(x), only depends on Eπœƒ(x), because

sπœƒ(x) = βˆ‡x log pπœƒ(x) = βˆ‡xEπœƒ(x) βˆ’βˆ‡x log Zπœƒ = βˆ‡xEπœƒ(x).

The last step of the derivation works because Zπœƒ does not depend on x and hence βˆ‡x log Zπœƒ = 0. The Langevin MCMC method, described in Algorithm 2, works exactly by leveraging sπœƒ(x). Again, the randomization in the [*] step helps the algorithm get out of local maxima and sample more of the space of x.

x := norm_random()
until convergence:
    x := x + πœ– Β· sπœƒ(x) + √(2πœ–) Β· norm_random()        [*]
return x

Algorithm 2: Langevin MCMC method
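A corresponding sketch of Algorithm 2 under the same PyTorch assumptions; the score sπœƒ(x) = βˆ‡xEπœƒ(x) is computed with autograd, and steps and eps are again illustrative.

import torch

def langevin_sample(energy, dim, steps=1_000, eps=0.01):
    x = torch.randn(1, dim)                                    # x := norm_random()
    for _ in range(steps):                                     # stand-in for "until convergence"
        x = x.detach().requires_grad_(True)
        score = torch.autograd.grad(energy(x).sum(), x)[0]     # s_theta(x) = grad_x E_theta(x)
        # x := x + eps * s_theta(x) + sqrt(2 * eps) * noise, the [*] step
        x = x + eps * score + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x.detach()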

3 Training

There are multiple different ways of training EBMs. Some require sampling from the model being trained, and others do not.

3.1 Maximum Likelihood Method or Contrastive Divergence

Surprisingly, it is possible to conduct maximum likelihood optimization for EBM without modeling the PDF. Let’s start with a little math:

maxπœƒ Ex∼p(x) log pπœƒ(x) = maxπœƒ Ex∼p(x)[Eπœƒ(x) βˆ’ log Zπœƒ] = maxπœƒ [Ex∼p(x)Eπœƒ(x) βˆ’ log Zπœƒ]

The likelihood gradient for updating πœƒ is

βˆ‡πœƒ[Ex∼p(x)Eπœƒ(x) βˆ’ log Zπœƒ] = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ‡πœƒ log Zπœƒ = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒβˆ‡πœƒZπœƒ = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒβˆ‡πœƒ[∫ exp (Eπœƒ(x))dx] = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ βˆ‡πœƒ exp (Eπœƒ(x))dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ« 1 Zπœƒ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ« pπœƒ(x)βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ Ex∼pπœƒ(x)βˆ‡πœƒEπœƒ(x)

Note that the first term involves p(x), the second term involves pπœƒ(x), and neither depends on Zπœƒ. Also note that the likelihood gradient here is with respect to πœƒ, the parameters, not x; don’t confuse it with the score function, which is a gradient with respect to x. The likelihood gradient points in the direction where the energy function’s gradient differs the most between real samples and model samples, which is probably where the name Contrastive Divergence comes from.

The big picture here is that even though the partition function Zπœƒ is not modeled, it is still possible to estimate the likelihood function’s gradient and conduct maximum likelihood training. Each training step though requires drawing samples from the model being trained, which can be expensive, as described in Section 2.
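As a sketch of what one such training step could look like (assuming PyTorch; the cd_step function, the sampler(energy, dim) signature, and the use of the Langevin sketch above as the sampler are all illustrative):

import torch

def cd_step(energy, optimizer, x_data, sampler):
    # Negative samples from the current model p_theta, detached so gradients
    # flow only through the energy evaluations below, not the sampling chain.
    x_model = sampler(energy, x_data.shape[-1]).detach()
    # Gradient ascent on E_p[grad E_theta] - E_{p_theta}[grad E_theta]
    # is gradient descent on this loss.
    loss = energy(x_model).mean() - energy(x_data).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()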

3.2 Score Matching

If βˆ‡xp(x) and βˆ‡xpπœƒ(x) are equal everywhere, then p(x) = pπœƒ(x) + constant. By the same reasoning, if βˆ‡x log p(x) and βˆ‡x log pπœƒ(x) are equal everywhere, then log p(x) = log pπœƒ(x) + constant. That constant difference must be zero, since both p(x) and pπœƒ(x) have to integrate to 1.

That is the key idea behind score matching, a method that matches the scores or score functions of two distributions everywhere as an alternative to maximum likelihood based training. The objective of score matching is to minimize the Fisher Divergence between p(x) and pπœƒ(x):

min πœƒFD(p(x),pπœƒ(x)) = min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡x log pπœƒ(x)βˆ₯22 = min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡xEπœƒ(x)βˆ₯22

We’ll show, in the univariate case, how this objective can be manipulated so that it does not depend on the unknown p(x).

min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡xEπœƒ(x)βˆ₯22 = min πœƒ1 2∫ p(x)[log β€²p(x) βˆ’ E πœƒβ€² (x)]2 = min πœƒ1 2∫ p(x)[(log β€²p(x))2 + (E πœƒβ€² (x))2 βˆ’ 2log β€²p(x)E πœƒβ€² (x)] = min πœƒ[1 2∫ p(x)(log β€²p(x))2 + 1 2∫ p(x)(Eπœƒβ€² (x))2 βˆ’βˆ« p(x)log β€²p(x)E πœƒβ€² (x)] (2)

The first term does not depend on πœƒ and can therefore be left out. The third term still has (log p(x))β€² in it. Recall that the integration by parts formula states that

βˆ«β‚α΅‡ u(x)vβ€²(x) dx = u(x)v(x)|β‚α΅‡ βˆ’ βˆ«β‚α΅‡ uβ€²(x)v(x) dx

and it can be used to rewrite the third term:

∫ p(x)(log p(x))β€²Eπœƒβ€²(x) dx
= ∫ p(x)(1/p(x))pβ€²(x)Eπœƒβ€²(x) dx
= ∫ pβ€²(x)Eπœƒβ€²(x) dx
= p(x)Eπœƒβ€²(x)|βˆ’βˆž+∞ βˆ’ ∫ p(x)Eπœƒβ€³(x) dx
= 0 βˆ’ ∫ p(x)Eπœƒβ€³(x) dx    (3)
= βˆ’Ex∼p(x)Eπœƒβ€³(x)    (4)

The derivation of (3) makes the very reasonable assumption that p(x) β†’ 0 as x β†’ +∞ and as x β†’ βˆ’βˆž.
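A quick Monte Carlo sanity check of (3)/(4), with p = N(0, 1) (so (log p(x))β€² = βˆ’x) and an illustrative energy Eπœƒ(x) = βˆ’(a/2)xΒ², for which Eπœƒβ€³(x) = βˆ’a; both sides should come out close to a.

import torch

a = 0.7
x = torch.randn(1_000_000)              # samples from p = N(0, 1)
lhs = ((-x) * (-a * x)).mean()          # E_p[(log p)'(x) * E_theta'(x)]
rhs = a                                 # -E_p[E_theta''(x)] = -(-a)
print(lhs.item(), rhs)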

Eliminating the first term in (2), rewriting the second term in expectation form, and substituting the third term with (4), we have

min πœƒ[1 2Ex∼p(x)(Eπœƒβ€² (x))2 + E x∼p(x)Eπœƒβ€³ (x)] = min πœƒEx∼p(x)[1 2(Eπœƒβ€² (x))2 + E πœƒβ€³ (x)].

The multivariate version of the objective can be shown to be

min πœƒEx∼p(x)[1 2 βˆ₯βˆ‡πœƒEπœƒ(x)βˆ₯22 + tr(βˆ‡ πœƒ2E πœƒ(x))]

where the second term is the trace of the Hessian matrix of Eπœƒ(x). Loosely speaking, the first term tries to find πœƒ such that the samples x are local maxima or minima of the energy (with gradients as close to 0 as possible), and the second term tries to make sure they are actually local maxima (with second-order gradients as negative as possible).

The score matching training method avoids the very expensive procedure of drawing samples from the model being trained. Its main expensive operation is the computation of the trace of the Hessian matrix. There is more research in this space that will be explored in future tutorials.
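A sketch of the multivariate loss under the same PyTorch assumptions; the exact Hessian trace takes one extra backward pass per input dimension, which is precisely the expensive part mentioned above.

import torch

def score_matching_loss(energy, x):                 # x: (n, d)
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]   # grad_x E_theta(x)
    trace = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):                     # one backward pass per dimension
        trace = trace + torch.autograd.grad(
            grad[:, i].sum(), x, create_graph=True)[0][:, i]
    return (0.5 * grad.pow(2).sum(dim=1) + trace).mean()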

3.3 Noise Contrastive Estimation

Noise Contrastive Estimation (NCE) is another method for training EBMs that does not require drawing samples from the model being trained. Recall that in Generative Adversarial Networks (GAN) (amaires.github.io/GAN), given a fixed Generator GΟ•, the optimal Discriminator Dπœƒβ€™s output is

Dπœƒβˆ—(x) = p(x) p(x) + pΟ•(x)

The result holds if GΟ• and pΟ• are replaced with any static known noise distribution pn(x). That is

Dπœƒβˆ—(x) = p(x) p(x) + pn(x)

Note here n is not a parameter; it just means noise.

If Dπœƒβ€™s neural network is explicitly constructed as

Dπœƒ(x) = Fπœƒ(x) Fπœƒ(x) + pn(x)

then

Dπœƒβˆ—(x) = p(x) p(x) + pn(x) ≃ Fπœƒβˆ—(x) Fπœƒβˆ—(x) + pn(x)

Solving this shows that Fπœƒβˆ—(x) ≃ p(x), which also means Fπœƒ(x) ends up automatically normalized if all stars are aligned. Now if Fπœƒ(x) is replaced with an energy-function-based PDF

exp (Eπœƒ(x)) Z

where Z is an additional parameter, which is not guaranteed to be equal to Eπœƒ(x)’s partition function Zπœƒ, then we have

Dπœƒ,Z(x) = exp (Eπœƒ(x)) Z exp (Eπœƒ(x)) Z + pn(x) = exp (Eπœƒ(x)) exp (Eπœƒ(x)) + Zpn(x) (5)

and

exp (Eπœƒβˆ—(x)) Zβˆ— ≃ p(x)

where Eπœƒβˆ— would be our trained energy model.

With Dπœƒ constructed as in (5), Dπœƒβ€™s optimization objective becomes

max πœƒ,ZEx∼p(x) log Dπœƒ,Z(x) + Ex∼pn(x) log (1 βˆ’ Dπœƒ,Z(x)) = max πœƒ,ZEx∼p(x)[Eπœƒ(x) βˆ’ log (exp (Eπœƒ(x)) + Zpn(x)] + Ex∼pn(x)[log (Zpn(x)) βˆ’ log (exp (Eπœƒ(x)) + Zpn(x)]

3.4 Flow Contrastive Estimation

In theory, there are no requirements on the static noise distribution pn(x) for NCE. In practice, the closer pn(x) is to p(x) (without being identical to it), the more effective NCE is. Flow Contrastive Estimation parameterizes pn(x) as pΟ•(x) with a normalizing flow model, because normalizing flow models are easy to sample from and give a tractable PDF. The discriminator is now modeled as

Dπœƒ,Z,Ο•(x) = exp (Eπœƒ(x)) exp (Eπœƒ(x)) + ZpΟ•(x)

and the objective function is

max πœƒ,Z min Ο•Ex∼p(x) log (Dπœƒ,Z,Ο•(x)) + Ex∼pΟ•(x) log (1 βˆ’ Dπœƒ,Z,Ο•(x))