Energy Based Models

Amaires@June 2024

1 Motivation

Essential to generative learning is the modeling of the probability density function (PDF) of given data. In theory, a deep neural network fπœƒ is capable of approximating any function. In general, however, fπœƒ is not a valid PDF, which must satisfy two fundamental requirements:

Non-negativity:
fπœƒ(x) β‰₯ 0
Normalization:
∫ fπœƒ(x) dx = 1

The non-negativity requirement is not hard to satisfy with simple transformations applied to fπœƒ. For example, exp(fπœƒ(x)) and fπœƒ(x)Β² are both non-negative.

The normalization requirement, however, is much harder to satisfy. There are a few approaches to this problem.

  1. Generative Adversarial Networks (GAN) do not model the PDF or rely on it for training. Instead, they only learn a model that samples can be drawn from.
  2. Autoregressive models break the PDF into the product of a series of conditional PDFs.
  3. Normalizing flow models use a sequence of bijective mappings to transform relatively simple distributions to the desired PDF.
  4. Variational AutoEncoders (VAE) optimize a lower bound of the likelihood (the evidence lower bound). Like GAN, they do not produce a true PDF at the end either.

Energy Based Models (EBM) take a different approach. An EBM only models an unnormalized function Eπœƒ(x), with the expectation that the actual PDF will be

pπœƒ = exp (Eπœƒ(x)) Zπœƒ ,whereZπœƒ = ∫ exp (Eπœƒ(x))

Zπœƒ, the normalization numerator and a function of πœƒ but not x , is also called the partition function. EBM has some of its roots in statistical physics and hence the name Energy Based Models. Eπœƒ(x), or in literature βˆ’ Eπœƒ(x), is called the energy function. Without the normalization requirement, and unlike autoregressive models and normalizing flow models, EBM can give Eπœƒ more flexibity and potentially make it more powerful.

Since EBMs only explicitly model Eπœƒ, but not Zπœƒ or pπœƒ, any task that strictly requires pπœƒ is out of the question. Eπœƒ, however, is sufficient for comparing pπœƒ(x1) and pπœƒ(x2), since

pπœƒ(x1) > pπœƒ(x2)⇔exp (Eπœƒ(x1)) > exp (Eπœƒ(x2))⇔Eπœƒ(x1) > Eπœƒ(x2) (1)

This property is enough to enable many practical deep learning applications, such as object recognition, painting restoration, and sequence labeling.
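To make this concrete, here is a minimal sketch of an energy network and of property (1), assuming PyTorch; the EnergyNet class, its MLP architecture, and all hyperparameters are illustrative choices, not something prescribed by the text.

import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    # A small MLP playing the role of E_theta: R^d -> R (illustrative architecture).
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                 # x: (batch, dim)
        return self.net(x).squeeze(-1)    # one scalar energy per sample

energy = EnergyNet(dim=2)
x1, x2 = torch.randn(1, 2), torch.randn(1, 2)
# Property (1): comparing energies compares unnormalized densities,
# because Z_theta cancels out of the comparison of p_theta(x1) and p_theta(x2).
print(energy(x1) > energy(x2))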

2 Sampling

Since EBMs do not explicitly model pπœƒ, how are samples drawn given Eπœƒ?

The Metropolis-Hastings Markov Chain Monte Carlo (M-H MCMC) method described in Algorithm 1 is a relatively simple solution. The [*] step ensures that enough of the space of x is sampled and that the algorithm does not get stuck at a local maximum. The M-H MCMC method works in theory, but can take a very long time to converge.

x := norm_random()
until convergence:
    y := x + πœ– Β· norm_random()
    if Eπœƒ(y) > Eπœƒ(x):
        x := y
    else:
        with probability exp(Eπœƒ(y) βˆ’ Eπœƒ(x)):
            x := y        [*]
return x

Algorithm 1: Metropolis-Hastings Markov Chain Monte Carlo method
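A hedged Python sketch of Algorithm 1 follows, assuming PyTorch and an energy callable (such as the EnergyNet sketch above) that maps a (1, dim) tensor to a scalar energy; the fixed step count and the step size eps stand in for "until convergence" and πœ– and are illustrative.

import math
import torch

def mh_sample(energy, dim, steps=10_000, eps=0.1):
    x = torch.randn(1, dim)                         # x := norm_random()
    for _ in range(steps):                          # stand-in for "until convergence"
        y = x + eps * torch.randn_like(x)           # propose a nearby point
        log_ratio = (energy(y) - energy(x)).item()  # E(y) - E(x); Z_theta cancels
        # Accept if the energy increases; otherwise accept with
        # probability exp(E(y) - E(x)), the [*] step.
        if log_ratio > 0 or torch.rand(()).item() < math.exp(log_ratio):
            x = y
    return x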

One obvious way to speed up the M-H MCMC method is to take advantage of the gradient of pπœƒ with respect to x and use it to find x with higher probability. That gradient, however, still depends on Zπœƒ:

βˆ‡xpπœƒ(x) = 1 Zπœƒ exp (Eπœƒ(x))βˆ‡xEπœƒ(x).

Fortunately, the gradient of log pπœƒ(x), also called the score function sπœƒ(x), only depends on Eπœƒ(x), because

sπœƒ(x) = βˆ‡x log pπœƒ(x) = βˆ‡xEπœƒ(x) βˆ’βˆ‡x log Zπœƒ = βˆ‡xEπœƒ(x).

The last step of the derivation works because Zπœƒ does not depend on x and hence βˆ‡x log Zπœƒ = 0. The Langevin MCMC method, described in Algorithm 2, works exactly by leveraging sπœƒ(x). Again, the randomization in the [*] step helps the algorithm get out of local maxima and sample more of the space of x.

x := norm_random()
until convergence:
    x := x + πœ– Β· sπœƒ(x) + √(2πœ–) Β· norm_random()        [*]
return x

Algorithm 2: Langevin MCMC method
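A corresponding sketch of Algorithm 2 under the same PyTorch assumptions; the score sπœƒ(x) = βˆ‡xEπœƒ(x) is computed with autograd, and steps and eps are again illustrative.

import torch

def langevin_sample(energy, dim, steps=1_000, eps=0.01):
    x = torch.randn(1, dim)                                    # x := norm_random()
    for _ in range(steps):                                     # stand-in for "until convergence"
        x = x.detach().requires_grad_(True)
        score = torch.autograd.grad(energy(x).sum(), x)[0]     # s_theta(x) = grad_x E_theta(x)
        # x := x + eps * s_theta(x) + sqrt(2 * eps) * noise, the [*] step
        x = x + eps * score + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x.detach()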

3 Training

There are multiple different ways of training EBMs. Some require sampling from the model being trained, and others do not.

3.1 Maximum Likelihood Method or Contrastive Divergence

Surprisingly, it is possible to conduct maximum likelihood optimization for EBM without modeling the PDF. Let’s start with a little math:

maxπœƒ Ex∼p(x) log pπœƒ(x) = maxπœƒ Ex∼p(x)[Eπœƒ(x) βˆ’ log Zπœƒ] = maxπœƒ [Ex∼p(x)Eπœƒ(x) βˆ’ log Zπœƒ]

The likelihood gradient for updating πœƒ is

βˆ‡πœƒ[Ex∼p(x)Eπœƒ(x) βˆ’ log Zπœƒ] = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ‡πœƒ log Zπœƒ = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒβˆ‡πœƒZπœƒ = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒβˆ‡πœƒ[∫ exp (Eπœƒ(x))dx] = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ βˆ‡πœƒ exp (Eπœƒ(x))dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ 1 Zπœƒ ∫ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ« 1 Zπœƒ exp (Eπœƒ(x))βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’βˆ« pπœƒ(x)βˆ‡πœƒEπœƒ(x)dx = Ex∼p(x)βˆ‡πœƒEπœƒ(x) βˆ’ Ex∼pπœƒ(x)βˆ‡πœƒEπœƒ(x)

Note that the first term involves p(x), the second term involves pπœƒ(x), and neither depends on Zπœƒ. Also note that the likelihood gradient here is with respect to πœƒ, the parameters, not x; don’t confuse it with the score function, which is a gradient with respect to x. The likelihood gradient points in the direction where the energy function’s gradient differs the most between real samples and model samples, which is probably where the name Contrastive Divergence comes from.

The big picture here is that even though the partition function Zπœƒ is not modeled, it is still possible to estimate the likelihood function’s gradient and conduct maximum likelihood training. Each training step though requires drawing samples from the model being trained, which can be expensive, as described in Section 2.
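As a sketch of what one such training step could look like (assuming PyTorch; the cd_step function, the sampler(energy, dim) signature, and the use of the Langevin sketch above as the sampler are all illustrative):

import torch

def cd_step(energy, optimizer, x_data, sampler):
    # Negative samples from the current model p_theta, detached so gradients
    # flow only through the energy evaluations below, not the sampling chain.
    x_model = sampler(energy, x_data.shape[-1]).detach()
    # Gradient ascent on E_p[grad E_theta] - E_{p_theta}[grad E_theta]
    # is gradient descent on this loss.
    loss = energy(x_model).mean() - energy(x_data).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()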

3.2 Score Matching

If βˆ‡xp(x) and βˆ‡xpπœƒ(x) are equal everywhere, then p(x) = pπœƒ(x) + constant. By the same reasoning, if βˆ‡x log p(x) and βˆ‡x log pπœƒ(x) are equal everywhere, then log p(x) = log pπœƒ(x) + constant. That constant difference must be zero, since both p(x) and pπœƒ(x) have to integrate to 1.

That is the key idea behind score matching, a method that matches the scores or score functions of two distributions everywhere as an alternative to maximum likelihood based training. The objective of score matching is to minimize the Fisher Divergence between p(x) and pπœƒ(x):

min πœƒFD(p(x),pπœƒ(x)) = min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡x log pπœƒ(x)βˆ₯22 = min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡xEπœƒ(x)βˆ₯22

We’ll show, in the univariate case, how this objective can be manipulated so that it does not depend on the unknown p(x).

min πœƒ1 2Ex∼p(x) βˆ₯βˆ‡x log p(x) βˆ’βˆ‡xEπœƒ(x)βˆ₯22 = min πœƒ1 2∫ p(x)[log β€²p(x) βˆ’ E πœƒβ€² (x)]2 = min πœƒ1 2∫ p(x)[(log β€²p(x))2 + (E πœƒβ€² (x))2 βˆ’ 2log β€²p(x)E πœƒβ€² (x)] = min πœƒ[1 2∫ p(x)(log β€²p(x))2 + 1 2∫ p(x)(Eπœƒβ€² (x))2 βˆ’βˆ« p(x)log β€²p(x)E πœƒβ€² (x)] (2)

The first term does not depend on πœƒ and can therefore be left out. The third term still has (log p(x))β€² in it. Recall that the integration by parts formula states that

βˆ«β‚α΅‡ u(x)vβ€²(x) dx = u(x)v(x)|β‚α΅‡ βˆ’ βˆ«β‚α΅‡ uβ€²(x)v(x) dx

and it can be used to rewrite the third term:

∫ p(x)(log p(x))β€²Eπœƒβ€²(x) dx
= ∫ p(x)(1/p(x))pβ€²(x)Eπœƒβ€²(x) dx
= ∫ pβ€²(x)Eπœƒβ€²(x) dx
= p(x)Eπœƒβ€²(x)|βˆ’βˆž+∞ βˆ’ ∫ p(x)Eπœƒβ€³(x) dx
= 0 βˆ’ ∫ p(x)Eπœƒβ€³(x) dx    (3)
= βˆ’Ex∼p(x)Eπœƒβ€³(x)    (4)

The derivation of (3) makes the very reasonable assumption that p(x) β†’ 0 as x β†’ +∞ and as x β†’ βˆ’βˆž.
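A quick Monte Carlo sanity check of (3)/(4), with p = N(0, 1) (so (log p(x))β€² = βˆ’x) and an illustrative energy Eπœƒ(x) = βˆ’(a/2)xΒ², for which Eπœƒβ€³(x) = βˆ’a; both sides should come out close to a.

import torch

a = 0.7
x = torch.randn(1_000_000)              # samples from p = N(0, 1)
lhs = ((-x) * (-a * x)).mean()          # E_p[(log p)'(x) * E_theta'(x)]
rhs = a                                 # -E_p[E_theta''(x)] = -(-a)
print(lhs.item(), rhs)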

Eliminating the first term in (2), rewriting the second term in expectation form, and substituting the third term with (4), we have

min πœƒ[1 2Ex∼p(x)(Eπœƒβ€² (x))2 + E x∼p(x)Eπœƒβ€³ (x)] = min πœƒEx∼p(x)[1 2(Eπœƒβ€² (x))2 + E πœƒβ€³ (x)].

The multivariate version of the objective can be shown to be

min πœƒEx∼p(x)[1 2 βˆ₯βˆ‡πœƒEπœƒ(x)βˆ₯22 + tr(βˆ‡ πœƒ2E πœƒ(x))]

where the second term is the trace of the Hessian matrix of Eπœƒ(x). Loosely speaking, the first term tries to find πœƒ such that the samples x are local maxima or minima of the energy (with gradients as close to 0 as possible), and the second term tries to make sure they are actually local maxima (with second-order gradients as negative as possible).

The score matching training method avoids the very expensive procedure of drawing samples from the model being trained. Its main expensive operation is the computation of the trace of the Hessian matrix. There is more research in this space that will be explored in future tutorials.
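A sketch of the multivariate loss under the same PyTorch assumptions; the exact Hessian trace takes one extra backward pass per input dimension, which is precisely the expensive part mentioned above.

import torch

def score_matching_loss(energy, x):                 # x: (n, d)
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]   # grad_x E_theta(x)
    trace = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):                     # one backward pass per dimension
        trace = trace + torch.autograd.grad(
            grad[:, i].sum(), x, create_graph=True)[0][:, i]
    return (0.5 * grad.pow(2).sum(dim=1) + trace).mean()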

3.3 Noise Contrastive Estimation

Noise Contrastive Estimation (NCE) is another method for training EBMs that does not require drawing samples from the model being trained. Recall that in Generative Adversarial Networks (GAN) (amaires.github.io/GAN), given a fixed Generator GΟ•, the optimal Discriminator Dπœƒβ€™s output is

Dπœƒβˆ—(x) = p(x) p(x) + pΟ•(x)

The result holds if GΟ• and pΟ• are replaced with any static known noise distribution pn(x). That is

Dπœƒβˆ—(x) = p(x) p(x) + pn(x)

Note here n is not a parameter; it just means noise.

If Dπœƒβ€™s neural network is explicitly constructed as

Dπœƒ(x) = Fπœƒ(x) Fπœƒ(x) + pn(x)

then

Dπœƒβˆ—(x) = p(x) p(x) + pn(x) ≃ Fπœƒβˆ—(x) Fπœƒβˆ—(x) + pn(x)

Solving this shows that Fπœƒβˆ—(x) ≃ p(x), which also means Fπœƒ(x) ends up automatically normalized if all stars are aligned. Now if Fπœƒ(x) is replaced with an energy-function-based PDF

exp (Eπœƒ(x)) Z

where Z is an additional parameter, which is not guaranteed to be equal to Eπœƒ(x)’s partition function Zπœƒ, then we have

Dπœƒ,Z(x) = exp (Eπœƒ(x)) Z exp (Eπœƒ(x)) Z + pn(x) = exp (Eπœƒ(x)) exp (Eπœƒ(x)) + Zpn(x) (5)

and

exp (Eπœƒβˆ—(x)) Zβˆ— ≃ p(x)

where Eπœƒβˆ— would be our trained energy model.

With Dπœƒ constructed as in (5), Dπœƒβ€™s optimization objective becomes

max πœƒ,ZEx∼p(x) log Dπœƒ,Z(x) + Ex∼pn(x) log (1 βˆ’ Dπœƒ,Z(x)) = max πœƒ,ZEx∼p(x)[Eπœƒ(x) βˆ’ log (exp (Eπœƒ(x)) + Zpn(x)] + Ex∼pn(x)[log (Zpn(x)) βˆ’ log (exp (Eπœƒ(x)) + Zpn(x)]

3.4 Flow Contrastive Estimation

In theory, there are no requirements on the static noise distribution pn(x) for NCE. In practice, the closer pn(x) is to p(x) (without being identical to it), the more effective NCE is. Flow Contrastive Estimation parameterizes pn(x) as pΟ•(x) with a normalizing flow model, because normalizing flow models are easy to sample from and give a tractable PDF. The discriminator is now modeled as

Dπœƒ,Z,Ο•(x) = exp (Eπœƒ(x)) exp (Eπœƒ(x)) + ZpΟ•(x)

and the objective function is

max πœƒ,Z min Ο•Ex∼p(x) log (Dπœƒ,Z,Ο•(x)) + Ex∼pΟ•(x) log (1 βˆ’ Dπœƒ,Z,Ο•(x))