Optimization Objectives
Amaires@May 2024
1 Introduction
Given samples $(x, y)$ from a distribution with probability density function $p(x, y)$, the optimization goal of a classification problem or a regression problem is to find a good $q_\theta(y|x)$, where $\theta$ is the parameter of a chosen family of probability density functions. The objective can be derived in three different but related ways.
1.1 K-L divergence of conditional distribution
One criterion of a good $q_\theta(y|x)$ is how close it is to the true conditional $p(y|x)$. One such closeness measure is the K-L divergence between $p(y|x)$ and $q_\theta(y|x)$, which is

$$D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)=\mathbb{E}_{y\sim p(y|x)}\big[\log p(y|x)\big]-\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big].$$

Of course, this should work across all $x$, therefore our objective should be

$$\min_\theta\ \mathbb{E}_{x\sim p(x)}\Big[D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)\Big]=\min_\theta\ \Big(\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log p(y|x)\big]-\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big]\Big).$$

For our purpose, the first term is an unknown constant independent of $\theta$. Removing this constant, our objective changes to

$$\min_\theta\ -\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big].$$

The nested expectation above can also be written as a single expectation over the joint distribution:

$$-\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$

So the optimization objective can be stated as any of the following three equivalent minimizations:

$$\min_\theta\ \mathbb{E}_{x\sim p(x)}\Big[D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)\Big]\;\Longleftrightarrow\;\min_\theta\ -\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big]\;\Longleftrightarrow\;\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
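To make the constant-term argument concrete, here is a minimal NumPy sketch (not part of the original derivation) for a single $x$ with a discrete $y$ over three classes; the arrays p_y_given_x and q_y_given_x are made-up toy values. It checks numerically that the K-L divergence equals the expected negative log-likelihood plus a term that does not depend on the model.

```python
import numpy as np

# Toy conditional distributions for one fixed x (illustrative values only).
p_y_given_x = np.array([0.7, 0.2, 0.1])  # true p(y|x)
q_y_given_x = np.array([0.6, 0.3, 0.1])  # model q_theta(y|x)

# KL(p || q) for this x.
kl = np.sum(p_y_given_x * np.log(p_y_given_x / q_y_given_x))

# First term: E_{y~p(y|x)}[log p(y|x)], a constant with respect to the model.
const_term = np.sum(p_y_given_x * np.log(p_y_given_x))

# Second term: E_{y~p(y|x)}[-log q_theta(y|x)], the part we actually minimize.
expected_nll = -np.sum(p_y_given_x * np.log(q_y_given_x))

print(np.isclose(kl, const_term + expected_nll))  # True
```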
1.2 K-L divergence of joint distribution
Since $p(x, y) = p(x)\,p(y|x)$, it is easy to arrive at the same conclusion by minimizing the K-L divergence between $p(x, y)$ and $p(x)\,q_\theta(y|x)$:

$$D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,q_\theta(y|x)\big)=\mathbb{E}_{(x,y)\sim p(x,y)}\Big[\log\frac{p(x)\,p(y|x)}{p(x)\,q_\theta(y|x)}\Big]=\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log p(y|x)\big]-\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$

Again, the first term is an unknown constant independent of $\theta$ that can be removed. The objective changes to

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
1.3 Maximum likelihood
Given a set of samples $\{(x_i, y_i)\}_{i=1}^{N}$, assumed to be i.i.d., one objective could be to maximize the likelihood of observing these samples, which is

$$\max_\theta\ \prod_{i=1}^{N} p(x_i)\,q_\theta(y_i|x_i).$$

This is equivalent to minimizing the negative log likelihood

$$\min_\theta\ -\sum_{i=1}^{N}\log p(x_i)-\sum_{i=1}^{N}\log q_\theta(y_i|x_i).$$

As before, the first term is an unknown constant independent of $\theta$. Once the first term is removed, the objective becomes

$$\min_\theta\ -\sum_{i=1}^{N}\log q_\theta(y_i|x_i).$$

Dividing by the number of samples $N$ and rewriting in expectation form, the objective becomes

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
This is the same as what is derived in Section 1.1 and Section 1.2.
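As an illustration of the last step, the following sketch computes the sample-average negative log-likelihood, the empirical estimate of $-\mathbb{E}_{(x,y)\sim p(x,y)}[\log q_\theta(y|x)]$; the function model_log_prob and the data are hypothetical, used only to make the sketch runnable.

```python
import numpy as np

def model_log_prob(x, y, theta):
    """Hypothetical stand-in for log q_theta(y|x): a logistic model for binary y."""
    logit = theta[0] + theta[1] * x
    p1 = 1.0 / (1.0 + np.exp(-logit))            # q_theta(y=1|x)
    return np.where(y == 1, np.log(p1), np.log(1.0 - p1))

# Toy i.i.d. samples (x_i, y_i) and a parameter value theta.
xs = np.array([0.5, -1.2, 2.0, 0.3])
ys = np.array([1, 0, 1, 1])
theta = np.array([0.1, 0.8])

# Negative log-likelihood divided by N: the quantity minimized over theta.
avg_nll = -np.mean(model_log_prob(xs, ys, theta))
print(avg_nll)
```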
2 Classification
In a classification problem, $y$ takes on a fixed number of possible values, usually encoded using the numbers 1 through $C$. A classifier usually outputs the entire probability vector $\big(q_\theta(y=1|x),\ldots,q_\theta(y=C|x)\big)$. In the case of a binary classification problem, however, it is more customary to use $\{0, 1\}$ to encode the two possible values that $y$ can take, and the classifier only outputs $q_\theta(y=1|x)$, with $q_\theta(y=0|x)$ implied to be $1-q_\theta(y=1|x)$. In this case, the optimization objective can be rewritten as

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\Big[y\log q_\theta(y=1|x)+(1-y)\log\big(1-q_\theta(y=1|x)\big)\Big].$$

This is usually called the binary cross-entropy objective.
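For concreteness, a minimal sketch of the binary cross-entropy computation, with made-up labels and model outputs (q1 stands for $q_\theta(y=1|x)$ on each sample):

```python
import numpy as np

y  = np.array([1.0, 0.0, 1.0, 1.0])   # labels encoded as 0/1
q1 = np.array([0.9, 0.2, 0.6, 0.7])   # model outputs q_theta(y=1|x)

# Binary cross entropy: sample average of -[y log q1 + (1 - y) log(1 - q1)].
bce = -np.mean(y * np.log(q1) + (1.0 - y) * np.log(1.0 - q1))
print(bce)
```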
3 Regression
In a regression problem, a neural network's output can be interpreted as the mean $\mu_\theta(x)$ of a normal distribution $\mathcal{N}\big(\mu_\theta(x),\sigma^2 I\big)$ with a fixed variance $\sigma^2$. With this interpretation, the optimization objective can be rewritten as

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big]=\min_\theta\ \mathbb{E}_{(x,y)\sim p(x,y)}\Big[\frac{\|y-\mu_\theta(x)\|^2}{2\sigma^2}+\frac{d}{2}\log\big(2\pi\sigma^2\big)\Big],$$

where $d$ is the dimension of $y$. Since $\sigma$ is fixed, this objective is equivalent to

$$\min_\theta\ \mathbb{E}_{(x,y)\sim p(x,y)}\big[\|y-\mu_\theta(x)\|^2\big],$$

which is the well-known mean squared error objective.
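The equivalence can be checked numerically. In the sketch below (toy targets and predictions, fixed $\sigma = 1$), the average Gaussian negative log-likelihood differs from the mean squared error only by a fixed scale and an additive constant, so both have the same minimizer in $\theta$:

```python
import numpy as np

y  = np.array([[1.0, 2.0], [0.5, -1.0]])   # targets, d = 2
mu = np.array([[0.8, 2.1], [0.7, -1.2]])   # network outputs mu_theta(x)
sigma = 1.0                                 # fixed variance parameter
d = y.shape[1]

sq_err = np.sum((y - mu) ** 2, axis=1)

# Average Gaussian negative log-likelihood over the samples.
nll = np.mean(sq_err / (2 * sigma ** 2) + 0.5 * d * np.log(2 * np.pi * sigma ** 2))

# Mean squared error objective.
mse = np.mean(sq_err)

# NLL = MSE / (2 sigma^2) + constant, so minimizing one minimizes the other.
print(np.isclose(nll, mse / (2 * sigma ** 2) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)))
```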