Optimization Objectives


Amaires@May 2024

1 Introduction

Given samples (x,y) from a distribution with probability density function p(x,y), the optimization goal of a classification problem or a regression problem is to find a good pπœƒ(y|x) where πœƒ is the parameter of a chosen family of probability density functions. The objective can be derived in three different but related ways.

1.1 K-L divergence of conditional distribution

One criterion of a good pπœƒ(y|x) is how close it is to p(y|x). One such closeness measure is the K-L divergence between p(y|x) and pπœƒ(y|x), which is $\int p(y|x)\log\frac{p(y|x)}{p_\theta(y|x)}\,dy$. Of course, this should hold across all x, so our objective should be

min πœƒ ∫ p(x)[∫ p(y|x)log p(y|x) pπœƒ(y|x)dy]dx = min πœƒ[∫ p(x)[∫ p(y|x)log p(y|x)dy]dx βˆ’βˆ« p(x)[∫ p(y|x)log pπœƒ(y|x)dy]dx]

For our purposes, the first term is an unknown constant independent of πœƒ. Removing this constant, our objective becomes

min πœƒ βˆ’βˆ« p(x)[∫ p(y|x)log pπœƒ(y|x)dy]dx = min πœƒ βˆ’ Ex∼p(x)Ey∼p(y|x) log pπœƒ(y|x) = min πœƒ βˆ’ Ex,y∼p(x,y) log pπœƒ(y|x)

The left-hand side of the above equation can also be rewritten as

min πœƒ βˆ’βˆ« p(x)[∫ p(y|x)log pπœƒ(y|x)dy]dx = min πœƒ βˆ’βˆ« ∫ p(x)p(y|x)log pπœƒ(y|x)dydx = min πœƒ βˆ’βˆ« ∫ p(x,y)log pπœƒ(y|x)dydx = min πœƒ βˆ’βˆ« p(y)∫ p(x|y)log pπœƒ(y|x)dxdy = min πœƒ βˆ’ Ey∼p(y)Ex∼p(x|y) log pπœƒ(y|x) = min πœƒ βˆ’ Ex,y∼p(x,y) log pπœƒ(y|x)

In short, the optimization objective can be written in any of the following three equivalent forms:

min πœƒ βˆ’ Ex∼p(x)Ey∼p(y|x) log pπœƒ(y|x) min πœƒ βˆ’ Ey∼p(y)Ex∼p(x|y) log pπœƒ(y|x) min πœƒ βˆ’ Ex,y∼p(x,y) log pπœƒ(y|x)

1.2 K-L divergence of joint distribution

Since pπœƒ(x,y) = p(x)pπœƒ(x,y), it is easy to arrive at the same conclusions by minimizing the K-L divergence between p(x,y) and pπœƒ(x,y):

min πœƒ ∫ ∫ p(x,y)log p(x,y) pπœƒ(x,y)dxdy = min πœƒ ∫ ∫ p(x,y)log p(x)p(y|x) p(x)pπœƒ(y|x)dxdy = min πœƒ ∫ ∫ p(x,y)log p(y|x) pπœƒ(y|x)dxdy = min πœƒ[∫ ∫ p(x,y)log p(y|x)dxdy βˆ’βˆ« ∫ p(x,y)log pπœƒ(y|x)dxdy]

Again, the first term is an unknown constant independent of πœƒ that can be removed. The objective changes to

min πœƒ βˆ’βˆ« ∫ p(x,y)log pπœƒ(y|x)dxdy = min πœƒEx,y∼p(x,y) log pπœƒ(y|x) = min πœƒ ∫ p(x)[∫ p(y|x)log pπœƒ(y|x)dy]dx = min πœƒEx∼p(x)Ey∼p(y|x) log pπœƒ(y|x) = min πœƒ ∫ p(y)[∫ p(x|y)log pπœƒ(y|x)dx]dy = min πœƒEy∼p(y)Ex∼p(x|y) log pπœƒ(y|x)

1.3 Maximum likelihood

Given a set of samples $(x_i, y_i)$, assumed to be i.i.d., one objective is to maximize the likelihood of observing these samples, which is

max πœƒ ∏ ipπœƒ(xi,yi)

This is equivalent to minimizing the negative log likelihood

\[
\min_\theta -\sum_i \log p_\theta(x_i, y_i)
= \min_\theta -\sum_i \log p(x_i)\,p_\theta(y_i|x_i)
= \min_\theta -\left(\sum_i \log p(x_i) + \sum_i \log p_\theta(y_i|x_i)\right)
\]

As before, the first term is an unknown constant independent of πœƒ. Once the first term is removed, the objective becomes

min πœƒ βˆ’βˆ‘ i log pπœƒ(yi|xi)

Dividing it by the number of samples and rewriting it in expectation form, the objective becomes

min πœƒ βˆ’ Ex,y∼p(x,y) log pπœƒ(y|x)

This is the same as what is derived in Section 1.1 and Section 1.2.
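As an illustration, the sketch below fits a toy conditional model pπœƒ(y = 1|x) = Οƒ(πœƒx) by gradient descent on the average negative log-likelihood; the data-generating value πœƒ = 2.0 is an assumption made purely for this example:

    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.normal(size=200)
    # Labels drawn from p(y=1|x) = sigmoid(2.0 * x); 2.0 is the "true" theta.
    ys = (rng.random(200) < 1 / (1 + np.exp(-2.0 * xs))).astype(float)

    theta, lr = 0.0, 0.1
    for _ in range(2000):
        p1 = 1 / (1 + np.exp(-theta * xs))    # p_theta(y=1|x)
        grad = np.mean((p1 - ys) * xs)        # gradient of the average NLL
        theta -= lr * grad
    print(theta)                              # should approach 2.0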

2 Classification

In a classification problem, y takes on a fixed number of possible values, usually encoded using numbers from 1 through K. A classifier usually outputs the entire probability vector pπœƒ(y = 1|x), pπœƒ(y = 2|x), pπœƒ(y = 3|x), β‹―, pπœƒ(y = K|x). In the case of a binary classification problem, however, it is more customary to use {0,1} to encode the two possible values that y can take, and the classifier only outputs f(x) = pπœƒ(y = 1|x), with pπœƒ(y = 0|x) implied to be 1 βˆ’ f(x). In this case, the optimization objective can be rewritten as

min πœƒ βˆ’ Ex,y∼p(x,y)(ylog f(x) + (1 βˆ’ y)log (1 βˆ’ f(x))

This is usually called the binary cross-entropy objective.
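A minimal numpy sketch of this loss, assuming f_x holds the model outputs f(x) ∈ (0,1) and y the corresponding {0,1} labels:

    import numpy as np

    def binary_cross_entropy(y, f_x, eps=1e-12):
        # Empirical estimate of -E[y log f(x) + (1 - y) log(1 - f(x))];
        # eps guards against log(0) when f(x) saturates at 0 or 1.
        f_x = np.clip(f_x, eps, 1 - eps)
        return -np.mean(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))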

3 Regression

In a regression problem, a neural network’s output can be interpreted as the mean ΞΌπœƒ(x) of a normal distribution N(ΞΌπœƒ(x),I). With this interpretation, the optimization objective can be rewritten as

min πœƒ βˆ’ Ex,y∼p(x,y) log pπœƒ(y|x) = min πœƒ βˆ’ Ex,y∼p(x,y) log (2Ο€)βˆ’d 2 exp (βˆ’1 2 βˆ₯y βˆ’ ΞΌπœƒ(x)βˆ₯2) = βˆ’d 2log (2Ο€) + 1 2min πœƒEx,y∼p(x,y) βˆ₯y βˆ’ ΞΌπœƒ(x)βˆ₯2

where d is the dimension of y. This objective is equivalent to

min πœƒEx,y∼p(x,y) βˆ₯y βˆ’ ΞΌπœƒ(x)βˆ₯2

which is the well-known mean squared error objective.
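The equivalence can also be checked numerically: for any set of predictions, the Gaussian negative log-likelihood and the squared error differ only by the constant (d/2)log(2Ο€) and the factor 1/2. A small sketch with made-up targets and predictions:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=(100, 5))        # targets, d = 5
    mu = rng.normal(size=(100, 5))       # model means mu_theta(x)
    d = y.shape[1]

    sq_err = np.sum((y - mu) ** 2, axis=1)
    nll = 0.5 * d * np.log(2 * np.pi) + 0.5 * sq_err   # -log N(y; mu, I)
    mse = np.mean(sq_err)
    print(np.isclose(np.mean(nll), 0.5 * d * np.log(2 * np.pi) + 0.5 * mse))  # True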