Optimization Objectives
Amaires@May 2024
1 Introduction
Given samples $(x, y)$ from a distribution with probability density function $p(x, y)$, the optimization goal of a classification problem or a regression problem is to find a good $q_\theta(y|x)$, where $\theta$ is the parameter of a chosen family of probability density functions. The objective can be derived in three different but related ways.
1.1 K-L divergence of conditional distribution
One criterion of a good $q_\theta(y|x)$ is how close it is to the true conditional $p(y|x)$. One such closeness measure is the K-L divergence between $p(y|x)$ and $q_\theta(y|x)$, which is

$$D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)=\mathbb{E}_{y\sim p(y|x)}\big[\log p(y|x)\big]-\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big].$$

Of course, this should work across all $x$, therefore our objective should be

$$\min_\theta\ \mathbb{E}_{x\sim p(x)}\Big[D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)\Big]=\min_\theta\ \Big(\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log p(y|x)\big]-\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big]\Big).$$

For our purpose, the first term is an unknown constant independent of $\theta$. Removing this constant, our objective changes to

$$\min_\theta\ -\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big].$$

The nested expectation above can also be written as a single expectation over the joint distribution:

$$-\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$

So the optimization objective can be stated as any of the following three equivalent minimizations:

$$\min_\theta\ \mathbb{E}_{x\sim p(x)}\Big[D_{\mathrm{KL}}\big(p(y|x)\,\|\,q_\theta(y|x)\big)\Big]\;\Longleftrightarrow\;\min_\theta\ -\mathbb{E}_{x\sim p(x)}\mathbb{E}_{y\sim p(y|x)}\big[\log q_\theta(y|x)\big]\;\Longleftrightarrow\;\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
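To make the constant-term argument concrete, here is a minimal NumPy sketch (not part of the original derivation) for a single $x$ with a discrete $y$ over three classes; the arrays p_y_given_x and q_y_given_x are made-up toy values. It checks numerically that the K-L divergence equals the expected negative log-likelihood plus a term that does not depend on the model.

```python
import numpy as np

# Toy conditional distributions for one fixed x (illustrative values only).
p_y_given_x = np.array([0.7, 0.2, 0.1])  # true p(y|x)
q_y_given_x = np.array([0.6, 0.3, 0.1])  # model q_theta(y|x)

# KL(p || q) for this x.
kl = np.sum(p_y_given_x * np.log(p_y_given_x / q_y_given_x))

# First term: E_{y~p(y|x)}[log p(y|x)], a constant with respect to the model.
const_term = np.sum(p_y_given_x * np.log(p_y_given_x))

# Second term: E_{y~p(y|x)}[-log q_theta(y|x)], the part we actually minimize.
expected_nll = -np.sum(p_y_given_x * np.log(q_y_given_x))

print(np.isclose(kl, const_term + expected_nll))  # True
```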
1.2 K-L divergence of joint distribution
Since $p(x, y) = p(x)\,p(y|x)$, it is easy to arrive at the same conclusion by minimizing the K-L divergence between $p(x, y)$ and $p(x)\,q_\theta(y|x)$:

$$D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,q_\theta(y|x)\big)=\mathbb{E}_{(x,y)\sim p(x,y)}\Big[\log\frac{p(x)\,p(y|x)}{p(x)\,q_\theta(y|x)}\Big]=\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log p(y|x)\big]-\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$

Again, the first term is an unknown constant independent of $\theta$ that can be removed. The objective changes to

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
1.3 Maximum likelihood
Given a set of samples $\{(x_i, y_i)\}_{i=1}^{N}$, assumed to be i.i.d., one objective could be to maximize the likelihood of observing these samples, which is

$$\max_\theta\ \prod_{i=1}^{N} p(x_i)\,q_\theta(y_i|x_i).$$

This is equivalent to minimizing the negative log likelihood

$$\min_\theta\ -\sum_{i=1}^{N}\log p(x_i)-\sum_{i=1}^{N}\log q_\theta(y_i|x_i).$$

As before, the first term is an unknown constant independent of $\theta$. Once the first term is removed, the objective becomes

$$\min_\theta\ -\sum_{i=1}^{N}\log q_\theta(y_i|x_i).$$

Dividing by the number of samples $N$ and rewriting in expectation form, the objective becomes

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big].$$
This is the same as what is derived in Section 1.1 and Section 1.2.
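As an illustration of the last step, the following sketch computes the sample-average negative log-likelihood, the empirical estimate of $-\mathbb{E}_{(x,y)\sim p(x,y)}[\log q_\theta(y|x)]$; the function model_log_prob and the data are hypothetical, used only to make the sketch runnable.

```python
import numpy as np

def model_log_prob(x, y, theta):
    """Hypothetical stand-in for log q_theta(y|x): a logistic model for binary y."""
    logit = theta[0] + theta[1] * x
    p1 = 1.0 / (1.0 + np.exp(-logit))            # q_theta(y=1|x)
    return np.where(y == 1, np.log(p1), np.log(1.0 - p1))

# Toy i.i.d. samples (x_i, y_i) and a parameter value theta.
xs = np.array([0.5, -1.2, 2.0, 0.3])
ys = np.array([1, 0, 1, 1])
theta = np.array([0.1, 0.8])

# Negative log-likelihood divided by N: the quantity minimized over theta.
avg_nll = -np.mean(model_log_prob(xs, ys, theta))
print(avg_nll)
```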
2 Classification
In a classification problem, $y$ takes on a fixed number of possible values, usually encoded using the numbers 1 through $C$. A classifier usually outputs the entire probability vector $\big(q_\theta(y=1|x),\ldots,q_\theta(y=C|x)\big)$. In the case of a binary classification problem, however, it is more customary to use $\{0, 1\}$ to encode the two possible values that $y$ can take, and the classifier only outputs $q_\theta(y=1|x)$, with $q_\theta(y=0|x)$ implied to be $1-q_\theta(y=1|x)$. In this case, the optimization objective can be rewritten as

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\Big[y\log q_\theta(y=1|x)+(1-y)\log\big(1-q_\theta(y=1|x)\big)\Big].$$

This is usually called the binary cross-entropy objective.
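For concreteness, a minimal sketch of the binary cross-entropy computation, with made-up labels and model outputs (q1 stands for $q_\theta(y=1|x)$ on each sample):

```python
import numpy as np

y  = np.array([1.0, 0.0, 1.0, 1.0])   # labels encoded as 0/1
q1 = np.array([0.9, 0.2, 0.6, 0.7])   # model outputs q_theta(y=1|x)

# Binary cross entropy: sample average of -[y log q1 + (1 - y) log(1 - q1)].
bce = -np.mean(y * np.log(q1) + (1.0 - y) * np.log(1.0 - q1))
print(bce)
```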
3 Regression
In a regression problem, a neural network's output can be interpreted as the mean $\mu_\theta(x)$ of a normal distribution $\mathcal{N}\big(\mu_\theta(x),\sigma^2 I\big)$ with a fixed variance $\sigma^2$. With this interpretation, the optimization objective can be rewritten as

$$\min_\theta\ -\mathbb{E}_{(x,y)\sim p(x,y)}\big[\log q_\theta(y|x)\big]=\min_\theta\ \mathbb{E}_{(x,y)\sim p(x,y)}\Big[\frac{\|y-\mu_\theta(x)\|^2}{2\sigma^2}+\frac{d}{2}\log\big(2\pi\sigma^2\big)\Big],$$

where $d$ is the dimension of $y$. Since $\sigma$ is fixed, this objective is equivalent to

$$\min_\theta\ \mathbb{E}_{(x,y)\sim p(x,y)}\big[\|y-\mu_\theta(x)\|^2\big],$$

which is the well-known mean squared error objective.
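The equivalence can be checked numerically. In the sketch below (toy targets and predictions, fixed $\sigma = 1$), the average Gaussian negative log-likelihood differs from the mean squared error only by a fixed scale and an additive constant, so both have the same minimizer in $\theta$:

```python
import numpy as np

y  = np.array([[1.0, 2.0], [0.5, -1.0]])   # targets, d = 2
mu = np.array([[0.8, 2.1], [0.7, -1.2]])   # network outputs mu_theta(x)
sigma = 1.0                                 # fixed variance parameter
d = y.shape[1]

sq_err = np.sum((y - mu) ** 2, axis=1)

# Average Gaussian negative log-likelihood over the samples.
nll = np.mean(sq_err / (2 * sigma ** 2) + 0.5 * d * np.log(2 * np.pi * sigma ** 2))

# Mean squared error objective.
mse = np.mean(sq_err)

# NLL = MSE / (2 sigma^2) + constant, so minimizing one minimizes the other.
print(np.isclose(nll, mse / (2 * sigma ** 2) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)))
```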