Idea

The cross entropy loss is often used in classification models in machine learning, where the model is supposed to output a probability distribution over the candidate categories to which the input image may belong.

For an input image $x$, we have the predicted class distribution $q = q_\theta(\cdot \mid x)$ obtained by a model with parameters $\theta$, and the true class distribution $p = p(\cdot \mid x)$. The loss function should measure how far the predicted distribution is from the true one, and a natural way to do this is the KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i} = \sum_{i} p_i \log p_i - \sum_{i} p_i \log q_i.$$

Note that the first term $\sum_{i} p_i \log p_i$ is constant with respect to the model parameters $\theta$, so minimizing the KL divergence reduces to minimizing

$$-\sum_{i} p_i \log q_i,$$

and we can define the cross entropy loss as

$$L_{\mathrm{CE}}(p, q) = -\sum_{i} p_i \log q_i.$$

In conclusion, the cross entropy loss is the natural loss derived from the KL divergence.
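As a quick sanity check, here is a minimal NumPy sketch (the example values of $p$ and $q$ are assumed purely for illustration) showing that the cross entropy differs from the KL divergence only by the constant entropy of $p$, so minimizing one is equivalent to minimizing the other.

```python
# Sketch: cross entropy = KL divergence + entropy of p (example values assumed).
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # true class distribution (one-hot here)
q = np.array([0.2, 0.7, 0.1])   # predicted class distribution

eps = 1e-12                      # avoid log(0) on zero-probability entries

cross_entropy = -np.sum(p * np.log(q + eps))
entropy_p     = -np.sum(p * np.log(p + eps))
kl_divergence = np.sum(p * np.log((p + eps) / (q + eps)))

# The two printed numbers agree up to floating-point error (~0.3567 each).
print(cross_entropy, kl_divergence + entropy_p)
```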

Form with Softmax

If we take $p$ as the true distribution (typically a one-hot vector with $p_c = 1$ for the correct class $c$), and $q$ as the predicted distribution produced by the softmax activation function applied to the logits $z$,

$$q_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}},$$

then the cross entropy loss is

$$L = -\sum_{i} p_i \log q_i = -\log q_c = -\log \frac{e^{z_c}}{\sum_{j} e^{z_j}}.$$
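A minimal NumPy sketch of this form (the logits `z` and class index `c` are illustrative assumptions); it applies the standard log-sum-exp shift, which the formula above does not require but which keeps the computation numerically stable:

```python
# Sketch: softmax cross entropy L = -log softmax(z)[c], via log-sum-exp.
import numpy as np

def softmax_cross_entropy(z, c):
    """Cross entropy of a one-hot target class c against softmax(z)."""
    z_shifted = z - np.max(z)                           # log-sum-exp shift
    log_q = z_shifted - np.log(np.sum(np.exp(z_shifted)))  # log softmax
    return -log_q[c]                                    # -log q_c

z = np.array([2.0, 1.0, 0.1])                           # example logits
print(softmax_cross_entropy(z, c=0))                    # ~0.4170
```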

And the derivative of $L$ with respect to any logit $z_k$ would be

$$\frac{\partial L}{\partial z_k} = q_k - p_k,$$

since $L = -\log q_c = -\left(z_c - \log \sum_{j} e^{z_j}\right)$, so $\frac{\partial L}{\partial z_k} = q_k - \mathbb{1}[k = c] = q_k - p_k$.
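A small sketch (example values assumed, using NumPy) that checks this analytic gradient $q - p$ against a central finite-difference approximation:

```python
# Sketch: verify dL/dz = q - p numerically (example z and one-hot p assumed).
import numpy as np

def loss(z, p):
    q = np.exp(z - np.max(z))
    q = q / q.sum()
    return -np.sum(p * np.log(q))

z = np.array([2.0, 1.0, 0.1])
p = np.array([1.0, 0.0, 0.0])          # one-hot true distribution

q = np.exp(z - np.max(z)); q = q / q.sum()
analytic = q - p                        # the derivative derived above

h = 1e-6
numeric = np.array([
    (loss(z + h * e, p) - loss(z - h * e, p)) / (2 * h)
    for e in np.eye(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```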