Idea

The cross entropy loss is often used in classification models in machine learning, where the model is supposed to output a probability distribution over the candidate categories to which the input image may belong.

For an input image $x$, we have the predicted class distribution $q = q_\theta(\cdot \mid x)$ obtained by a model with parameters $\theta$, and the true class distribution $p = p(\cdot \mid x)$. The loss function should measure how far the predicted distribution is from the true one, and a natural way to do this is the KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i} = \sum_{i} p_i \log p_i - \sum_{i} p_i \log q_i.$$

Note that the first term $\sum_{i} p_i \log p_i$ is constant with respect to the model parameters $\theta$, so minimizing the KL divergence reduces to minimizing

$$-\sum_{i} p_i \log q_i,$$

and we can define the cross entropy loss as

$$L_{\mathrm{CE}}(p, q) = -\sum_{i} p_i \log q_i.$$

In conclusion, the cross entropy loss is the natural loss derived from the KL divergence.
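As a quick sanity check, here is a minimal NumPy sketch (the example values of $p$ and $q$ are assumed purely for illustration) showing that the cross entropy differs from the KL divergence only by the constant entropy of $p$, so minimizing one is equivalent to minimizing the other.

```python
# Sketch: cross entropy = KL divergence + entropy of p (example values assumed).
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # true class distribution (one-hot here)
q = np.array([0.2, 0.7, 0.1])   # predicted class distribution

eps = 1e-12                      # avoid log(0) on zero-probability entries

cross_entropy = -np.sum(p * np.log(q + eps))
entropy_p     = -np.sum(p * np.log(p + eps))
kl_divergence = np.sum(p * np.log((p + eps) / (q + eps)))

# The two printed numbers agree up to floating-point error (~0.3567 each).
print(cross_entropy, kl_divergence + entropy_p)
```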

Form with Softmax

If we take $p$ as the true distribution (typically a one-hot vector with $p_c = 1$ for the correct class $c$), and $q$ as the predicted distribution produced by the softmax activation function applied to the logits $z$,

$$q_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}},$$

then the cross entropy loss is

$$L = -\sum_{i} p_i \log q_i = -\log q_c = -\log \frac{e^{z_c}}{\sum_{j} e^{z_j}}.$$
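A minimal NumPy sketch of this form (the logits `z` and class index `c` are illustrative assumptions); it applies the standard log-sum-exp shift, which the formula above does not require but which keeps the computation numerically stable:

```python
# Sketch: softmax cross entropy L = -log softmax(z)[c], via log-sum-exp.
import numpy as np

def softmax_cross_entropy(z, c):
    """Cross entropy of a one-hot target class c against softmax(z)."""
    z_shifted = z - np.max(z)                           # log-sum-exp shift
    log_q = z_shifted - np.log(np.sum(np.exp(z_shifted)))  # log softmax
    return -log_q[c]                                    # -log q_c

z = np.array([2.0, 1.0, 0.1])                           # example logits
print(softmax_cross_entropy(z, c=0))                    # ~0.4170
```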

And the derivative of $L$ with respect to any logit $z_k$ would be

$$\frac{\partial L}{\partial z_k} = q_k - p_k,$$

since $L = -\log q_c = -\left(z_c - \log \sum_{j} e^{z_j}\right)$, so $\frac{\partial L}{\partial z_k} = q_k - \mathbb{1}[k = c] = q_k - p_k$.
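A small sketch (example values assumed, using NumPy) that checks this analytic gradient $q - p$ against a central finite-difference approximation:

```python
# Sketch: verify dL/dz = q - p numerically (example z and one-hot p assumed).
import numpy as np

def loss(z, p):
    q = np.exp(z - np.max(z))
    q = q / q.sum()
    return -np.sum(p * np.log(q))

z = np.array([2.0, 1.0, 0.1])
p = np.array([1.0, 0.0, 0.0])          # one-hot true distribution

q = np.exp(z - np.max(z)); q = q / q.sum()
analytic = q - p                        # the derivative derived above

h = 1e-6
numeric = np.array([
    (loss(z + h * e, p) - loss(z - h * e, p)) / (2 * h)
    for e in np.eye(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```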