Batch normalization is a technique that normalizes the inputs of each layer to have zero mean and unit variance during training. This helps stabilize and accelerate training by reducing internal covariate shift, i.e., the change in the distribution of network activations caused by parameter updates in the preceding layers. Batch normalization allows the use of higher learning rates and makes training less sensitive to the initial weights.

Consider a batch of activations at some layer. To make each dimension unit Gaussian (with zero mean and unit variance), we just need to apply:

$\hat{x}^{(k)} = \dfrac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$
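
As a quick illustration, here is a minimal NumPy sketch of this per-dimension standardization (the array shape and random data are assumptions made for the example):

```python
import numpy as np

# Minimal sketch: standardize each feature dimension across the batch.
# x is assumed to have shape (batch_size, num_features).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 10))

x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0))

# Each column now has zero mean and unit variance (up to floating-point error).
print(x_hat.mean(axis=0).round(6), x_hat.var(axis=0).round(6))
```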

Now we can define the batch normalization layer:

  • Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_1, \ldots, x_m\}$; Parameters to be learned: $\gamma$, $\beta$
  • Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$
  • Algorithm:
    $\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} x_i$ (mini-batch mean)
    $\sigma_\mathcal{B}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ (mini-batch variance)
    $\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ (normalize)
    $y_i = \gamma \hat{x}_i + \beta$ (scale and shift)

Here $\epsilon$ is a small constant that prevents division by zero when normalizing the input features.
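
Putting the algorithm above into code, a minimal sketch of the forward pass could look like this (the function name, array shapes, and test values are assumptions for illustration; a complete layer would also track statistics for use at test time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass of batch normalization for a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

# With gamma = 1 and beta = 0 the layer simply standardizes each dimension;
# both are learned parameters, so the network can undo the normalization if needed.
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(4), y.var(axis=0).round(4))
```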

This layer can:

  • Improve gradient flow through the network
  • Allow higher learning rates
  • Reduce the strong dependence on initialization
  • Act as a form of regularization in a funny way, and slightly reduce the need for dropout, maybe