Batch normalization is a technique that normalizes the inputs of each layer to have zero mean and unit variance during training. This helps stabilize and accelerate training by reducing internal covariate shift, i.e., the change in the distribution of network activations caused by parameter updates in the preceding layers. Batch normalization allows the use of higher learning rates and makes training less sensitive to the initial weights.

Consider a batch of activations at some layer. To make each dimension unit Gaussian (with zero mean and unit variance), we just need to apply:

$\hat{x}^{(k)} = \dfrac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$
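
As a quick illustration, here is a minimal NumPy sketch of this per-dimension standardization (the array shape and random data are assumptions made for the example):

```python
import numpy as np

# Minimal sketch: standardize each feature dimension across the batch.
# x is assumed to have shape (batch_size, num_features).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 10))

x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0))

# Each column now has zero mean and unit variance (up to floating-point error).
print(x_hat.mean(axis=0).round(6), x_hat.var(axis=0).round(6))
```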

Now we can define the batch normalization layer:

  • Input: Values of $x$ over a mini-batch: $\mathcal{B} = \{x_1, \ldots, x_m\}$; Parameters to be learned: $\gamma$, $\beta$
  • Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$
  • Algorithm:
    $\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} x_i$ (mini-batch mean)
    $\sigma_\mathcal{B}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ (mini-batch variance)
    $\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ (normalize)
    $y_i = \gamma \hat{x}_i + \beta$ (scale and shift)

Here $\epsilon$ is a small constant that prevents division by zero when normalizing the input features.
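
Putting the algorithm above into code, a minimal sketch of the forward pass could look like this (the function name, array shapes, and test values are assumptions for illustration; a complete layer would also track statistics for use at test time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass of batch normalization for a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

# With gamma = 1 and beta = 0 the layer simply standardizes each dimension;
# both are learned parameters, so the network can undo the normalization if needed.
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(4), y.var(axis=0).round(4))
```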

This layer can:

  • Improve gradient flow through the network
  • Allow higher learning rates
  • Reduce the strong dependence on initialization
  • Act as a form of regularization in a funny way, and slightly reduce the need for dropout, maybe