In neural networks, Xavier initialization helps maintain a consistent variance of activations and gradients across the network's layers. This helps prevent gradients from vanishing or exploding as they propagate through the network during training.
Consider the scale distribution of an output $o_i$ for some fully connected layer without nonlinearities. With $n_\text{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output is given by

$$o_i = \sum_{j=1}^{n_\text{in}} w_{ij} x_j.$$
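As a minimal sketch of this setup (the layer sizes and the helper name `linear_output` are illustrative, not from the text), the full output vector is just a matrix-vector product:

```python
import numpy as np

def linear_output(W, x):
    """Fully connected layer without nonlinearity: o_i = sum_j w_ij * x_j."""
    return W @ x

n_in, n_out = 4, 3                 # illustrative layer sizes
W = np.random.randn(n_out, n_in)   # weights w_ij
x = np.random.randn(n_in)          # inputs x_j
o = linear_output(W, x)            # one output o_i per row of W
```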
The weights $w_{ij}$ are all drawn independently from the same distribution. Assume that this distribution has zero mean and variance $\sigma^2$, that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$, and that they are independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean of $o_i$:

$$E[o_i] = \sum_{j=1}^{n_\text{in}} E[w_{ij} x_j] = \sum_{j=1}^{n_\text{in}} E[w_{ij}]\, E[x_j] = 0,$$
and the variance:

$$\mathrm{Var}[o_i] = E[o_i^2] - (E[o_i])^2 = \sum_{j=1}^{n_\text{in}} E[w_{ij}^2 x_j^2] - 0 = n_\text{in} \sigma^2 \gamma^2.$$
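To make the result concrete, here is a small Monte Carlo sketch (with arbitrary illustrative values of $n_\text{in}$, $\sigma$, and $\gamma$) that estimates the mean and variance of $o_i$ by simulation and compares them with $0$ and $n_\text{in} \sigma^2 \gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, sigma, gamma = 100, 0.1, 1.0    # arbitrary illustrative values
trials = 50_000

# Fresh weights and inputs for every trial; each row gives one o_i.
w = rng.normal(0.0, sigma, size=(trials, n_in))
x = rng.normal(0.0, gamma, size=(trials, n_in))
o = (w * x).sum(axis=1)               # o_i = sum_j w_ij * x_j

print(o.mean())                               # ~0.0, matching E[o_i] = 0
print(o.var(), n_in * sigma**2 * gamma**2)    # empirical vs. predicted: ~1.0
```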
One way to keep the variance fixed is to set $n_\text{in} \sigma^2 = 1$. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the layers closer to the output. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless $n_\text{out} \sigma^2 = 1$, where $n_\text{out}$ is the number of outputs of this layer. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy

$$\frac{1}{2}(n_\text{in} + n_\text{out}) \sigma^2 = 1, \quad \text{or equivalently} \quad \sigma = \sqrt{\frac{2}{n_\text{in} + n_\text{out}}}.$$
This is the reasoning underlying the now-standard and practically beneficial Xavier initialization, named after Xavier Glorot, the first author of the paper that proposed it. Typically, Xavier initialization samples weights from a Gaussian distribution with zero mean and variance $\sigma^2 = \frac{2}{n_\text{in} + n_\text{out}}$.
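As a sketch, the Gaussian variant follows directly from the condition above; the function name `xavier_normal` and the layer sizes are illustrative assumptions, not part of the text:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    """Weights from N(0, sigma^2) with sigma = sqrt(2 / (n_in + n_out))."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_out, n_in))

W = xavier_normal(n_in=512, n_out=256)   # here sigma = sqrt(2/768) ≈ 0.051
```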
We can also adapt this to choose the variance when sampling weights from a uniform distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into our condition on $\sigma^2$ prompts us to initialize according to

$$U\!\left(-\sqrt{\frac{6}{n_\text{in} + n_\text{out}}},\ \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}\right).$$
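A corresponding sketch for the uniform case (again with illustrative names and sizes); in practice one would often reach for a framework built-in such as PyTorch's `torch.nn.init.xavier_uniform_` instead:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Weights from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = xavier_uniform(n_in=512, n_out=256)
# Sanity check: Var[U(-a, a)] = a^2 / 3 = 2 / (n_in + n_out).
print(W.var(), 2.0 / (512 + 256))
```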