In neural networks, Xavier initialization helps maintain a consistent variance of activations and gradients across the network's layers. This prevents gradients from vanishing or exploding as they propagate through the network during training.

Consider the scale of the distribution of an output $o_i$ for some fully connected layer without nonlinearities. With $n_\mathrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output is given by

$$o_i = \sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j.$$
The weights $w_{ij}$ are all drawn independently from the same distribution. Assume that this distribution has zero mean and variance $\sigma^2$, that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$, and that they are independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean of $o_i$:

$$E[o_i] = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij} x_j] = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}] E[x_j] = 0,$$
and the variance:

$$\mathrm{Var}[o_i] = E[o_i^2] - (E[o_i])^2 = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}^2 x_j^2] - 0 = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}^2] E[x_j^2] = n_\mathrm{in} \sigma^2 \gamma^2.$$
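As a sanity check, here is a minimal simulation sketch (not part of the original derivation; the sizes and scales below are arbitrary illustrative choices) that draws weights and inputs as assumed above and compares the empirical mean and variance of $o_i$ against the predictions $0$ and $n_\mathrm{in} \sigma^2 \gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, sigma, gamma = 256, 0.1, 1.0   # hypothetical layer width and scales
num_trials = 100_000

# Zero-mean inputs with variance gamma^2 and weights with variance sigma^2.
x = rng.normal(0.0, gamma, size=(num_trials, n_in))
w = rng.normal(0.0, sigma, size=(num_trials, n_in))
o = (w * x).sum(axis=1)              # o_i = sum_j w_ij * x_j

print(f"empirical mean    : {o.mean():+.4f}  (predicted 0)")
print(f"empirical variance: {o.var():.4f}  "
      f"(predicted {n_in * sigma**2 * gamma**2:.4f})")
```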
One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the layers closer to the output. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless $n_\mathrm{out} \sigma^2 = 1$, where $n_\mathrm{out}$ is the number of outputs of this layer. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy:

$$\frac{1}{2}(n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \quad \text{or equivalently} \quad \sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}.$$
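As an illustrative numeric instance (the layer widths here are hypothetical), a layer with $n_\mathrm{in} = 64$ and $n_\mathrm{out} = 32$ would use

$$\sigma = \sqrt{\frac{2}{64 + 32}} = \sqrt{\frac{1}{48}} \approx 0.144.$$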
This is the reasoning underlying the now-standard and practically beneficial Xavier initialization, named after the first author of the paper that proposed it (Glorot and Bengio, 2010). Typically, Xavier initialization samples weights from a Gaussian distribution with zero mean and variance $\sigma^2 = 2/(n_\mathrm{in} + n_\mathrm{out})$.
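This initialization is built into common frameworks. A minimal PyTorch sketch (the layer sizes are arbitrary choices for illustration) using `torch.nn.init.xavier_normal_`, comparing the empirical standard deviation against the predicted $\sqrt{2/(n_\mathrm{in} + n_\mathrm{out})}$:

```python
import torch

torch.manual_seed(0)
n_in, n_out = 512, 256               # hypothetical layer sizes
w = torch.empty(n_out, n_in)         # fan_in = n_in, fan_out = n_out

# Gaussian Xavier: zero mean, variance 2 / (n_in + n_out).
torch.nn.init.xavier_normal_(w)

print(f"empirical std  : {w.std().item():.4f}")
print(f"theoretical std: {(2 / (n_in + n_out)) ** 0.5:.4f}")
```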

We can also adapt this to choose the variance when sampling weights from a uniform distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into our condition on $\sigma^2$ yields $a = \sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}$, prompting us to initialize according to

$$U\left(-\sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}},\; \sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}\right).$$
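A companion sketch for the uniform variant (again with hypothetical layer sizes), using `torch.nn.init.xavier_uniform_` and checking that the largest sampled magnitude stays below the bound $\sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}$ while the variance matches $2/(n_\mathrm{in} + n_\mathrm{out})$:

```python
import torch

torch.manual_seed(0)
n_in, n_out = 512, 256               # hypothetical layer sizes
w = torch.empty(n_out, n_in)

# Uniform Xavier: samples from U(-a, a) with a = sqrt(6 / (n_in + n_out)).
torch.nn.init.xavier_uniform_(w)

bound = (6 / (n_in + n_out)) ** 0.5
print(f"max |w|        : {w.abs().max().item():.4f}  (bound {bound:.4f})")
print(f"empirical var  : {w.var().item():.6f}  "
      f"(predicted {2 / (n_in + n_out):.6f})")
```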