Progressive Distillation Method

Idea

The core idea is a distillation process in which a pre-trained "teacher" model that needs many sampling steps (up to 8192) is used to train a "student" model that needs far fewer. Distillation is applied progressively: each iteration halves the number of required sampling steps, and the resulting student becomes the teacher for the next iteration. This yields models that can generate samples in as few as 4 steps while still maintaining high perceptual quality.
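As a rough illustration of the halving described above, the sketch below lists the step counts visited when going from 8192 down to 4 steps. The function name `halving_schedule` is hypothetical and not taken from the paper; it only does the arithmetic.

```python
def halving_schedule(teacher_steps: int, final_steps: int) -> list:
    """Step counts visited when the sampling budget is halved every round."""
    if final_steps < 1 or teacher_steps < final_steps:
        raise ValueError("need teacher_steps >= final_steps >= 1")
    schedule = [teacher_steps]
    while schedule[-1] > final_steps:
        if schedule[-1] % 2 != 0:
            raise ValueError("step count must stay even to be halved")
        schedule.append(schedule[-1] // 2)
    return schedule


schedule = halving_schedule(8192, 4)
num_rounds = len(schedule) - 1  # one distillation round per halving
print(schedule)
print(num_rounds)
```

Since 8192 = 2^13 and 4 = 2^2, the loop runs 11 halvings, so 11 distillation rounds separate the two models under this assumption of exact halving.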

Method

Require:

Trained teacher model $\hat{x}_{η}(z_{t})$
Data set $D$
Loss weight function $w()$
Student sampling steps $N$

For $K$ iterations do

$θ \leftarrow η$ ⟶ Init student from teacher
While not converged do
- $x \sim D$
- $t = i/N, \; i \sim Cat[1, 2, \dots, N]$
- $ϵ \sim N(0, I)$
- $z_{t} = α_{t} x + σ_{t} ϵ$
- 2 steps of DDIM with teacher:
- $t' = t - 0.5/N, \; t'' = t - 1/N$
- $z_{t'} = α_{t'} \hat{x}_{η}(z_{t}) + \frac{σ_{t'}}{σ_{t}} (z_{t} - α_{t} \hat{x}_{η}(z_{t}))$
- $z_{t''} = α_{t''} \hat{x}_{η}(z_{t'}) + \frac{σ_{t''}}{σ_{t'}} (z_{t'} - α_{t'} \hat{x}_{η}(z_{t'}))$
- $\hat{x} = \frac{z_{t''} - (σ_{t''}/σ_{t}) z_{t}}{α_{t''} - (σ_{t''}/σ_{t}) α_{t}}$ ⟶ Teacher $\hat{x}$ target
- $λ_{t} = \log(α_{t}^{2} / σ_{t}^{2})$
- $L_{θ} = w(λ_{t}) \, ∥ \hat{x} - \hat{x}_{θ}(z_{t}) ∥_{2}^{2}$
- $θ \leftarrow θ - γ \nabla_{θ} L_{θ}$
End while
$η \leftarrow θ$ ⟶ Student becomes next teacher
$N \leftarrow N/2$ ⟶ Halve number of sampling steps
End for

Here the target $\hat{x}$ is chosen so that a single DDIM step of the student matches two DDIM steps of the teacher. One deterministic (zero-variance) DDIM step from $z_{t}$ to time $t''$, using $\hat{x}$ as the predicted clean sample, gives $α_{t''} \hat{x} + \frac{σ_{t''}}{σ_{t}} (z_{t} - α_{t} \hat{x})$. Setting this equal to $z_{t''}$, the teacher's result after $2$ DDIM steps with the mean determined by $\hat{x}_{η}$, and solving for $\hat{x}$ gives

$$
\hat{x} = \frac{z_{t''} - (σ_{t''}/σ_{t}) z_{t}}{α_{t''} - (σ_{t''}/σ_{t}) α_{t}}
$$
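The target construction can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a cosine schedule $α_{t} = \cos(πt/2)$, $σ_{t} = \sin(πt/2)$, and `toy_teacher` is a hypothetical stand-in for a trained teacher. It runs two teacher DDIM steps, forms the target, and confirms that one student-sized DDIM step using that target lands on the same point.

```python
import numpy as np


def alpha_sigma(t):
    """Assumed cosine schedule (an illustration choice, not fixed by the notes)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def toy_teacher(z, t):
    """Hypothetical x-prediction; any deterministic function works for the check."""
    return np.tanh(z)


def ddim_step(z_t, x_hat, t, s):
    """Deterministic DDIM update from time t to time s given a clean-sample guess."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    return alpha_s * x_hat + (sigma_s / sigma_t) * (z_t - alpha_t * x_hat)


def teacher_target(z_t, t, n_steps, teacher=toy_teacher):
    """Two teacher DDIM steps, then the x target that one student step must hit."""
    t_mid = t - 0.5 / n_steps
    t_end = t - 1.0 / n_steps
    z_mid = ddim_step(z_t, teacher(z_t, t), t, t_mid)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), t_mid, t_end)
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_end, sigma_end = alpha_sigma(t_end)
    ratio = sigma_end / sigma_t
    x_target = (z_end - ratio * z_t) / (alpha_end - ratio * alpha_t)
    return x_target, z_end, t_end


rng = np.random.default_rng(0)
n_steps, i = 4, 3                      # student steps N and sampled index i
t = i / n_steps
alpha_t, sigma_t = alpha_sigma(t)
z_t = alpha_t * rng.standard_normal(5) + sigma_t * rng.standard_normal(5)

x_target, z_end, t_end = teacher_target(z_t, t, n_steps)
one_step = ddim_step(z_t, x_target, t, t_end)   # a single student-sized step
print(np.max(np.abs(one_step - z_end)))
```

The printed difference should be at floating-point noise level, which is what "one student step equals two teacher steps" means for a student that predicts the target exactly.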

New parameterization method: v-prediction

Idea

The authors propose v-prediction, an alternative parameterization in which the network predicts a velocity variable $v$ instead of directly predicting the noise $ϵ$. This gives more stable training dynamics, particularly when distilling diffusion models to fewer steps. The relationship between $v$ and $ϵ$ rests on the following key ideas:

$v$ is a transformation of both the original data and the noise.
By predicting $v$ , the model can smoothly transition between steps, allowing it to interpolate between predicting the noise $ϵ$ directly and predicting the clean data.

Formulation

For two times $s, t \in [0, 1]$, we want to update from $t$ to $s$ ($t > s$), and the update rule given by DDIM is

$$
z_{s} = α_{s} x_{s} + σ_{s} ϵ_{s} = α_{s} \hat{x}_{θ}(z_{t}) + σ_{s} \frac{z_{t} - α_{t} \hat{x}_{θ}(z_{t})}{σ_{t}}
$$

We can simplify the DDIM update rule by expressing it in terms of $ϕ_{t} = \arctan(σ_{t} / α_{t})$. Assuming a variance-preserving diffusion process, we have $α_{ϕ} = \cos(ϕ), σ_{ϕ} = \sin(ϕ)$, and hence

$$
z_{ϕ} = \cos(ϕ) \, x + \sin(ϕ) \, ϵ
$$

We then define the velocity of $z_{ϕ}$ as

$$
v_{ϕ} = \frac{d z_{ϕ}}{d ϕ} = \cos(ϕ) \, ϵ - \sin(ϕ) \, x
$$
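The velocity formula can be sanity-checked with a central finite difference. This is only an illustrative NumPy sketch; the sample vectors, the angle `phi`, and the step `h` are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)     # stand-in for a clean data sample
eps = rng.standard_normal(4)   # stand-in for Gaussian noise


def z(phi):
    """Noisy latent on the variance-preserving circle."""
    return np.cos(phi) * x + np.sin(phi) * eps


def v(phi):
    """Closed-form velocity dz/dphi."""
    return np.cos(phi) * eps - np.sin(phi) * x


phi, h = 0.7, 1e-5
finite_diff = (z(phi + h) - z(phi - h)) / (2 * h)   # numerical derivative
max_err = np.max(np.abs(finite_diff - v(phi)))
print(max_err)
```

The error should be tiny, consistent with $v_{ϕ}$ being the exact derivative of $z_{ϕ}$ with respect to $ϕ$.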

Furthermore, we define the predicted velocity as

$$
\hat{v}_{θ}(z_{ϕ}) = \cos(ϕ) \, \hat{ϵ}_{θ}(z_{ϕ}) - \sin(ϕ) \, \hat{x}_{θ}(z_{ϕ})
$$

where $\hat{ϵ}_{θ}(z_{ϕ}) = (z_{ϕ} - \cos(ϕ) \, \hat{x}_{θ}(z_{ϕ})) / \sin(ϕ)$. And now we have

$$
\begin{aligned}
z_{ϕ_{s}} &= \cos(ϕ_{s}) \, \hat{x}_{θ}(z_{ϕ_{t}}) + \sin(ϕ_{s}) \, \hat{ϵ}_{θ}(z_{ϕ_{t}}) \\
&= \cos(ϕ_{s}) \left( \cos(ϕ_{t}) \, z_{ϕ_{t}} - \sin(ϕ_{t}) \, \hat{v}_{θ}(z_{ϕ_{t}}) \right) + \sin(ϕ_{s}) \left( \sin(ϕ_{t}) \, z_{ϕ_{t}} + \cos(ϕ_{t}) \, \hat{v}_{θ}(z_{ϕ_{t}}) \right) \\
&= \cos(ϕ_{s} - ϕ_{t}) \, z_{ϕ_{t}} + \sin(ϕ_{s} - ϕ_{t}) \, \hat{v}_{θ}(z_{ϕ_{t}})
\end{aligned}
$$

Viewed from this perspective, DDIM evolves $z_{ϕ_{s}}$ by moving it on a circle in the $(z_{ϕ_{t}}, \hat{v}_{ϕ_{t}})$ basis, along the $-\hat{v}_{ϕ_{t}}$ direction, gradually denoising towards $x$.
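The equivalence between the standard DDIM update and the rotation form can be verified numerically. The sketch below is an illustration under the variance-preserving assumption $α = \cos(ϕ)$, $σ = \sin(ϕ)$; the helper names are hypothetical and `x_hat` is an arbitrary vector standing in for a network output.

```python
import numpy as np


def v_from_x(z_t, x_hat, phi_t):
    """Predicted velocity implied by an x-prediction at angle phi_t."""
    eps_hat = (z_t - np.cos(phi_t) * x_hat) / np.sin(phi_t)
    return np.cos(phi_t) * eps_hat - np.sin(phi_t) * x_hat


def ddim_step_x(z_t, x_hat, phi_t, phi_s):
    """Standard DDIM update written with alpha = cos(phi), sigma = sin(phi)."""
    eps_hat = (z_t - np.cos(phi_t) * x_hat) / np.sin(phi_t)
    return np.cos(phi_s) * x_hat + np.sin(phi_s) * eps_hat


def ddim_step_v(z_t, v_hat, phi_t, phi_s):
    """Same update as a rotation by (phi_s - phi_t) in the (z, v_hat) plane."""
    delta = phi_s - phi_t
    return np.cos(delta) * z_t + np.sin(delta) * v_hat


rng = np.random.default_rng(0)
z_t = rng.standard_normal(6)
x_hat = rng.standard_normal(6)   # arbitrary stand-in for a model prediction
phi_t, phi_s = 1.1, 0.6          # phi_t > phi_s, i.e. stepping towards the data

v_hat = v_from_x(z_t, x_hat, phi_t)
z_s_x = ddim_step_x(z_t, x_hat, phi_t, phi_s)
z_s_v = ddim_step_v(z_t, v_hat, phi_t, phi_s)

# Rotating all the way to phi = 0 recovers the predicted clean sample.
x_rec = ddim_step_v(z_t, v_hat, phi_t, 0.0)
print(np.max(np.abs(z_s_x - z_s_v)), np.max(np.abs(x_rec - x_hat)))
```

Both printed differences should be at floating-point noise level: the two update forms agree, and a full rotation to $ϕ = 0$ returns $\hat{x}_{θ}$.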


Progressive Distillation for Fast Sampling of Diffusion Models
