Direct Guidance

In DDPM, we focus on modeling just the data distribution $p(x)$. However, we are often also interested in learning the conditional distribution $p(x \mid y)$, which would enable us to explicitly control the data we generate through conditioning information $y$.

A natural way to add conditioning information is to feed it in alongside the timestep information at each iteration. Recall that the joint distribution $p(x_{0:T})$ of the reverse process can be derived from the product of transition distributions

$$p(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

We can simply add arbitrary conditioning information $y$ at each transition step:

$$p(x_{0:T} \mid y) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y)$$

where $y$ could be a text encoding in image-text generation, or a low-resolution image to perform super-resolution on. Now we can learn the core neural network of a DDPM as before, with $y$ passed into the noise-prediction network alongside $x_t$ and $t$.
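
As a minimal sketch (assuming `model`, `alpha_bar`, and `T` are defined as in a standard DDPM implementation; the names here are hypothetical), one conditional training step differs from the unconditional case only in that $y$ is passed to the network alongside $t$:

import torch
import torch.nn.functional as F

x0 = ...   # a batch of clean training images, shape (B, C, H, W)
y = ...    # the paired conditioning information (e.g. class labels or text embeddings)

# Sample a random timestep for each image in the batch
t = torch.randint(0, T, (x0.shape[0],))

# Forward diffusion: noise x_0 to x_t using the cumulative schedule alpha_bar
noise = torch.randn_like(x0)
a_bar = alpha_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# The only change from unconditional DDPM: the network also consumes y
noise_pred = model(x_t, t, y)

# Same epsilon-prediction objective as before
loss = F.mse_loss(noise_pred, noise)
loss.backward()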

However, a caveat of this vanilla formulation is that a conditional diffusion model trained this way may learn to ignore or downplay the given conditioning information. Guidance is therefore proposed as a way to more explicitly control how much weight the model gives to the conditioning information, at the cost of sample diversity.

Classifier Guidance

Start with the score-based formulation of a diffusion model, where our goal is to learn the conditional score $\nabla_{x_t} \log p(x_t \mid y)$. By Bayes' rule, we can derive

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log \frac{p(x_t)\, p(y \mid x_t)}{p(y)} = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$$

where the $p(y)$ term drops out because it does not depend on $x_t$.

Therefore, in Classifier Guidance, the score of an unconditional diffusion model is learned as previously derived, alongside a classifier that takes in an arbitrarily noisy $x_t$ and attempts to predict the conditioning information $y$. Then, during the sampling procedure, the overall conditional score function used for annealed Langevin dynamics is computed as the sum of the unconditional score function and the adversarial gradient of the noisy classifier:

$$\nabla_{x_t} \log p(x_t \mid y) = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score}} + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{adversarial gradient}}$$

To introduce fine-grained control that either encourages or discourages the model from considering the conditioning information, we can scale the adversarial gradient of the noisy classifier by a hyper-parameter $\gamma$:

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \gamma\, \nabla_{x_t} \log p(y \mid x_t)$$

The higher $\gamma$ is, the more strongly the generated samples adhere to the conditioning information, which comes at the cost of sample diversity.

Here is pseudo-code for classifier guidance:

# Load the pre-trained unconditional diffusion model (a U-Net noise predictor)
# and its noise scheduler
model = ...
scheduler = ...

# Load a pre-trained image classification model; it must have been trained on
# noisy images so that its gradients are meaningful at every noise level
classifier_model = ...
 
# We want to generate an image of class 1; assume class 1 corresponds to the "cat" category
y = 1  
 
# Controls the strength of the class guidance (the hyper-parameter gamma above); higher is stronger
guidance_scale = 7.5  
 
# Randomly draw noise with the same shape as the output image from a Gaussian distribution  
input = get_noise(...) 
 
# Each step denoises the input
for t in tqdm(scheduler.timesteps):  
 
    # Use the U-Net to predict the noise (the first term in the equation above,
    # since the score can be represented by the noise added from x_0 to x_t)
    with torch.no_grad():  
        noise_pred = model(input, t).sample  
  
    # Classifier guidance step: pass the noisy input into the classifier and
    # compute the gradient of log p(y | x_t) with respect to the input
    class_guidance = classifier_model.get_class_guidance(input, y)

    # Apply the scaled gradient. The subtraction converts the score-space update
    # into the noise-prediction convention: noise_pred is proportional to the
    # negative score, and the sqrt(1 - alpha_bar_t) factor is folded into guidance_scale
    noise_pred = noise_pred - guidance_scale * class_guidance
  
    # Calculate x_{t-1} using the updated noise  
    input = scheduler.step(noise_pred, t, input).prev_sample
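
The `get_class_guidance` call above is the only non-standard piece. One plausible implementation (a sketch; it assumes `classifier_model` maps a noisy image to class logits) computes $\nabla_{x_t} \log p(y \mid x_t)$ with autograd:

import torch

def get_class_guidance(classifier_model, input, y):
    # Track gradients with respect to the noisy image
    x = input.detach().requires_grad_(True)

    # Class logits for the noisy image; the classifier must have been
    # trained on noisy images for these to be meaningful
    logits = classifier_model(x)
    log_probs = torch.log_softmax(logits, dim=-1)

    # log p(y | x_t) for the target class, summed over the batch
    selected = log_probs[torch.arange(x.shape[0]), y].sum()

    # d log p(y | x_t) / d x_t
    return torch.autograd.grad(selected, x)[0]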

Classifier Guidance can only steer generation toward the categories the classification model was trained on. If the classifier distinguishes 10 classes, then Classifier Guidance can only guide the diffusion model to generate those fixed 10 classes.

To address this limitation, see Classifier-Free Diffusion Guidance.