Large Guidance Weight Samplers

In classifier-free guidance, increasing the guidance weight improves image-text alignment but damages image fidelity, producing highly saturated and unnatural images. This is due to a train-test mismatch that arises at high guidance weights.
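
The passage does not spell out how the guidance weight enters the sampler. As a minimal sketch, assuming the commonly used form in which the guided prediction is w * conditional + (1 - w) * unconditional, the effect of a large weight can be seen on a toy vector. The function and variable names are illustrative and do not come from the source.

```python
import numpy as np

def guided_prediction(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional predictions with guidance weight w.

    Assumed form: w * eps_cond + (1 - w) * eps_uncond.
    w = 1 disables guidance; w > 1 extrapolates away from the unconditional prediction.
    """
    return w * eps_cond + (1.0 - w) * eps_uncond

# Toy predictions (hypothetical values, not model outputs).
eps_cond = np.array([0.5, -0.2])
eps_uncond = np.array([0.1, 0.1])

no_guidance = guided_prediction(eps_cond, eps_uncond, 1.0)  # equals eps_cond
strong = guided_prediction(eps_cond, eps_uncond, 5.0)       # approximately [2.1, -1.4]
```

With w = 1 the conditional prediction is returned unchanged. At w = 5 the result lies well outside the range spanned by the two inputs, which illustrates how large weights can push predictions beyond the values seen in training.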

At each sampling step $t$, the $x$-prediction $\hat{x}_0^t$ must be within the same bounds as the training data $x$, i.e. within $[-1, 1]$, but empirically, high guidance weights cause $x$-predictions to exceed these bounds.

Static Thresholding

Elementwise clipping of the $x$-prediction to $[-1, 1]$.
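
A minimal sketch of static thresholding, assuming the prediction is held in a NumPy array and the training data range is [-1, 1]. The function name and the sample values are illustrative.

```python
import numpy as np

def static_threshold(x0_pred):
    """Clip every element of the x-prediction to the training range [-1, 1]."""
    return np.clip(x0_pred, -1.0, 1.0)

# Hypothetical prediction with two out-of-range values.
x0_pred = np.array([-2.5, -0.3, 0.8, 1.7])
clipped = static_threshold(x0_pred)  # [-1.0, -0.3, 0.8, 1.0]
```

Values already inside the range pass through untouched; only the out-of-range entries are moved to the boundary.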

Dynamic Thresholding

At each sampling step, set $s$ to a certain percentile of the absolute pixel values in $\hat{x}_0^t$; if $s > 1$, threshold $\hat{x}_0^t$ to the range $[-s, s]$ and then divide by $s$.
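
The rule above can be sketched as follows for a single image. The percentile is a hyperparameter that the text leaves open; the default of 99.5 used here is an assumption, as are the function name and the sample values.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Dynamic thresholding of one x-prediction.

    s is the chosen percentile of the absolute values. If s > 1 the
    prediction is clipped to [-s, s] and divided by s; otherwise s is
    treated as 1, which leaves in-range predictions unchanged.
    """
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)  # only rescale when s exceeds 1
    return np.clip(x0_pred, -s, s) / s

# Out-of-range prediction: with percentile=100, s is the max absolute value (3.0).
rescaled = dynamic_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), percentile=100)

# In-range prediction: s falls back to 1, so nothing changes.
unchanged = dynamic_threshold(np.array([-0.5, 0.2]))
```

Unlike static clipping, the interior values of an out-of-range prediction are scaled down along with the extremes, so the relative contrast between pixels is kept rather than flattened at the boundary.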

Robust Cascaded Diffusion Models

Imagen starts with a 64x64 image generated by a base model; two super-resolution diffusion models are then used to progressively upscale this image:

  • First step: Upsampling the 64x64 image to 256x256.
  • Second step: Further upsampling the 256x256 image to 1024x1024.

Each super-resolution model is conditioned on both the low-resolution image and the text, and generates a higher-resolution image while preserving quality and coherence with the text description.
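
To make the data flow of the cascade concrete, here is a shape-only sketch. The function placeholder_super_resolution is a hypothetical stand-in that performs plain nearest-neighbour upsampling and ignores the text; it is not a diffusion model and is used only to show the 64 to 256 to 1024 progression.

```python
import numpy as np

def placeholder_super_resolution(image, text_embedding, factor=4):
    """Stand-in for a text-conditional super-resolution diffusion model.

    Only repeats pixels along height and width. The text_embedding
    argument is accepted to mirror the conditioning interface but is unused.
    """
    upsampled = np.repeat(image, factor, axis=0)
    return np.repeat(upsampled, factor, axis=1)

text_embedding = np.zeros(8, dtype=np.float32)   # hypothetical text conditioning
base = np.zeros((64, 64, 3), dtype=np.float32)   # output of the base model

stage1 = placeholder_super_resolution(base, text_embedding)    # 256x256
stage2 = placeholder_super_resolution(stage1, text_embedding)  # 1024x1024
```

Each stage increases the side length by a factor of 4, and the same text conditioning is passed to every stage.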

Noise Conditioning Augmentation

A key aspect that enhances the performance of super-resolution models is noise conditioning augmentation.

  • The noise level added to the image is tracked and fed into the super-resolution models as part of the conditioning.
  • This helps the models handle any artifacts or imperfections introduced in the earlier lower-resolution stages.

Augmentation with Gaussian Noise

Imagen corrupts the low-resolution images (such as the 64x64 input to the first super-resolution stage) with Gaussian noise during training, simulating the imperfect inputs that the super-resolution model will receive at inference time.

The noise is controlled using a parameter called aug_level, which specifies the strength of the noise added.

  • aug_level defines how much corruption (noise) is applied to the image.
  • In training, the value of aug_level is chosen randomly, which helps the model generalize across different noise conditions.
  • During inference, different values of aug_level are tried to determine which produces the best image quality.
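
The training and inference use of aug_level can be sketched as below. The exact corruption formula is not given in the text, so this sketch assumes a variance-preserving blend with aug_level in [0, 1], where 0 means no noise and 1 means pure noise. The uniform sampling range, the sweep values, and the helper name apply_aug are likewise assumptions.

```python
import numpy as np

def apply_aug(x_lr, aug_level, rng):
    """Corrupt a low-resolution image with Gaussian noise.

    Assumed form: sqrt(1 - aug_level) * x_lr + sqrt(aug_level) * noise.
    """
    noise = rng.standard_normal(x_lr.shape)
    return np.sqrt(1.0 - aug_level) * x_lr + np.sqrt(aug_level) * noise

rng = np.random.default_rng(0)
x_lr = np.zeros((64, 64, 3))  # placeholder low-resolution image

# Training: draw a random level, corrupt the image, and keep the level
# so that it can be passed to the super-resolution model as conditioning.
train_level = rng.uniform(0.0, 1.0)
x_lr_train = apply_aug(x_lr, train_level, rng)
conditioning = (x_lr_train, train_level)

# Inference: try several fixed levels; each result would be fed to the
# model and the outputs compared for image quality.
sweep = {level: apply_aug(x_lr, level, rng) for level in (0.0, 0.1, 0.3)}
```

Because the model sees the level alongside the corrupted image, it can learn how much to trust its low-resolution input at each setting.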