Large Guidance Weight Samplers
In classifier-free guidance, increasing the guidance weight improves image-text alignment but damages image fidelity, producing highly saturated and unnatural images. This is due to a train-test mismatch arising from high guidance weights.
At each sampling step $t$, the $x$-prediction $\hat{x}_0^t$ must be within the same bounds as the training data $x$, i.e. within $[-1, 1]$, but empirically, high guidance weights cause $x$-predictions to exceed these bounds.
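To make the mismatch concrete, here is a minimal NumPy sketch. It assumes the usual epsilon-parameterization, in which the guided noise estimate is `w * eps_cond + (1 - w) * eps_uncond` and the x-prediction is recovered by inverting `z_t = alpha_t * x + sigma_t * eps`. The function names and all numeric values are illustrative assumptions, not taken from any Imagen code.

```python
import numpy as np

def guided_epsilon(eps_cond, eps_uncond, w):
    """Classifier-free guidance: w = 1 disables guidance, w > 1 strengthens it."""
    return w * eps_cond + (1.0 - w) * eps_uncond

def x_prediction(z_t, eps, alpha_t, sigma_t):
    """Invert z_t = alpha_t * x + sigma_t * eps to get the implied clean image."""
    return (z_t - sigma_t * eps) / alpha_t

# Toy 4-pixel "image"; every value here is made up for illustration.
z_t = np.array([0.5, -0.3, 0.8, -0.9])
eps_cond = np.array([0.2, -0.1, 0.4, -0.5])
eps_uncond = np.array([0.6, 0.3, -0.2, 0.1])
alpha_t, sigma_t = 0.8, 0.6  # chosen so that alpha_t**2 + sigma_t**2 == 1

for w in (1.0, 10.0):
    eps = guided_epsilon(eps_cond, eps_uncond, w)
    x0 = x_prediction(z_t, eps, alpha_t, sigma_t)
    print(f"w={w}: max |x0| = {np.abs(x0).max():.2f}")
```

With these toy numbers the x-prediction stays inside [-1, 1] at `w = 1` and leaves that range at `w = 10`, which is the situation the thresholding methods below are meant to handle.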
Static Thresholding
Elementwise clipping of the $x$-prediction to $[-1, 1]$.
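As a rough illustration, static thresholding amounts to a single clip to the training-data range [-1, 1]. The function name `static_threshold` is a placeholder chosen for this sketch.

```python
import numpy as np

def static_threshold(x0):
    """Clip every element of the x-prediction to the range [-1, 1]."""
    return np.clip(x0, -1.0, 1.0)

# Out-of-range values are pinned to the nearest bound; in-range values pass through.
x0 = np.array([3.0, -2.0, 0.4])
print(static_threshold(x0))
```

Note that every out-of-range pixel ends up exactly at -1 or 1, so a heavily guided prediction can still come out with many saturated pixels.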
Dynamic Thresholding
At each sampling step, set $s$ to a certain percentile absolute pixel value in $\hat{x}_0^t$, and if $s > 1$, then threshold $\hat{x}_0^t$ to the range $[-s, s]$ and then divide by $s$.
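The procedure can be sketched in a few lines of NumPy. This is a minimal version that assumes a batch of x-predictions shaped `(batch, ...)` and computes one threshold per sample. The default percentile of 99.5 is an assumption for illustration, not a value stated in the text above.

```python
import numpy as np

def dynamic_threshold(x0, percentile=99.5):
    """Dynamic thresholding for a batch of x-predictions shaped (batch, ...)."""
    # s: per-sample percentile of the absolute pixel values.
    reduce_axes = tuple(range(1, x0.ndim))
    s = np.percentile(np.abs(x0), percentile, axis=reduce_axes, keepdims=True)
    # Taking max(s, 1) means nothing is rescaled unless s > 1;
    # when s <= 1 this reduces to a plain clip to [-1, 1].
    s = np.maximum(s, 1.0)
    # Clip to [-s, s], then divide by s so the result lies in [-1, 1].
    return np.clip(x0, -s, s) / s

# percentile=100 makes s the largest absolute value, which is easy to check by hand.
x0 = np.array([[0.5, -2.0, 4.0, 1.0]])
print(dynamic_threshold(x0, percentile=100.0))
```

Unlike a static clip, the division by `s` pulls the whole prediction back toward the valid range instead of pinning only the outliers to the bounds.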
Robust Cascaded Diffusion Models
Imagen starts with a 64x64 image generated by a base model. Two super-resolution diffusion models then progressively upscale this image:
- First step: Upsampling the 64x64 image to 256x256.
- Second step: Further upsampling the 256x256 image to 1024x1024.
These super-resolution models take the low-resolution image and use text-conditional inputs to generate higher resolutions while preserving quality and coherence with the text description.
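The data flow of the cascade can be sketched as below. Both `base_model` and `super_resolve` are hypothetical stand-ins: a real stage would run a text-conditional diffusion sampler, whereas these stubs only reproduce the tensor shapes so the chaining is visible.

```python
import numpy as np

def base_model(text_embedding, rng):
    """Hypothetical stand-in for the 64x64 text-to-image base model."""
    return rng.uniform(-1.0, 1.0, size=(64, 64, 3)).astype(np.float32)

def super_resolve(low_res, text_embedding, factor):
    """Hypothetical stand-in for a text-conditional super-resolution model.

    Nearest-neighbour upsampling is used only to reproduce the output shape;
    the text embedding is accepted but ignored by this stub.
    """
    return low_res.repeat(factor, axis=0).repeat(factor, axis=1)

def generate(text_embedding, seed=0):
    """Chain the three stages: 64x64 -> 256x256 -> 1024x1024."""
    rng = np.random.default_rng(seed)
    image_64 = base_model(text_embedding, rng)
    image_256 = super_resolve(image_64, text_embedding, factor=4)
    image_1024 = super_resolve(image_256, text_embedding, factor=4)
    return image_64, image_256, image_1024

text_embedding = np.zeros(8, dtype=np.float32)  # placeholder for a text encoder output
for image in generate(text_embedding):
    print(image.shape)
```

Each stage receives the same text embedding, which mirrors the point above that the super-resolution models are conditioned on text as well as on the low-resolution image.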
Noise Conditioning Augmentation
A key aspect that enhances the performance of super-resolution models is noise conditioning augmentation.
- The noise level added to the image is tracked and fed into the super-resolution models as part of the conditioning.
- This helps the models handle any artifacts or imperfections introduced in the earlier lower-resolution stages.
Augmentation with Gaussian Noise
Imagen applies Gaussian noise to the low-resolution images (such as the 64x64 base output) to corrupt them during training, simulating the imperfect low-resolution inputs that the super-resolution model will receive at sampling time.
The noise is controlled using a parameter called `aug_level`, which specifies the strength of the noise added.
- `aug_level`: It defines how much corruption (or noise) is applied to the image.
- In training, the value of `aug_level` is chosen randomly, which helps the model generalize across different noise conditions.
- During inference, different levels of `aug_level` are tested to determine which produces the best image quality.
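The training and inference behaviour described above might look roughly like the following sketch. Several details are assumptions made for illustration: `aug_level` is taken to lie in [0, 1], the corruption is a variance-preserving mix of image and Gaussian noise, and `sample_fn` is a hypothetical super-resolution sampler that accepts the corrupted image together with its `aug_level`.

```python
import numpy as np

def apply_aug(low_res, aug_level, rng):
    """Corrupt a low-resolution image with Gaussian noise of strength aug_level.

    Assumption: aug_level = 0 returns the image unchanged and
    aug_level = 1 returns pure noise.
    """
    noise = rng.standard_normal(low_res.shape)
    return np.sqrt(1.0 - aug_level ** 2) * low_res + aug_level * noise

def training_inputs(low_res, rng):
    """Training: draw aug_level at random; return it with the corrupted image."""
    aug_level = rng.uniform(0.0, 1.0)
    # Both values would be passed to the model, so it knows how noisy its input is.
    return apply_aug(low_res, aug_level, rng), aug_level

def inference_sweep(low_res, sample_fn, rng, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Inference: try several fixed aug_level values, one output per level."""
    return {level: sample_fn(apply_aug(low_res, level, rng), level) for level in levels}

rng = np.random.default_rng(0)
low_res = np.zeros((64, 64, 3))
corrupted, aug_level = training_inputs(low_res, rng)
# The identity lambda stands in for a real sampler.
outputs = inference_sweep(low_res, lambda image, level: image, rng)
print(corrupted.shape, sorted(outputs))
```

In practice the outputs of the sweep would be compared, by a metric or by eye, to pick the `aug_level` that gives the best image quality.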