CLIP Guidance
Similar to Classifier Guidance, but it uses CLIP instead of a classifier: the guidance signal is the similarity between the text and the image.
The guided score used in the DDPM denoising step with CLIP Guidance (all gradients are taken with respect to x_t):
∇ log p(x_t ∣ c) = ∇ log p(x_t) + γ ∇(f(x_t) ⋅ g(c))
where
- c is the guidance condition (text)
- γ is the guidance scale, which controls how strongly the similarity term steers sampling
- f(⋅) and g(⋅) are the CLIP image and text encoders, which embed images and texts into the same space so that their dot product measures similarity
Pseudo-code
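A minimal sketch of one guided denoising step, assuming the common formulation in which the unguided posterior mean is shifted by γ times the variance times the similarity gradient before sampling. The function name guided_denoise_step and its parameters are hypothetical. The mean and variance are passed in directly so the example does not depend on any particular denoiser, and compute_clip_direction is supplied as a callable that returns ∇(f(x_t) ⋅ g(c)).

```python
import numpy as np

def guided_denoise_step(x_t, mean, variance, compute_clip_direction, text, gamma, rng):
    """One DDPM reverse step with CLIP guidance (illustrative sketch).

    mean, variance: unguided posterior mean and diagonal variance of
    p(x_{t-1} | x_t), as a trained denoiser would predict them (assumed inputs).
    compute_clip_direction: callable (x_t, text) -> gradient of f(x_t) . g(c)
    with respect to x_t, same shape as x_t.
    """
    # Gradient of the CLIP similarity with respect to the noisy image.
    direction = compute_clip_direction(x_t, text)
    # Shift the mean along that gradient, scaled by gamma and the variance.
    guided_mean = mean + gamma * variance * direction
    # Sample x_{t-1} from the shifted Gaussian.
    noise = rng.standard_normal(x_t.shape)
    return guided_mean + np.sqrt(variance) * noise


# Toy usage: a constant direction stands in for a real CLIP gradient.
x_t = np.zeros((2, 2))
mean = np.ones((2, 2))
variance = np.full((2, 2), 0.25)
toy_direction = lambda x, text: np.full(x.shape, 2.0)
x_prev = guided_denoise_step(
    x_t, mean, variance, toy_direction, "a cat", gamma=3.0,
    rng=np.random.default_rng(0),
)
```

Setting gamma to 0 recovers the unguided step; larger values push samples harder toward the text at the cost of diversity.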
where compute_clip_direction can be implemented as
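One possible sketch, under loud assumptions: image_encoder and text_encoder are hypothetical callables standing in for f(⋅) and g(⋅), and the gradient is approximated with central finite differences so the example runs with NumPy alone. With a real CLIP model the gradient would normally come from automatic differentiation rather than from this coordinate-by-coordinate loop, which is far too slow for real images.

```python
import numpy as np

def compute_clip_direction(x_t, text, image_encoder, text_encoder, eps=1e-4):
    """Approximate the gradient of f(x_t) . g(c) with respect to x_t.

    image_encoder, text_encoder: hypothetical stand-ins for the CLIP encoders;
    each must return a 1-D embedding in the same space.
    """
    g_c = text_encoder(text)  # g(c): fixed while differentiating w.r.t. x_t

    def similarity(x):
        # Dot-product similarity between image and text embeddings.
        return float(np.dot(image_encoder(x), g_c))

    x = np.asarray(x_t, dtype=np.float64)
    grad = np.zeros_like(x)
    # Central finite differences, one coordinate at a time.
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (similarity(x + step) - similarity(x - step)) / (2 * eps)
    return grad


# Toy usage: random linear maps play the role of the encoders.
rng = np.random.default_rng(0)
W_img = rng.standard_normal((3, 4))   # toy image encoder weights
g_text = rng.standard_normal(3)       # toy text embedding
direction = compute_clip_direction(
    rng.standard_normal((2, 2)), "a cat",
    image_encoder=lambda img: W_img @ img.ravel(),
    text_encoder=lambda t: g_text,
)
```

Binding the two encoders with functools.partial yields a two-argument function of (x_t, text), which is the form a sampling loop would typically call.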
GLIDE