Once again, γ is a term that controls how much our learned conditional model cares about the conditioning information.
Another example is when the condition is text.
In the U-Net architecture, for example, the text embeddings are typically concatenated with the input features at certain stages (e.g., at the bottleneck or in the skip connections). This allows the U-Net to leverage the contextual information provided by the text when making its predictions.
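To make this concrete, here is a minimal sketch (PyTorch assumed; the module and parameter names are hypothetical, not taken from any specific library) of concatenating a pooled text embedding with the bottleneck features and fusing the result back to the original channel count:

```python
import torch
import torch.nn as nn

class TextConditionedBottleneck(nn.Module):
    """Hypothetical U-Net bottleneck block that mixes in a pooled text embedding."""

    def __init__(self, feat_channels: int, text_dim: int):
        super().__init__()
        # Project the text embedding into the feature-channel space.
        self.text_proj = nn.Linear(text_dim, feat_channels)
        # Fuse the concatenated (image + text) channels back down to feat_channels.
        self.fuse = nn.Conv2d(feat_channels * 2, feat_channels, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) bottleneck features; text_emb: (B, text_dim) pooled embedding.
        b, c, h, w = feats.shape
        t = self.text_proj(text_emb)                  # (B, C)
        t = t[:, :, None, None].expand(b, c, h, w)    # broadcast over the spatial grid
        return self.fuse(torch.cat([feats, t], dim=1))
```

Many text-to-image U-Nets instead attend to the full token sequence via cross-attention; concatenating a pooled embedding is simply the most basic variant of the same idea.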
During training, Classifier-Free Guidance in principle requires two models: an unconditional generation model and a conditional generation model. In practice, both can be represented by a single network: during training, the condition is simply replaced with a null token with some fixed probability.
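A minimal sketch of that training step, assuming a PyTorch noise-prediction model (`model`, `null_cond`, and the other argument names are placeholders, not a specific library's API):

```python
import torch
import torch.nn.functional as F

def cfg_training_step(model, x_noisy, t, noise, cond_emb, null_cond, p_uncond=0.1):
    """One training step in which each sample's condition is dropped (replaced by a
    null embedding) with probability p_uncond, so a single network learns both the
    conditional and the unconditional model."""
    b = cond_emb.shape[0]
    drop = torch.rand(b, device=cond_emb.device) < p_uncond         # (B,) boolean mask
    cond_used = torch.where(drop[:, None], null_cond.expand(b, -1), cond_emb)
    noise_pred = model(x_noisy, t, cond_used)                       # predict the added noise
    return F.mse_loss(noise_pred, noise)
```

Here `null_cond` could be a fixed all-zeros vector or a learned embedding, and drop probabilities around 10–20% are typical.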
During inference, the final prediction is obtained by linearly extrapolating from the unconditional output toward the conditional one. The guidance coefficient γ controls how far to extrapolate, trading off the realism and the diversity of the generated samples.
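In one common convention the guided noise prediction is ε̂ = ε_uncond + γ·(ε_cond − ε_uncond); other papers fold the weights slightly differently, but the idea of extrapolating past the unconditional prediction is the same. A minimal sketch of the guidance step at inference, reusing the hypothetical model signature from the training sketch above:

```python
import torch

@torch.no_grad()
def cfg_predict(model, x_t, t, cond_emb, null_cond, guidance_scale):
    """Guided noise prediction: extrapolate from the unconditional output toward
    the conditional one by a factor of guidance_scale (the γ above)."""
    eps_cond = model(x_t, t, cond_emb)                         # conditional prediction
    eps_uncond = model(x_t, t, null_cond.expand_as(cond_emb))  # unconditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1` this reduces to plain conditional sampling; larger values push samples toward the condition (more realism, less diversity). In practice the conditional and unconditional passes are often batched into a single forward call to avoid doubling latency.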