This article is a simplified interpretation of DDPM (Denoising Diffusion Probabilistic Models).
The overall goal of diffusion models is to learn a real-world data distribution $p(x)$. DDPM does this with a step-by-step Markov chain:
$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$
where we set $p(x_T) = \mathcal{N}(x_T \mid 0, I)$, and
$$p_\theta(x_{0:T}) = p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
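As a concrete (purely illustrative) instance, with $T = 2$ the joint factorizes as
$$p_\theta(x_{0:2}) = p_\theta(x_2)\, p_\theta(x_1 \mid x_2)\, p_\theta(x_0 \mid x_1),$$
so generating $x_0$ amounts to drawing $x_2$ from the Gaussian prior and applying the two learned transitions in turn.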
We don't know what $p_\theta(x_{t-1} \mid x_t)$ is, so we apply Bayes' theorem:
$$p_\theta(x_{t-1} \mid x_t) = \frac{p_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1})}{p_\theta(x_t)}$$
Recall the definition of the forward (noising) transition distribution:
$$p_\theta(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t \mid \sqrt{\alpha_t}\, x_{t-1},\ (1-\alpha_t) I\right)$$
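As a minimal sketch of this transition (assuming a PyTorch tensor for $x_{t-1}$ and a scalar $\alpha_t$ taken from some noise schedule, which is not specified in this section), one noising step just scales the previous sample and adds Gaussian noise:

```python
import torch

def forward_step(x_prev, alpha_t):
    # Sample x_t ~ N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) * I).
    # alpha_t is a scalar from an (assumed) noise schedule.
    noise = torch.randn_like(x_prev)
    return (alpha_t ** 0.5) * x_prev + ((1.0 - alpha_t) ** 0.5) * noise
```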
From this transition we can derive the closed-form marginals $p_\theta(x_{t-1} \mid x_0)$ and $p_\theta(x_t \mid x_0)$. Conditioning on $x_0$ and using the Markov property $p_\theta(x_t \mid x_{t-1}, x_0) = p_\theta(x_t \mid x_{t-1})$, we then have
$$p_\theta(x_{t-1} \mid x_t, x_0) = \frac{p_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid x_0)}{p_\theta(x_t \mid x_0)} = \frac{\mathcal{N}\!\left(x_t \mid \sqrt{\alpha_t}\, x_{t-1},\ (1-\alpha_t) I\right)\, \mathcal{N}\!\left(x_{t-1} \mid \sqrt{\bar\alpha_{t-1}}\, x_0,\ (1-\bar\alpha_{t-1}) I\right)}{\mathcal{N}\!\left(x_t \mid \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\right)} = \mathcal{N}\!\left(x_{t-1} \mid \mu(x_t, x_0),\ \sigma_t^2 I\right)$$
where
$$\mu(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t + \frac{\sqrt{\bar\alpha_{t-1}}\,(1-\alpha_t)}{1-\bar\alpha_t}\, x_0, \qquad \sigma_t^2 = \frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$$
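These two formulas transcribe directly into code. The sketch below assumes scalar schedule values `alpha_t`, `alpha_bar_t` ($\bar\alpha_t$), and `alpha_bar_prev` ($\bar\alpha_{t-1}$); the argument names are hypothetical:

```python
def posterior_mean_variance(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    # Mean and variance of p(x_{t-1} | x_t, x_0) from the formulas above.
    coef_xt = (alpha_t ** 0.5) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    coef_x0 = (alpha_bar_prev ** 0.5) * (1.0 - alpha_t) / (1.0 - alpha_bar_t)
    mean = coef_xt * x_t + coef_x0 * x_0
    var = (1.0 - alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return mean, var
```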
However, we want $p_\theta(x_{t-1} \mid x_t)$, not $p_\theta(x_{t-1} \mid x_t, x_0)$. To resolve this, we train a model $\hat{x} = \hat{x}_\theta(x_t)$ to predict $x_0$ from $x_t$, so that the distribution depends only on $x_t$. Therefore
$$p_\theta(x_{t-1} \mid x_t) \approx p_\theta\!\left(x_{t-1} \mid x_t, x_0 = \hat{x}_\theta(x_t)\right) = \mathcal{N}\!\left(x_{t-1} \mid \mu(x_t, \hat{x}_\theta(x_t)),\ \sigma_t^2 I\right)$$
This is what denoising means: at each step the model's estimate of $x_0$ is used to sample a slightly less noisy $x_{t-1}$ from $x_t$.
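Putting the pieces together, the reverse (denoising) chain can be sketched as the loop below. It reuses the `posterior_mean_variance` helper from the earlier sketch and assumes a trained `model(x_t, t)` playing the role of $\hat{x}_\theta$ (in practice the network is also given the step index $t$); `alphas` and `alpha_bars` are assumed 1-D tensors holding $\alpha_t$ and $\bar\alpha_t$:

```python
import torch

@torch.no_grad()
def sample(model, alphas, alpha_bars, shape):
    # Start from pure noise x_T ~ N(0, I) and denoise step by step.
    x_t = torch.randn(shape)
    T = len(alphas)
    for t in range(T - 1, -1, -1):
        x0_hat = model(x_t, t)  # predicted x_0, i.e. x_hat_theta(x_t)
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        mean, var = posterior_mean_variance(
            x_t, x0_hat, alphas[t], alpha_bars[t], alpha_bar_prev
        )
        # Sample x_{t-1}; no noise is added at the final step.
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + var ** 0.5 * noise
    return x_t
```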