
I'm reading this interesting blog post explaining diffusion probabilistic models and trying to understand the following.

In order to compute the reverse process, we need to consider the posterior distribution $q(\textbf{x}_{t-1} | \textbf{x}_t)$, which is said to be intractable because evaluating it would require the entire dataset; therefore, we need to learn a model $p_\theta$ to approximate these conditional probabilities in order to run the reverse diffusion process.

If we use Bayes theorem we have

$$q(\textbf{x}_{t-1} | \textbf{x}_t) = \frac{q(\textbf{x}_t |\textbf{x}_{t-1})q(\textbf{x}_{t-1})}{q(\textbf{x}_t)}$$

I understand that indeed we don't have any prior knowledge of $q(\textbf{x}_{t-1})$ or $q(\textbf{x}_t)$ since this would mean already having the distribution we are trying to estimate. Is this correct?

The above posterior becomes tractable when conditioned on $\textbf{x}_0$ and we obtain

$$q(\textbf{x}_{t-1} | \textbf{x}_t , \textbf{x}_0) = \mathcal{N}(\tilde{\boldsymbol{\mu}}_t(\textbf{x}_t , \textbf{x}_0) \, , \, \tilde{\beta}_t \textbf{I})$$

So, apparently, we obtain a posterior that can be calculated in closed form when we condition on the original data $\textbf{x}_0$. At this point, I don't understand the role of the model $p_\theta$ : why do we need to tune the parameters of a model if we can already obtain our posterior?
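For reference, the closed-form posterior really is easy to compute once $\textbf{x}_0$ is given. Here is a small numerical sketch, assuming the standard DDPM linear $\beta$ schedule from Ho et al. (2020); the coefficient expressions are the textbook ones for $\tilde{\mu}_t$ and $\tilde{\beta}_t$:

```python
import numpy as np

# Standard DDPM forward-process schedule (linear betas, Ho et al. 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_q(x_t, x_0, t):
    """Closed-form q(x_{t-1} | x_t, x_0): returns (mean, variance).

    mu~_t   = sqrt(abar_{t-1}) * beta_t / (1 - abar_t) * x_0
            + sqrt(alpha_t) * (1 - abar_{t-1}) / (1 - abar_t) * x_t
    beta~_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t
    """
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x_0 + coef_xt * x_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]
    return mean, var
```

Note that the posterior variance $\tilde{\beta}_t$ is always smaller than $\beta_t$, since $\bar{\alpha}_{t-1} > \bar{\alpha}_t$.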

James Arten
  • Don't know much about diffusion models, but I think this iterative conversion of $\mathbf{x}_0$ to a Gaussian is only being done to then train a model to reverse the process. So in my understanding, the whole idea is to not use $\mathbf{x}_0$, otherwise, why would you want to add noise to $\mathbf{x}_0$ in the first place? This might be comparable to the encoding and decoding process of variational autoencoders. The encoding makes the input noisier and the decoder reconstructs the input. Here, you only condition the encoder on $\mathbf{x}_0$, but not the decoder directly. – Chillston Jul 03 '22 at 10:53
  • @Chillston Thank you greatly for your answer! So when we are optimizing the lower bound on the negative log-likelihood even there we are not using $\mathbf{x}_0$, correct? Since I have you here, maybe you can also tell me the intuition behind setting up a variational inference problem with a fixed approximate posterior. I thought the whole point of variational inference was to allow the posterior to belong to a big enough (well-behaved) class so as to get as close as possible to the true posterior in an efficient way. – Monolite Jul 03 '22 at 12:10
  • I hope I understand you right, I really didn't look into diffusion models too deeply yet. How I understand this is: you ultimately want a model that denoises a given random sample and maps it onto the manifold of your data. Because it is not feasible to model $p_{\Theta} : X_T \rightarrow X_0$ directly, the model learns a single denoising step $p_{\Theta} : X_{t} \rightarrow X_{t-1}$ ($1 \leq t \leq T$). By iteratively applying the model you arrive at $X_0$ after $T$ iterations. Thus, given a noisy sample $\mathbf{x}_t$, the model predicts the noise that was added to $\mathbf{x}_{t-1}$. – Chillston Jul 03 '22 at 15:36
  • So the (simplified) training objective is actually the MSE between predicted and actual noise (s. Equation 14 [Ho et al.](https://arxiv.org/pdf/2006.11239.pdf)). So I'd say you need $\mathbf{x}_0$ to optimize the variational bound. – Chillston Jul 03 '22 at 15:37
  • Regarding the second part: In my intuition, having a fixed posterior doesn't limit diversity in the case of denoising models (if that's what you mean), because you never specifically train the model to map $\mathbf{x}_T$ to $\mathbf{x}_0$. Instead, you are training it to slightly improve the quality of corrupted samples. This denoising itself is stochastic. Therefore over multiple denoising steps, the outcome can vary greatly. For an example, see Figure 7 in the [Ho et al. paper](https://arxiv.org/pdf/2006.11239.pdf). – Chillston Jul 03 '22 at 15:38
  • @Chillston let me pick your brain a bit more: is $q(\textbf{x}_{t-1} | \textbf{x}_t , \textbf{x}_0) = \mathcal{N}(\tilde{\boldsymbol{\mu}}_t(\textbf{x}_t , \textbf{x}_0) \, , \, \tilde{\beta}_t \textbf{I})$ (the posterior we could obtain by conditioning on $\textbf{x}_0$) the best possible reverse process we could hope to achieve, or can we somehow learn something better? It seems to me we can't hope to learn a $p_{\Theta}(\textbf{x}_{t-1} | \textbf{x}_t)$ that is better. – Monolite Jul 03 '22 at 18:30
  • @Monolite Hmm, the way I understand it is that calculating the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$ is infeasible because you cannot *guess* the noise $\epsilon$ of the term $\mathbf{x}_{t-1} + \epsilon = \mathbf{x}_{t}$. But when you condition on $\mathbf{x}_{0}$ then you drastically narrow down the distribution of $\epsilon$. This reduces a lot of the variance of the prior guess. – Chillston Jul 04 '22 at 12:04
  • However, as the blog author also states, conditioning on $\mathbf{x}_0$ will result in reconstructing $\mathbf{x}_0$, which is close to the behavior of a VAE. So if you only want to reconstruct, then this might be the best thing to do. If you want to generate data from the manifold, it is not the best option (in my understanding). – Chillston Jul 04 '22 at 12:06
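The simplified training objective mentioned in the comments (Eq. 14 of Ho et al.) can be sketched as follows; note that $\mathbf{x}_0$ is indeed used here, via the forward-process reparameterization. `eps_model` is a hypothetical stand-in for the trained U-Net $\epsilon_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in Ho et al. 2020.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # Placeholder for the U-Net eps_theta(x_t, t); in practice a trained network.
    return np.zeros_like(x_t)

def simple_loss(x_0, t):
    """One Monte Carlo term of L_simple (Eq. 14, Ho et al. 2020):
    || eps - eps_theta(sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, t) ||^2
    """
    eps = rng.standard_normal(x_0.shape)
    # Forward-process reparameterization: x_t as a direct function of x_0 and eps.
    x_t = np.sqrt(alpha_bars[t]) * x_0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

So training needs $\mathbf{x}_0$ (to build $\mathbf{x}_t$ and know the true noise), but the network itself only ever sees $(\mathbf{x}_t, t)$.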

2 Answers


I am also learning diffusion models and would like to give some information.

At this point, I don't understand the role of the model $p_\theta$

To clarify a bit: $p_\theta$ is just another notation for the U-Net, and its role is to receive ($x_t$, $t$) (sometimes also a class label $y$) and predict $x_0$ or $x_{t-1}$, depending on the paper. So at the end of the day, to synthesize new data, given a noisy (usually Gaussian) image, the U-Net can iteratively produce a better estimate of $x_0$ - check out Algorithm 2 in the DDPM paper (2020).
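A minimal sketch of that iterative procedure (Algorithm 2 of the DDPM paper), assuming the linear $\beta$ schedule; `eps_model` is a placeholder for the trained U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in the DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder for the trained U-Net eps_theta(x_t, t).
    return np.zeros_like(x_t)

def sample(shape):
    """DDPM Algorithm 2: start from pure Gaussian noise, denoise step by step."""
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        eps = eps_model(x, t)
        # Reverse-process mean implied by the predicted noise; no x_0 needed.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z        # sigma_t^2 = beta_t variant
    return x
```

With a real trained `eps_model`, the loop maps Gaussian noise onto the data manifold; with this zero placeholder, it just illustrates the control flow.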

Your question about the posterior might be answered in more detail here: Diffusion Models | Paper Explanation | Math Explained - YouTube

Check the video around 18:00, which explains in more detail how $x_0$ guides the process when optimizing the lower bound.

lqi
  • Why is noise added at each timestamp during the reverse process? – Nathan B Jan 24 '23 at 09:15
  • @NathanB I am assuming you mean the DDIM way of sampling. DDIM predicts $x_0$ and adds the noise back to obtain $x_{t-1}$ during the reverse process. From my understanding, this strategy turns the diffusion process from Markovian to non-Markovian. The benefits of DDIM include faster generation, a better sampling strategy, etc. I would recommend the cold diffusion paper and the original DDIM paper for a deeper understanding. – lqi Jan 25 '23 at 17:34
  • What will happen if you don't add some noise in the reverse process at every step? After all, you start from a random noise image, so you already have the randomness needed for the sampling, why add more noise every step backwards? – Nathan B Jan 26 '23 at 11:32
  • @NathanB you can do it without adding noise during sampling; it's just a different sampling strategy. In the original DDIM paper, sampling can be made fully deterministic, i.e. no noise is added when stepping backward. – lqi Jan 27 '23 at 18:48
  • Is the purpose to basically add more 'randomness' to the process so that the result will be more 'creative'? – Nathan B Jan 29 '23 at 11:22
  • @NathanB I can't give you the conclusion that it's adding more "randomness" to be more "creative", as I haven't researched it fully enough. However, one main purpose is changing the Markovian process to a non-Markovian one, which allows faster generation. Different sampling strategies have different pros and cons; I highly suggest reading papers like cold diffusion, stable diffusion, ILVR, etc. to get a deeper intuition of those sampling strategies. – lqi Jan 30 '23 at 02:20
  • Still can't find an answer to the question: Why in the DDIM a noise is added every step of the reverse process, and if it makes it better. – Nathan B Apr 23 '23 at 10:36

You do not yet have $\mathbf{x}_0$ during sampling (only during training). That's why you need to approximate $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ with $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ via variational inference, i.e., by minimizing a KL divergence between the two. After training on good data, sampling with $p_\theta$ should produce an approximation of $\mathbf{x}_0$.
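One way to see this concretely: at sampling time, the model's noise prediction plays the role of the missing $\mathbf{x}_0$, because the forward relation $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ can be inverted. A minimal sketch, assuming the standard DDPM linear $\beta$ schedule:

```python
import numpy as np

# Standard DDPM linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def x0_estimate(x_t, eps_pred, t):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    to recover the model's implicit guess of x_0 from its noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])
```

This estimate can then be plugged into the tractable posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \hat{\mathbf{x}}_0)$, which is exactly how $p_\theta$ supplies the $\mathbf{x}_0$ that is unavailable at sampling time.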

Eureka Zheng