I am attempting to implement a denoising diffusion model based on the one introduced by Ho et al. (2020). However, I have run into issues while testing the reverse diffusion process.

Walking through my PyTorch code, I first load an image x, normalize it to $[-1, 1]$, and generate a fixed noise tensor e ($\epsilon$) of the same size as x. I also define a 9-step noise schedule, where a and d correspond to $\alpha$ and $\overline\alpha$ in the paper. The calls to unsqueeze and view only serve to make the dimensions broadcast-compatible for the computations that follow.

x = read_image("data/in/dog.jpg").unsqueeze(0) / 255 * 2 - 1
e = torch.randn(x.shape)
a = torch.linspace(0.9, 0.5, 9).view(-1, 1, 1, 1, 1)
d = a.cumprod(0)

Then, I gradually apply noise to the original image, concatenating each new sample along the first dimension. This follows the closed-form expression $x_t = \sqrt{\overline\alpha_t} x_0 + \sqrt{1 - \overline\alpha_t} \epsilon$.

for t in range(9):
    x = torch.cat((x, d[t].sqrt() * x[0] + (1 - d[t]).sqrt() * e))

This results in a gradual noising process, as expected.
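One sanity check that convinced me the forward pass itself is consistent: since every noised sample is built from the same fixed e, the cumulative formula can be inverted in a single shot to recover $x_0$ exactly. A minimal sketch with toy stand-ins for my tensors (same names, smaller shapes):

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the tensors above: a small "image" x0,
# a fixed noise tensor e, and the cumulative schedule d.
x0 = torch.rand(1, 3, 8, 8) * 2 - 1
e = torch.randn(x0.shape)
a = torch.linspace(0.9, 0.5, 9).view(-1, 1, 1, 1)
d = a.cumprod(0)

t = 5
xt = d[t].sqrt() * x0 + (1 - d[t]).sqrt() * e  # forward closed form

# Single-shot inversion: solve the same equation for x0.
x0_hat = (xt - (1 - d[t]).sqrt() * e) / d[t].sqrt()
print(torch.allclose(x0_hat, x0, atol=1e-4))  # True
```

So the information to undo the noising is clearly all there; the problem only appears when I try to undo it step by step.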

[image: the image at each step of the gradual noising process]

Next, I attempt to remove the noise using the formula $x_{t-1} = \frac{x_t - \sqrt{1 - \alpha_t}\,\epsilon}{\sqrt{\alpha_t}}$, obtained by solving $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon$ for $x_{t-1}$.

for t in range(8, -1, -1):
    x = torch.cat((x, (x[-1:] - (1 - a[t]).sqrt() * e) / a[t].sqrt()))

This, however, produces a strange result: the image is denoised up to a certain point, but grows noisier again thereafter.

[image: the reverse process, where the noise first decreases and then increases again]

I believe this is because the $\epsilon$ in $x_t = \sqrt{\overline\alpha_t} x_0 + \sqrt{1 - \overline\alpha_t} \epsilon$ is not the same as the one in $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon$. Still, it seems intuitive that the denoising process should deterministically lead back to the original image when the noise applied at each time step is known. Why is this not the case?
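To check this suspicion numerically, I can compute the per-step noise $\epsilon_t$ that the relation $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_t$ would have required, given samples built with the cumulative formula, and compare it with the fixed e. A toy sketch (tensor names mirror my code, shapes reduced):

```python
import torch

torch.manual_seed(0)

x0 = torch.rand(1, 3, 8, 8) * 2 - 1
e = torch.randn(x0.shape)
a = torch.linspace(0.9, 0.5, 9).view(-1, 1, 1, 1)
d = a.cumprod(0)

# Samples from the cumulative formula, all built with the SAME e.
xs = [x0] + [d[t].sqrt() * x0 + (1 - d[t]).sqrt() * e for t in range(9)]

# The noise that the per-step relation would have required at step t:
t = 5
e_t = (xs[t + 1] - a[t].sqrt() * xs[t]) / (1 - a[t]).sqrt()

print(torch.allclose(e_t, e))  # False
```

The implied step-wise noise turns out to be a scaled-down version of e, not e itself, which matches my suspicion.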

EDIT: After a lot of fiddling around, I have managed to use the formula $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} x_t - \frac{1 - \overline\alpha_t - \sqrt{(1 - \overline\alpha_t)(\alpha_t - \overline\alpha_t)}}{\sqrt{1 - \overline\alpha_t}\sqrt{\alpha_t}}\epsilon$ to correctly reverse the diffusion process. This, however, raises more questions than answers. I notice that the formula resembles the mean of the ground-truth denoising distribution $q(x_{t-1}\vert x_t, x_0)$, $\mu_q(x_t, x_0)=\frac{1}{\sqrt{\alpha_t}} x_t - \frac{1 - \alpha_t}{\sqrt{1 - \overline\alpha_t}\sqrt{\alpha_t}}\epsilon$, but is not equal to it. What is going on here?
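For reference, this is the sanity check I used to confirm that update (toy shapes, same tensor names as my code; note that the $\epsilon$ term is subtracted, which is what solving the two closed-form expressions for $x_{t-1}$ requires):

```python
import torch

torch.manual_seed(0)

x0 = torch.rand(1, 3, 8, 8) * 2 - 1
e = torch.randn(x0.shape)
a = torch.linspace(0.9, 0.5, 9).view(-1, 1, 1, 1)
d = a.cumprod(0)

# Start from the noisiest sample of the forward process.
xt = d[-1].sqrt() * x0 + (1 - d[-1]).sqrt() * e

# Apply the corrected update all the way back to t = 0.
for t in range(8, -1, -1):
    coef = (1 - d[t] - ((1 - d[t]) * (a[t] - d[t])).sqrt()) \
           / ((1 - d[t]).sqrt() * a[t].sqrt())
    xt = xt / a[t].sqrt() - coef * e

print(torch.allclose(xt, x0, atol=1e-4))  # True
```

With this update the chain lands back on $x_0$ up to floating-point error, so the formula does exactly invert the fixed-$\epsilon$ forward process.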
