
I want to know why diffusion models always use U-Net.

In my opinion, they use U-Net because it exposes features at different resolutions, and the skip connections help restore image detail. But I am not sure whether that is the main reason.

Are there other reasons they choose U-Net rather than other architectures?

Penguin.jpg

1 Answer


I don't have a definitive answer, but I'll state my intuitions anyway:

Diffusion models are closely related to the idea of stacked denoising autoencoders [Vincent et al. (2010)]. Additionally, U-Net-like architectures are very common for autoencoders on images. Here, I would agree with your intuition: the bottleneck and the skip connections help with denoising because they provide representations at different levels of granularity.
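For what it's worth, the coarse/fine intuition can be made concrete with a tiny NumPy sketch. This is my own toy illustration, not an actual U-Net: `downsample` and `upsample` here are stand-ins for the strided and transposed convolutions a real U-Net would learn.

```python
# Toy sketch of the U-Net intuition: a coarse (bottleneck) path that
# averages away noise, plus a skip connection carrying full-res detail.
import numpy as np

def downsample(x):
    # 2x2 average pooling ("encoder" step). Averaging cancels part of
    # the i.i.d. noise, so the coarse path is cleaner than the input.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbour upsampling ("decoder" step).
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
clean = np.outer(np.linspace(0, 1, 16), np.linspace(0, 1, 16))  # smooth "image"
noisy = clean + 0.5 * rng.standard_normal(clean.shape)

coarse = downsample(noisy)    # bottleneck path: lower resolution, less noise
restored = upsample(coarse)   # back to full resolution, but blurred

# In a U-Net the decoder sees both paths stacked as channels; the skip
# connection supplies the high-frequency detail the coarse path lost.
decoder_input = np.stack([restored, noisy])

err_noisy = np.abs(noisy - clean).mean()      # error of the raw noisy input
err_coarse = np.abs(restored - clean).mean()  # coarse path is closer to clean
```

The point of the last two lines: even this crude average-pool "bottleneck" yields a reconstruction closer to the clean image than the noisy input, which is the denoising benefit of the coarse path; the skip connection then lets the decoder add back the fine detail that pooling blurred away.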

Another thought is that U-Nets are empirically well suited to image segmentation. Even though pixels are classified individually, you want the segmentation output to form consistent regions over the objects in the image. So, in a sense, you turn a very noisy segmentation mask (the original image input) into a mask with much less noise (the segmented output).

I think the latter point is debatable; I'd be happy to hear your thoughts.

Chillston
  • Hi, thanks for the reply. By "bottleneck", do you mean the 1x1 conv or the process of lowering resolution? If the former, I saw an [article](https://benanne.github.io/2022/01/31/diffusion.html) saying that diffusion models care more about the output than about latent representations, which I think is true. Maybe that is also one of the reasons they remove the bottleneck layer from the architecture? As for image segmentation, I am not familiar with it, so I am sorry that I cannot see the connection XD. – Penguin.jpg Sep 25 '22 at 06:05
  • With bottleneck I mean the decrease in layer size. This forces the encoding to discard some information, which should improve the denoising properties: the noise cannot be encoded in the lower-dimensional representation. – Chillston Sep 26 '22 at 11:04
  • Regarding the image segmentation part: just look at [this image](https://nanonets.com/blog/content/images/2020/08/koeln00.png) and you will see what image segmentation means. What you see there is the model output (colors) superimposed on the model input (a regular image). The goal of the model is to classify every pixel based on which object it belongs to (colors denote object classes). Thus, you could think of this task as transforming a very noisy version of the segmentation map (the regular image) into a very clean segmentation map. I hope this made it somewhat clearer? – Chillston Sep 26 '22 at 11:06
  • Yes, it is clearer, thanks for the explanation! In that case, I think I prefer the former idea. Maybe it really is just a simple reason like this that led them to use it :o – Penguin.jpg Sep 26 '22 at 11:19
  • I think Stable Diffusion uses a U-Net because diffusion probabilistic models used a U-Net; DPM used a U-Net because PixelCNN++ used one; and PixelCNN++ used it because it was 2017, that's what they had, and it was good enough. Deep learning research has always involved cargo-cult empiricism with weak (if any) justification. That said, the alternatives are things like Mask R-CNN, and RPNs don't make any sense on latent representations. – stuart May 14 '23 at 05:45
  • I totally get your point and agree to some extent. I still think it is very fruitful to try to read a little more into why certain architectures are used in a given framework. This has led to great advances; e.g., the whole area of geometric deep learning is in a way an artifact of identifying the benefits of specific models and casting them into more general versions. – Chillston May 17 '23 at 16:10