I don't have a definitive answer, but I'll share my intuitions anyway:
Diffusion models are closely related to the idea of stacked denoising autoencoders [Vincent et al. (2010)]. Additionally, U-Net-like architectures are a common choice for autoencoders on images.
Here, I would agree with your intuition: the bottleneck and the skip connections help with denoising because they provide representations at different granularities — the bottleneck carries a coarse, global view of the image, while the skip connections preserve fine-grained detail that would otherwise be lost to downsampling.
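To make that concrete, here is a minimal toy sketch (my own illustration, not an actual diffusion U-Net — no learned weights, just average pooling and nearest-neighbour upsampling standing in for the encoder/decoder) of the data flow: a coarse bottleneck path averages noise away, a skip connection keeps full resolution, and the two are fused at the output.

```python
import numpy as np

def avg_pool2(x):
    """Downsample by 2 via average pooling (toy encoder step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Upsample by 2 via nearest-neighbour repetition (toy decoder step)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet_denoise(noisy):
    skip = noisy                   # skip connection: fine-grained representation
    bottleneck = avg_pool2(noisy)  # bottleneck: coarse, noise-averaged representation
    up = upsample2(bottleneck)     # decoder path back to full resolution
    # A plain 50/50 average stands in for a learned fusion of the
    # concatenated coarse and fine features.
    return 0.5 * (up + skip)

rng = np.random.default_rng(0)
clean = np.ones((8, 8))
noisy = clean + 0.5 * rng.standard_normal((8, 8))

denoised = toy_unet_denoise(noisy)
# The coarse path suppresses noise while the skip path keeps resolution,
# so the fused output sits closer to the clean image than the noisy input.
print(np.abs(noisy - clean).mean(), np.abs(denoised - clean).mean())
```

Of course, a real U-Net learns its down/up projections and how to combine the skip features, but even this crude version shows why mixing granularities is a natural fit for denoising.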
Another thought is that U-Nets are empirically well suited to image segmentation. Even though pixels are classified individually, you want the segmentation output to form consistent regions over the objects in the image. In a sense, the network turns a very noisy "segmentation mask" (the original image input) into a mask with much less noise (the segmented output).
I think the latter point is debatable; I'd be happy to hear your thoughts.