
I understand that, at a high level, tools like Midjourney use diffusion models to generate images from text. What I don't understand is which type of network architecture allows the second step of their workflow: generating new, similar images conditioned on a selected input image. This post seems to mention ControlNet as a way to condition image output on an input image, but ControlNet appears to require additional training to yield useful results, so I don't see how that would work in a Midjourney-like workflow.

Any links to relevant publications or blog articles would help as well.


1 Answer


Midjourney (and, I think, DALL-E 2) conditions its image outputs on a concept vector (or "embedding"), which can be produced in at least two ways:

  • By summarising text input
  • By converting an image

The concept space is the same for both routes, although the input encoders are different architectures (a Transformer for text, a CNN or vision Transformer for images). Because the outputs of multiple text prompts and image prompts live in the same space, they can be combined using a weighted mean, and this is how multi-prompts plus image weighting work. Everything goes into a single embedding vector that is used to condition the reverse diffusion (denoising) steps.

The CLIP model by OpenAI is an example of an architecture that does this. It is trained on image-caption pairs to produce a pair of encoders that map either text or image inputs into a shared embedding or "concept space". Midjourney is likely using its own version of CLIP, perhaps trained on its own dataset, or something very similar - its models are proprietary though, so this is an educated guess.
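As a concrete illustration, here is a minimal sketch using OpenAI's publicly released CLIP weights via the Hugging Face transformers library. Midjourney's own encoder is proprietary, so this only demonstrates the general mechanism of a shared text/image embedding space; the file name `reference.png` is a hypothetical input.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (not Midjourney's proprietary encoder)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text prompt -> embedding
text_inputs = processor(text=["a watercolour fox in a forest"],
                        return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_inputs)       # shape (1, 512)

# Image prompt -> embedding in the *same* concept space
image = Image.open("reference.png")                      # hypothetical input image
image_inputs = processor(images=image, return_tensors="pt")
image_emb = model.get_image_features(**image_inputs)     # shape (1, 512)

# Because both live in one space, they can be compared or blended directly
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = (text_emb @ image_emb.T).item()
```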

Note that this model is separate from the diffusion model and produces the conditioning inputs for it. When you start a generation with multiple image and text prompts:

  • Each image prompt is passed to the image encoder, and the mean of all the image output vectors is taken
  • Each text sub-prompt is passed to the text encoder, and the mean of all the text output vectors is taken
  • The mean image and mean text vectors are combined, weighted by the --iw parameter. This is done once, before the main image generation starts (see the sketch after this list)
  • The final combined vector is used to condition the steps of the diffusion denoising image generator
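In code, that combination step could look roughly like the following. The function name and the exact weighting formula are my own guesses for illustration; only the --iw parameter is a real Midjourney option, and its precise effect is not public.

```python
import torch

def combine_prompts(text_embs, image_embs, image_weight=0.25):
    """Blend mean text and mean image embeddings into one conditioning vector.

    `image_weight` plays the role of Midjourney's --iw parameter; the exact
    formula Midjourney uses is not published, so this is only a plausible sketch.
    """
    text_mean = torch.stack(text_embs).mean(dim=0)    # mean over text sub-prompts
    image_mean = torch.stack(image_embs).mean(dim=0)  # mean over image prompts
    # Weighted mean of the two modality means
    combined = (1.0 - image_weight) * text_mean + image_weight * image_mean
    return combined / combined.norm()                 # renormalise before conditioning

# The resulting vector is computed once, then passed unchanged to every denoising step.
```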

ControlNet is an entirely different way to influence the diffusion denoising function. Midjourney, as of version 5.2, does not have any equivalent.

The base Midjourney process also does not use img2img, which is another way to use an input image. Img2img starts the denoising from a partially-noised version of the input image instead of from pure noise. Midjourney's variations, zoom and pan features do use this kind of influence, so the underlying model clearly supports it, but the service does not let customers control it fully, for instance by uploading their own images, because of the potential for abuse.
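For contrast, here is a rough sketch of the img2img idea in standard DDPM notation: the input image is pushed part-way along the forward noising schedule, and the denoiser then runs only the remaining steps. The schedule values and the `strength` knob are generic diffusion-model conventions, not Midjourney internals, which are proprietary.

```python
import torch

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)   # standard DDPM noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

x0 = torch.rand(1, 3, 64, 64)                   # the user-supplied image (normalised)
strength = 0.6                                  # fraction of the schedule to re-run
t = int(num_steps * strength) - 1               # start step: higher = less like x0

# Partially noise the input instead of starting from pure Gaussian noise:
# x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
noise = torch.randn_like(x0)
x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise

# The denoiser would then run only steps t..0, conditioned on the prompt embedding,
# so the result stays structurally close to x0 for small `strength` values.
```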
