I'm trying to build a conditional generative neural network that produces "basic sum" solutions to simple arithmetic, using the MNIST dataset.
I've curated a subset of MNIST examples with digits 0 to 3, and randomly combined them into a dataset of 100,000 RGB images, each containing three digits, one per RGB channel (see below; ignore the false colour). Each combination is labelled by the sum of its three digits.
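For concreteness, here's a minimal sketch of how I build each example. The digit pool below is a random-noise placeholder (in practice it comes from the real MNIST images); `make_example` and the pool structure are just illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the curated MNIST digits: dict mapping digit (0-3) ->
# array of 28x28 grayscale images. Real MNIST images would go here;
# random noise is used purely so the sketch is self-contained.
digit_pool = {d: rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
              for d in range(4)}

def make_example(rng):
    """Stack three random digits (0-3) into the R, G, B channels;
    the label is their sum (0..9)."""
    digits = rng.integers(0, 4, size=3)
    channels = [digit_pool[int(d)][rng.integers(len(digit_pool[int(d)]))]
                for d in digits]
    img = np.stack(channels, axis=-1)  # (28, 28, 3) RGB image
    return img, int(digits.sum())

img, label = make_example(rng)
assert img.shape == (28, 28, 3) and 0 <= label <= 9
```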
This ground truth image would have the label 4.
My goal is to be able to request a new "combination" of hand-drawn images; for example, if I request 4 the network should be able to provide any combination of:
Stacked as an RGB image. The same combination could also appear in any channel order (i.e. any RGB permutation).
So far I've got something roughly working using a U-Net GAN with CBAM (channel-attention) and self-attention modules in both ResNet blocks of the generator, and in the discriminator. However, the generated images usually look poor, and that poor image quality is the main reason the success rate is low (see example below).
(Labels requested 4 and 0, respectively).
My question is: have I got my thinking all wrong? Since the individual pixel values "matter", should I instead treat each channel's image as a separate feature embedding, generate the three channels with three separate U-Net generators, and run cross-attention between them, rather than relying on channel attention? Any advice would be greatly appreciated.
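To make the alternative concrete, this is roughly the cross-attention step I have in mind: flattened spatial features from one channel's U-Net attend to features from another channel's U-Net. This is only a single-head numpy sketch with randomly initialised (not learned) projections; the function name, shapes, and head dimension are all assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d_k=16):
    """Single-head cross-attention between two channels' feature maps.

    q_feats:  (N_q, d)  flattened spatial tokens from the query channel
    kv_feats: (N_kv, d) flattened spatial tokens from the key/value channel
    """
    rng = np.random.default_rng(0)
    d = q_feats.shape[-1]
    # Hypothetical learned projections, randomly initialised for the sketch.
    W_q, W_k, W_v = (rng.standard_normal((d, d_k)) / np.sqrt(d)
                     for _ in range(3))
    Q, K, V = q_feats @ W_q, kv_feats @ W_k, kv_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N_q, N_kv) attention weights
    return attn @ V                         # (N_q, d_k) attended features

# e.g. 7x7 bottleneck feature maps with 32 channels -> 49 tokens each
red = np.random.default_rng(1).standard_normal((49, 32))
green = np.random.default_rng(2).standard_normal((49, 32))
out = cross_attention(red, green)
assert out.shape == (49, 16)
```

In a real model each of the three generators would expose tokens like this at one or more resolutions, with learned projection weights shared or separate per channel pair.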