
I'm trying to build a conditional generative neural network that produces "basic sum" solutions using the MNIST dataset, given a requested sum as the conditional input.

I've curated a set of MNIST examples for the digits 0 to 3, and randomly combined them to build a dataset of 100,000 RGB images, with one digit placed in each of the three RGB channels (see below; ignore the false colour). Each combination is labelled with the sum of its three digits.
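For context, the dataset construction described above can be sketched roughly like this. The MNIST arrays are replaced by a placeholder dict of random 28×28 images so the snippet runs stand-alone; `make_sample` and the `digits` dict are illustrative names, not from my actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for real MNIST: a dict mapping each digit label (0-3) to a
# stack of 28x28 grayscale images. Random noise here so the sketch runs
# stand-alone; swap in the actual per-digit MNIST arrays in practice.
digits = {d: rng.integers(0, 256, size=(50, 28, 28), dtype=np.uint8)
          for d in range(4)}

def make_sample(rng, digits):
    """Pick three random digits (0-3), stack one into each of the
    R, G, B channels, and label the result with the digits' sum."""
    labels = rng.integers(0, 4, size=3)
    channels = [digits[d][rng.integers(len(digits[d]))] for d in labels]
    image = np.stack(channels, axis=-1)  # shape (28, 28, 3)
    return image, int(labels.sum())

samples = [make_sample(rng, digits) for _ in range(1000)]
```

Each sample's label therefore falls in the range 0–9 (three digits, each at most 3).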

(Example image: three digits, one per RGB channel, whose sum is 4; this ground-truth image would carry the label 4.)

My goal is to be able to request a new "combination" of hand-drawn images; for example, if I request 4 the network should be able to provide any combination of:

(Three example images, each a combination of digits summing to 4.)

Stacked as an RGB image. The same combination could also appear as a permutation, i.e. with the digits in any RGB channel order.
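To make the target space concrete, the valid outputs for a requested label can be enumerated: every ordered (R, G, B) triple of digits 0–3 with the requested sum. A quick sketch:

```python
from itertools import product

# Enumerate every ordered (R, G, B) digit triple from 0-3 whose digits
# sum to the requested label; these are the combinations (including
# channel permutations) the generator is allowed to produce.
target = 4
triples = [t for t in product(range(4), repeat=3) if sum(t) == target]
print(len(triples))  # 12 ordered triples, e.g. (0, 1, 3), (1, 3, 0), ...
```

So for label 4 there are 12 ordered digit triples, on top of the variation in handwriting style within each digit class.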

So far I've got something roughly working using a U-Net GAN with a CBAM (channel attention) module plus self-attention in both ResNet blocks of the generator, and in the discriminator. However, the generated images usually look poor, and it's largely this poor quality that keeps the success rate low (see example below).

(Generated examples for requested labels 4 and 0: the label-4 sample works, the label-0 sample does not.)

My question is: have I got my thinking wrong? Since the individual pixel values "matter", should I instead treat the three digit images as separate feature embeddings, generate them with three separate U-Net generators, and run cross-attention between the images, rather than relying on channel attention? Any advice would be greatly appreciated.
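For anyone reading, a minimal sketch of the cross-attention idea I'm describing: each channel's flattened feature map attends over the other two channels' features via single-head scaled dot-product attention. All shapes and names here are hypothetical, not my actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d_k):
    """Single-head scaled dot-product attention: every query position
    (one channel's features) attends over all context positions
    (the other channels' features)."""
    scores = query @ context.T / np.sqrt(d_k)   # (N_q, N_c)
    return softmax(scores) @ context            # (N_q, d)

rng = np.random.default_rng(0)
d = 8  # hypothetical feature dimension

# Hypothetical flattened feature maps from three per-channel generators:
# (spatial positions, feature dim)
feats = [rng.standard_normal((16, d)) for _ in range(3)]

# Channel 0's features attend to the other two channels' features.
context = np.concatenate(feats[1:], axis=0)   # (32, d)
attended = cross_attention(feats[0], context, d)  # (16, d)
```

In a real model the queries, keys, and values would come from learned projections (e.g. `torch.nn.MultiheadAttention`) rather than raw features, but the wiring between the three generators would look like this.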

Zintho
