Most online sources recommend using versions of the Inception score to evaluate the synthetic images generated by a GAN. These scores are computed from a pre-trained InceptionV3 model. Does this mean that images need to have similar properties to those in ImageNet? My images have only one channel and are images of climate data, so they have very different properties. What is the best way to evaluate GAN-generated imagery for non-photographic data?
1 Answer
Consider that the problem of evaluating (or measuring) the quality of generated images can be framed in terms of texture quality and/or perceptual quality.
- Texture quality measures how close the generated texture is to the original one. Intuitively, measuring texture amounts to measuring pixel-level differences; for example, you can compute a mean squared error (MSE) over individual pixels (see the sketch after this list).
- Perceptual metrics instead look at the (global) meaning of the image rather than at its individual pixel values. Perceptual quality therefore aims to measure the semantic agreement between two images, ideally reflecting how humans understand them.
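For instance, a minimal texture-quality check (a sketch, assuming both images are single-channel NumPy arrays of the same shape) could look like:

```python
import numpy as np

def mse(real: np.ndarray, generated: np.ndarray) -> float:
    """Pixel-level texture metric: mean squared error between two images."""
    # Cast to float64 so integer inputs don't overflow or truncate.
    diff = real.astype(np.float64) - generated.astype(np.float64)
    return float(np.mean(diff ** 2))
```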
There is, in fact, a trade-off between texture and perceptual quality, meaning the same method can't maximize both. In general, VAEs are said to generate better-looking textures, while GANs provide images with far better semantics: in other words, GANs usually generate plausible images with less accurate textures, although for very large models the gap tends to narrow.
> Does this mean that images need to have similar properties to those in ImageNet?
The Inception score is defined from the activations of a pre-trained Inception-V3, so you're relying on what that model has learned: it should be expected to rank natural images better than something radically different that it never saw during training. As an alternative, you can look at the Fréchet Inception Distance (FID), which also takes the distribution of the real images into account and is generally expected to provide better evaluations.
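For reference, here is a minimal sketch of the FID computation itself, assuming you have already extracted feature arrays `feats_real` and `feats_gen` of shape `(N, D)` from some encoder (Inception-V3 for the standard FID, or any encoder suited to your climate data):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between two sets of feature vectors, shape (N, D).

    FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * (C_r @ C_g)^(1/2))
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    # sqrtm can return tiny imaginary components due to numerical error.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```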
> What is the best way to evaluate GAN-generated imagery for non-photographic data?
This is hard to answer definitively, but you can also try classical image-processing metrics like PSNR or SSIM as alternatives to IS and/or FID. These work for generic kinds of images, and you may want to try both to see which one better aligns with your notion of "better" quality. In general, SSIM should be superior to both MSE and PSNR, since it's designed to better match human perception of images.
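Both are available in scikit-image; a quick sketch on hypothetical single-channel climate fields (the `data_range` argument matters here, since floating-point climate data usually isn't in the [0, 255] range):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical single-channel climate fields, e.g. 128x128 floats.
real = np.random.rand(128, 128)
generated = np.random.rand(128, 128)

data_range = real.max() - real.min()  # required for non-8-bit data
psnr = peak_signal_noise_ratio(real, generated, data_range=data_range)
ssim = structural_similarity(real, generated, data_range=data_range)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```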

Maybe it was implied in this answer, but one could in theory fine-tune (or train from scratch) an Inception or shallower model on some task (a self-supervised task, if no labels are available), and then use it to calculate the IS. It can be expensive, but it is still a solution. One might also find a model pre-trained on similar data, e.g. satellite imagery. – Ciodar Jun 23 '23 at 13:11
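Following up on that comment: the score itself is model-agnostic once you have class probabilities from any classifier, so a minimal sketch of the IS computation from a hypothetical `probs` array of softmax outputs would be:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception score from class probabilities, shape (N, num_classes).

    probs[i] = p(y | x_i) from any classifier (e.g. a model fine-tuned
    on your own data, as suggested in the comment above).
    IS = exp( E_x [ KL( p(y|x) || p(y) ) ] )
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```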