
It's mentioned here that there is no measure of intra-class diversity with the inception score:

If your generator generates only one image per classifier image class, repeating each image many times, it can score highly (i.e. there is no measure of intra-class diversity)

However, isn't it "easy" to look at the variance of the outputs of the classifier for a given class (e.g. if you only output 0.97 for all the images of a given GAN class, there is no intra-class diversity, but if you output 0.97, 0.95, 0.99, 0.92, there is diversity)? I'm struggling to understand why this is hard to do (but I might be missing something!).
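As a minimal sketch of what I mean (the toy numbers and the function name are just for illustration, not from any real model):

```python
# Group generated images by the classifier's predicted class and look at the
# spread of the winning-class probability within each group.
import numpy as np

def per_class_probability_spread(probs: np.ndarray) -> dict:
    """probs: (n_images, n_classes) softmax outputs of the classifier."""
    winners = probs.argmax(axis=1)
    spread = {}
    for c in np.unique(winners):
        p_win = probs[winners == c, c]       # confidence assigned to the winning class
        spread[int(c)] = float(p_win.std())  # 0.0 if the generator always scores e.g. 0.97
    return spread

# Toy check: 0.97, 0.95, 0.99, 0.92 for the same class gives a small but non-zero spread.
probs = np.array([[0.97, 0.03], [0.95, 0.05], [0.99, 0.01], [0.92, 0.08]])
print(per_class_probability_spread(probs))
```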

  • Can you please put your **specific question** in the title? I understand that you're confused about the inception score and how it is related to "intra-class diversity" (whatever that really means), but that is not a question. A question is something like "Why doesn't the inception score ...?" – nbro May 06 '22 at 10:11
  • 1
    Done! I edited my the post. – FluidMechanics Potential Flows May 06 '22 at 13:48

2 Answers


For reference, a recap of the Inception Score: an Inception classifier is run on the generated images, and the score is high when each generated image receives a confident (low-entropy) class prediction while the marginal class distribution over all generated images is broad; if the predictions are unconfident, or the generator covers only a few classes, the score is low.
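For concreteness, here is a minimal sketch of that computation, following the usual formulation IS = exp(E_x[KL(p(y|x) || p(y))]) over generated images only (shapes and names are illustrative):

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (n_generated_images, n_classes) softmax outputs of an Inception model."""
    p_y = probs.mean(axis=0, keepdims=True)              # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident per-image predictions plus a broad marginal give a high score,
# regardless of how similar the images within one class actually look.
```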

However, isn't it "easy" to look at the variance of the outputs of the classifier for a given class

Say you want to generate multiple horses and the model learns to generate horses with different colors but always in the same pose - then your class probabilities will vary, but I wouldn't call this very diverse horse generation. This is how I would understand what is meant by your cited statement.

The output distributions from the inception model contain class information but very little information of specific image features. Thus, the inception score cannot be sensitive to intra-class variations of the generator.
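A toy illustration of that last point (the numbers are invented, not real model outputs): if a visually diverse batch and a same-pose batch happen to get the same class probabilities, no statistic computed from those probabilities can separate them.

```python
import numpy as np

# Hypothetical "horse" predictions from two very different batches of images.
probs_diverse_batch   = np.array([[0.97, 0.03], [0.95, 0.05], [0.99, 0.01]])  # varied poses
probs_same_pose_batch = np.array([[0.97, 0.03], [0.95, 0.05], [0.99, 0.01]])  # same pose, recoloured

print(np.allclose(probs_diverse_batch, probs_same_pose_batch))  # True: identical outputs
print(probs_diverse_batch[:, 0].std())                          # same spread for both batches
```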

Chillston
  • "multiple horses and the model learns to generate horses with different colors but always in the same pose", yes you're right, I haven't thought of that, variance of the output's classifier would then vary but there isn't as much variability as we'd expect – FluidMechanics Potential Flows May 07 '22 at 12:30

Adding on top of Chillston's answer:

Regarding the variance, it is unfortunately not so straightforward. The problem is that most deep learning models are not calibrated, hence small intra-class variations might lead to large probability variations for the winning class.

Maybe one way to account for this issue would be to compute the mutual information between the generated predictions and a prior expected distribution, for example a uniform distribution 1/n with n the number of expected modes within a class (like horse poses, to use the same example as Chillston), but I found no reference about similar attempts, and coming up with a proper prior expected distribution doesn't sound trivial at all. I guess the reason is that the inception score was designed for generic GANs (i.e. GANs trained to generate generic classes from CIFAR and similar datasets), so measuring variability within classes was not a design goal.
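Just to sketch what I mean (purely illustrative, since as said above I found no reference for this; `mode_assignments` would have to come from some extra model that recognises within-class modes such as horse poses, and I read the comparison against the uniform prior as a KL divergence):

```python
import numpy as np

def within_class_diversity(mode_assignments: np.ndarray, n_modes: int, eps: float = 1e-12) -> float:
    """KL divergence between the empirical mode distribution of generated samples for one
    class and a uniform prior over the n expected modes (0 means perfectly uniform coverage)."""
    counts = np.bincount(mode_assignments, minlength=n_modes)
    p_hat = counts / counts.sum()
    prior = np.full(n_modes, 1.0 / n_modes)
    return float((p_hat * (np.log(p_hat + eps) - np.log(prior + eps))).sum())

# 100 generated horses all landing in one of 5 expected poses -> large KL,
# while an even spread over the 5 poses -> KL close to 0.
print(within_class_diversity(np.zeros(100, dtype=int), n_modes=5))
print(within_class_diversity(np.random.randint(0, 5, size=100), n_modes=5))
```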

Edoardo Guerriero
  • That's a good point - Inception is trained to be invariant to intra-class variations, so to capture this a different model, or at least a differently trained Inception model, might be better suited. Maybe something that identifies data transformations, like a Capsule Net. – Chillston May 06 '22 at 16:00
  • "small intra-class variation might lead to large probability variations for the winning class" is a weird concept to me, but I've seen this several times. Isn't the function output = f(input) continuous with regards to the input? – FluidMechanics Potential Flows May 07 '22 at 12:31
  • not sure if I'm missing something, but the continuity of the final activation function doesn't tell you much about the properties of the function learned by a classifier. – Edoardo Guerriero May 07 '22 at 13:31
  • @EdoardoGuerriero All the functions are continuous right? The last activation function is, the one to get from layer n-1 to n is too, etc.? – FluidMechanics Potential Flows May 08 '22 at 18:48
  • 1
    sure, but again continuity doesn't imply strong constrains. The final mapping learn by a classifier is highly non linear, hence it's not surprising (even though undesired in many situations) that small changes can lad to big differences in predictions. – Edoardo Guerriero May 08 '22 at 20:46
  • Fair enough, but isn't there some way to force the behaviour "small changes in the input -> small changes in the output"? By integrating this into the loss, or maybe by training n networks and picking the one that shows the behaviour the most, if they all converge to similar but not identical local minima? – FluidMechanics Potential Flows May 08 '22 at 21:39
  • 1
    the problem here is how to numerically define "small change". A topic that kinda fit in this discussion but limited to CNN is [shift invariance](https://arxiv.org/abs/2011.14214), in which people try to solve the issue of CNN misclassifying images that are shifted by a few or just one pixel in one direction. That's already a way to force a model to behave constantly with respect to a predefined small change. – Edoardo Guerriero May 09 '22 at 07:12
  • Another, more conceptual, problem is that in general we do want the model to learn some big changes for small input differences; it's the trade-off between generalization and overfitting. So I guess that training such a model would actually require a proper dataset, and indeed a custom loss, with an extra error component that compares, for example, the probability returned for a class with the expected probability we want to observe for each element belonging to that class (a rough sketch of such a term is given after these comments). – Edoardo Guerriero May 09 '22 at 07:17
  • 2
    Just adding to some comments, I would think that in general, a classification model might not be the best type of model to use for intra-class variations, because its task is to be invariant to those. It always output the same result even though two horses look very different. As a general idea, maybe using a VAE model would be more suitable, as these models are build to capture as much detail as possible, right? Still experiments would have to show if this is feasible to use as a score metric like the inception model. – Chillston May 09 '22 at 09:38
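A rough, hypothetical sketch (PyTorch-style; the names and the noise model are invented) of the kind of extra loss component mentioned in the comments above, penalising the model when a small input perturbation produces a large change in the predicted probabilities:

```python
import torch
import torch.nn.functional as F

def loss_with_consistency(model, x, y, noise_std=0.01, lam=1.0):
    logits = model(x)
    ce = F.cross_entropy(logits, y)                      # usual classification loss
    x_perturbed = x + noise_std * torch.randn_like(x)    # a "small change" in the input
    logits_perturbed = model(x_perturbed)
    # extra term: small input change should give a small output change
    consistency = F.mse_loss(F.softmax(logits_perturbed, dim=1), F.softmax(logits, dim=1))
    return ce + lam * consistency
```

This is essentially a consistency-regularization term; how to define the perturbation (the "small change") is exactly the open issue discussed in the comments.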