It's mentioned here that there is no measure of intra-class diversity with the inception score:
If your generator generates only one image per classifier image class, repeating each image many times, it can score highly (i.e. there is no measure of intra-class diversity)
However, isn't it "easy" to look at the variance of the classifier's outputs within a given class? For example, if the classifier outputs 0.97 for every generated image of a class, there is no intra-class diversity, whereas outputs of 0.97, 0.95, 0.99 and 0.92 would indicate some diversity. I'm struggling to understand why this is hard to do (but I might be missing something!).
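To make the idea concrete, here is a minimal sketch of the check I have in mind, next to the usual inception score formula exp(E_x[KL(p(y|x) || p(y))]). Everything in it is an assumption for illustration: the function names `inception_score` and `per_class_confidence_variance` are made up, the softmax rows are hand-written toy values rather than outputs of a real Inception network, and there are only 3 classes.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (N, K) array of softmax outputs p(y|x)."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def per_class_confidence_variance(probs):
    """Variance of the top-class confidence, grouped by predicted class.

    Hypothetical helper: this is the check proposed in the question,
    not part of the inception score itself.
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)  # predicted class of each sample
    top = probs.max(axis=1)        # confidence of that prediction
    return {int(k): float(top[labels == k].var()) for k in np.unique(labels)}

# Toy "collapsed" generator: one image per class, each repeated 4 times,
# so every sample of a class gets exactly the same softmax row.
K = 3
one_per_class = np.full((K, K), 0.015)
np.fill_diagonal(one_per_class, 0.97)
collapsed_probs = np.repeat(one_per_class, 4, axis=0)

collapsed_score = inception_score(collapsed_probs)
collapsed_var = per_class_confidence_variance(collapsed_probs)

# Toy "varied" samples of class 0 with the confidences from the question.
confidences = np.array([0.97, 0.95, 0.99, 0.92])
rest = (1.0 - confidences) / 2.0  # split the remaining mass over 2 classes
varied_probs = np.stack([confidences, rest, rest], axis=1)
varied_var = per_class_confidence_variance(varied_probs)

print("collapsed inception score:", collapsed_score)
print("collapsed per-class variance:", collapsed_var)
print("varied per-class variance:", varied_var)
```

In this toy setup the collapsed generator scores close to the maximum of 3 (the number of classes) while its per-class confidence variance is zero, and the varied confidences give a small positive variance. What I can't tell is whether that variance says anything reliable about how different the images themselves are.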