
Problem setting

We have to perform binary classification given a training dataset $D$ in which most items belong to class $A$ and only a few belong to class $B$, so the classes are heavily imbalanced.

Approach

We wanted to use a GAN to produce more samples of class $B$, so that our final classification model has a nearly balanced set to train on.

Problem

Let's say the data from classes $A$ and $B$ are very similar. Since we want the GAN to produce synthetic class-$B$ data, we feed the real $B$ samples we have into the discriminator alongside the generated samples. However, because $A$ and $B$ are similar, the generator might produce an item $x$ that would naturally belong to class $A$. Since the discriminator has never seen class-$A$ items before and the two classes are very close, it could judge $x$ to be part of the original data it was shown. So the generator has successfully fooled the discriminator into believing that $x$ belongs to the original class-$B$ data, while $x$ actually belongs to class $A$.
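To make the setup concrete, here is a minimal sketch of the training step described above (PyTorch-style; `G`, `D`, the optimizers, and `b_loader` are hypothetical placeholders, not code we actually have). Note that the discriminator only ever sees real class-$B$ samples and generated samples, so nothing in this loss discourages the generator from drifting towards class $A$:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, b_loader, z_dim=64):
    # Real minibatch: class-B samples only; class A never enters this loop.
    x_b = next(iter(b_loader))
    z = torch.randn(x_b.size(0), z_dim)
    x_fake = G(z)

    ones = torch.ones(x_b.size(0), 1)
    zeros = torch.zeros(x_b.size(0), 1)

    # Discriminator update: real B vs. generated.
    d_loss = (F.binary_cross_entropy_with_logits(D(x_b), ones)
              + F.binary_cross_entropy_with_logits(D(x_fake.detach()), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fooling D is enough, even if x_fake looks like class A.
    g_loss = F.binary_cross_entropy_with_logits(D(x_fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```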

If the GAN keeps producing items like this, the generated data is useless, since it would add heavy noise to the original data if combined with it.

At the same time, suppose that, before we start training the generator, we show the discriminator our class-$A$ and class-$B$ samples and tell it (through backprop) that the class-$A$ items are not part of class $B$. The discriminator would then learn to reject class-$A$ items that are fed to it. But wouldn't that mean the discriminator has just become the classification model we wanted to build in the first place, i.e. one that distinguishes between class $A$ and class $B$?
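One way to read the idea above is to simply add the real class-$A$ samples to the discriminator's negative examples. A hedged sketch of that variant (same hypothetical placeholders as before, plus an assumed `a_loader` of class-$A$ samples):

```python
import torch
import torch.nn.functional as F

def d_step_with_a_negatives(G, D, opt_d, a_loader, b_loader, z_dim=64):
    x_a = next(iter(a_loader))          # real class-A samples
    x_b = next(iter(b_loader))          # real class-B samples
    x_fake = G(torch.randn(x_b.size(0), z_dim)).detach()

    pos = F.binary_cross_entropy_with_logits(
        D(x_b), torch.ones(x_b.size(0), 1))
    neg = (F.binary_cross_entropy_with_logits(
               D(x_fake), torch.zeros(x_b.size(0), 1))
           + F.binary_cross_entropy_with_logits(
               D(x_a), torch.zeros(x_a.size(0), 1)))
    d_loss = pos + neg
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # D now doubles as an A-vs-B classifier with extra "reject fakes"
    # pressure, which is exactly the concern raised in the question.
```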

Do you know of a solution to the problem stated above, or can you refer me to a paper or other posts on this?

  • You cannot really use GANs like this to "bootstrap" or augment a dataset for supervised learning. The quality of your generated samples is just as limited by the lack of data as training directly on the supervised learning set - really it is no different in concept from using the labels from one supervised learner to train another. Or mounting a fan on a sailing boat to blow into the sail . . . – Neil Slater Oct 27 '18 at 16:57
  • @NeilSlater Do you think it is possible to use Conditional GANs as proposed by Mirza and Osindero? – frederik Oct 28 '18 at 01:17
  • I doubt it is possible, but I have not read the work on Conditional GANs. I suspect that about the only thing you would achieve is some regularisation (which may help prevent over-fitting), but other regularisation techniques would be easier to implement than using generative models in your case IMO. If you can find some way to inject and use far more data into your generative model than you have available for the supervised model, then *that* might change things for you, and make it worthwhile. – Neil Slater Oct 28 '18 at 09:26

1 Answer


In my experience, GANs work really well in the semi-supervised setting, where you don't necessarily have labels for all of your class-$B$ data but you do have a balanced dataset. In my (limited) experience, you do need a roughly equal number of $A$ and $B$ objects, even if you are not sure of all the labels.
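A common construction for GAN-based semi-supervised learning (in the spirit of Salimans et al., 2016, "Improved Techniques for Training GANs"; this is my reading, not something stated in the answer) gives the discriminator one output per real class plus a "generated" class. A rough sketch, with all names and the two-class setup assumed for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical discriminator head with 3 outputs: class A, class B, generated.
# Labeled data uses cross-entropy over {A, B}; unlabeled and generated data use
# the real-vs-generated split, following the K+1-class construction.
def ss_d_loss(D, x_labeled, y_labeled, x_unlabeled, x_fake):
    logits_l = D(x_labeled)                        # shape (N, 3)
    supervised = F.cross_entropy(logits_l[:, :2], y_labeled)

    def p_real(logits):                            # P(sample is not generated)
        probs = F.softmax(logits, dim=1)
        return probs[:, :2].sum(dim=1)

    unsup_real = -torch.log(p_real(D(x_unlabeled)) + 1e-8).mean()
    unsup_fake = -torch.log(1.0 - p_real(D(x_fake)) + 1e-8).mean()
    return supervised + unsup_real + unsup_fake
```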

And yes, GANs can overfit to outliers as well, especially when you do not have many examples, so be cautious.

Currently, the setup that works best for me (in terms of GANs) is WGAN-GP or WGAN-LP in combination with Optimistic Mirror Descent Adam (here, $N_{\text{critic}}/N_{\text{actor}} = 1$). Take a look at the paper by Adiwardana et al., especially Fig. 7 (astonishing!), for semi-supervised learning with a limited number of class labels.
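For completeness, the gradient-penalty term that distinguishes WGAN-GP from the original WGAN looks roughly like this (a sketch only; the critic `D` and the real/fake batches are assumed placeholders, and the Optimistic Adam optimizer is not shown):

```python
import torch

# Gradient penalty from WGAN-GP (Gulrajani et al., 2017): penalize the critic's
# gradient norm on random interpolations between real and generated samples.
def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Per-sample mixing weight, broadcast over all remaining dimensions.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)))
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)

    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()   # push ||grad|| towards 1
```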
