I've been training a VAE to reconstruct human names. When I train it with a batch size of 100+, after about 5 hours of training it tends to output the same thing regardless of the input (I'm using teacher forcing as well). When I used a batch size of 1, it severely overfit, while a batch size of 16 gave much better generalization. Is there something about VAEs that would make this happen, or is it specific to my problem?
1 Answer
My response is based on my limited experience with VAEs:

These networks draw a random sample z conditioned on x, and the decoder output D(z) is then compared with x via the reconstruction term ||x - D(z)||². If the z drawn for each x in the batch carries no randomness, the network will not train properly. In other words, there should not be a direct, deterministic correspondence between x and D(z); instead, the encoder should output a probability distribution q(z|x) from which z is sampled. I hope that makes sense.
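To make this concrete, here is a minimal PyTorch sketch of the VAE sampling step and loss. It is only an illustration under assumptions of mine: the feed-forward layers stand in for your sequence encoder/decoder, and the layer sizes, the MSE reconstruction term, and the `beta` weight are illustrative, not taken from your setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=64, h_dim=128, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z is a *random sample* from q(z|x),
        # not a deterministic code. This is the randomness referred to
        # above -- remove it and the model degenerates into a plain
        # autoencoder.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    # Reconstruction term ||x - D(z)||^2 ...
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # ... plus KL(q(z|x) || N(0, I)), which keeps the encoder output
    # a distribution rather than a point estimate.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

One design note relevant to your symptom: if the KL term dominates early in training, the decoder can learn to ignore z entirely and emit the same output for every input (posterior collapse); annealing `beta` from 0 upward over the first epochs is a common mitigation.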
