I'm working on research where my supervisor wants to do canonicalization of name data using VAEs, but I don't think it's possible, and I don't know how to show that mathematically. I just know empirically that VAEs don't do well when the latent and observed variables are discrete (to model names, the latent needs to be the character at each index, which can be any ASCII character, so it can only be represented as a categorical distribution). My setup is a VAE with three autoencoders, one latent each for first, middle, and last name, and each one samples every character of its name from the Gumbel-Softmax distribution (a differentiable relaxation of the categorical distribution, parameterized by categorical probabilities). From what I've seen in the original paper, even on the simple problem of MNIST digit generation both the inference and generative networks did worse as the latent dimension increased, and as you can imagine the latent dimension of my problem is quite large. That's the only real argument I have for why this can't work. The other would have been that it's a discrete distribution, but I addressed that by using the Gumbel-Softmax.
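For concreteness, here's a minimal numpy sketch of the per-character sampling I'm describing (the function name, the uniform logits, and the 128-way ASCII alphabet are just illustrative; my actual model does this with a neural net producing the logits):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a 'soft one-hot' sample over categories.

    logits: unnormalized log-probabilities, shape (num_categories,).
    tau: temperature; lower values push the sample closer to a hard one-hot.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF trick: g = -log(-log(u)).
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Softmax of the temperature-scaled perturbed logits is the relaxed sample.
    z = (logits + g) / tau
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One "character" position over a 128-way ASCII alphabet:
logits = np.zeros(128)  # uniform logits, purely for illustration
sample = gumbel_softmax_sample(logits, tau=0.5)
```

The point is that `sample` is a probability vector over characters that is differentiable with respect to `logits`, which is what lets the reparameterization trick go through for a categorical latent.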
This setup isn't working at all: the generated names are total gibberish and training plateaus very early. Are there any mathematical intuitions or reasons why VAEs won't work on a problem like this?
As a note, I've also tried semi-supervised VAEs, and they didn't do much better. I even tried a seq2seq setup that just reconstructs a first name given a first name, and it failed badly too; the output wasn't even close to the original input, let alone a plausible generated name.