The problem I'm trying to solve is as follows.
I have two separate domains, where inputs do not have the same dimensions. However, I want to create a common feature space between both domains using paired inputs (similar inputs from both domains).
My solution is to encode pairs of inputs into a shared latent space using two VAE encoders (one for each domain). To ensure that the latent space is shared, I want to define a similarity metric between the output of both probabilistic encoders.
Let's define the first encoder as $q_\phi$ and the second as $p_\theta$. For now, I have two main candidates for this similarity metric:
KL divergence: $\text{KL}(p \| q)$ (or $\text{KL}(q \| p)$), but since it is not symmetric, I don't know which direction is best.
JS divergence: symmetric and bounded, which is nice for a distance metric, but it is less common than KL in VAE training, so I'm not sure.
Other candidates include an adversarial loss (a discriminator is tasked with guessing which VAE a latent code came from, the goal of both VAEs being to maximally confuse it) or mutual information (seen more and more in recent work, but I don't fully understand it yet).
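To make the first two candidates concrete, here is a minimal sketch, assuming both encoders output diagonal Gaussians (as in a standard VAE). The KL between two diagonal Gaussians has a closed form, and symmetrizing it (the Jeffreys divergence) removes the direction choice; JS between Gaussians has no closed form and is usually estimated by sampling, so it is omitted here. Function names are mine, not from any library:

```python
import numpy as np

def kl_diag_gauss(mu1, logvar1, mu2, logvar2):
    """Closed-form KL( N(mu1, diag(exp(logvar1))) || N(mu2, diag(exp(logvar2))) ),
    summed over latent dimensions."""
    var1, var2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * np.sum(logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetric_kl(mu1, logvar1, mu2, logvar2):
    """Symmetrized KL (Jeffreys divergence): sidesteps the direction question."""
    return 0.5 * (kl_diag_gauss(mu1, logvar1, mu2, logvar2)
                  + kl_diag_gauss(mu2, logvar2, mu1, logvar1))
```

The symmetrized version is one pragmatic answer to the "which direction?" problem, at the cost of losing the mode-seeking/mean-seeking distinction between the two directions.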
My question is: which loss would work best for my use case? KL or JS? Or another candidate I haven't considered?
-- More context --
My ultimate goal is transfer learning between morphologically distinct robots, e.g. a quadrupedal robot and a bipedal robot. The first step in my current approach is to record trajectories of both robots executing the same task (walking, for example). From these trajectories I create pairs of similar states (to simplify the problem, I assume both robots perform the task at the same speed, so temporally aligned states from the two robots are paired). My goal is then to encode these paired states (which don't have the same dimension, due to the different number of joints) into two latent spaces (one for each VAE) such that paired inputs end up close in latent space.

If I were working with plain autoencoders, I would simply minimize the latent-space distance between paired inputs, so that similar states on both robots map to the same point. But I need the generative capabilities of a VAE, so instead I would like to make the distributions output by the two encoders as close as possible. Does that make sense?