The problem I'm trying to solve is as follows.
I have two separate domains, where inputs do not have the same dimensions. However, I want to create a common feature space between both domains using paired inputs (similar inputs from both domains).
My solution is to encode pairs of inputs into a shared latent space using two VAE encoders (one for each domain). To ensure that the latent space is shared, I want to define a similarity metric between the output of both probabilistic encoders.
Let's define the first encoder as $q_\phi$ and the second as $p_\theta$. For now, I have two main candidates for this similarity metric:
KL divergence: $\text{KL}(p \| q)$ (or $\text{KL}(q \| p)$), but since it is not symmetric, I don't know which direction is best.
JS divergence: symmetric and bounded, which is nice for a distance metric, but it is less common than KL in VAE training, so I'm not sure.
Other candidates include an adversarial loss (a discriminator is tasked with guessing which VAE a latent code came from, the goal of both VAEs being to maximally confuse it) or mutual information (seen more and more in recent work, but I don't fully understand it yet).
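To make the first two candidates concrete, here is a minimal sketch, assuming both encoders output diagonal Gaussians (as in a standard VAE). The KL between two diagonal Gaussians has a closed form, and symmetrizing it (the Jeffreys divergence) removes the direction choice; JS between Gaussians has no closed form and is usually estimated by sampling, so it is omitted here. Function names are mine, not from any library:

```python
import numpy as np

def kl_diag_gauss(mu1, logvar1, mu2, logvar2):
    """Closed-form KL( N(mu1, diag(exp(logvar1))) || N(mu2, diag(exp(logvar2))) ),
    summed over latent dimensions."""
    var1, var2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * np.sum(logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetric_kl(mu1, logvar1, mu2, logvar2):
    """Symmetrized KL (Jeffreys divergence): sidesteps the direction question."""
    return 0.5 * (kl_diag_gauss(mu1, logvar1, mu2, logvar2)
                  + kl_diag_gauss(mu2, logvar2, mu1, logvar1))
```

The symmetrized version is one pragmatic answer to the "which direction?" problem, at the cost of losing the mode-seeking/mean-seeking distinction between the two directions.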
My question is: which loss would work best for my use case? KL or JS? Or another candidate I haven't considered?
-- More context --
My ultimate goal is transfer learning between morphologically distinct robots, e.g. a quadrupedal robot and a bipedal robot. The first step in my current approach is to record trajectories of both robots executing the same task (walking, for example). From these trajectories I create pairs of similar states (to simplify the problem, I assume both robots perform the task at the same speed, so temporally aligned states from the two robots are paired). My goal is then to encode these paired states (which don't have the same dimension, due to the different number of joints) into two latent spaces (one for each VAE) such that paired inputs end up close in latent space.

If I were working with plain autoencoders, I would simply minimize the latent-space distance between paired inputs, so that similar states on both robots map to the same point. But I need the generative capabilities of a VAE, so instead I would like to make the distributions output by the two encoders as close as possible. Does that make sense?