[Answering my own question after 5 months of studying VAE models]
The point of the MMD-VAE (InfoVAE) is not primarily to improve the visual quality of generated samples. It is to preserve more information through the encoding process. The MMD formulation stems from introducing a mutual information term into the Evidence Lower BOund (ELBO) loss of VAEs; refer to the paper's appendices for the full derivation. This formulation increases the information content of the latent space and gives a more accurate approximation of the true posterior. Both results are also demonstrated empirically in the paper.
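For intuition, the MMD term compares a batch of encoded latents against a batch of samples from the prior. Below is a minimal sketch, assuming a Gaussian (RBF) kernel with a fixed bandwidth and the simple biased estimator. The function names, the bandwidth and the toy 2-D data are my own illustrative choices, not taken from the paper's code.

```python
import math
import random

def gaussian_kernel(a, b, sigma=1.0):
    """RBF kernel between two equal-length vectors (sigma is an assumed bandwidth)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))

def sample_gaussian(rng, n, dim, mean):
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
prior = sample_gaussian(rng, 200, 2, 0.0)    # samples from p(z) = N(0, I)
matched = sample_gaussian(rng, 200, 2, 0.0)  # stand-in for latents that match the prior
shifted = sample_gaussian(rng, 200, 2, 3.0)  # stand-in for latents that drift from the prior

mmd_matched = mmd_squared(matched, prior)
mmd_shifted = mmd_squared(shifted, prior)
print(mmd_matched, mmd_shifted)  # the shifted set gives a much larger value
```

In an MMD-VAE this quantity, computed between encoder outputs and prior samples, would be added to the reconstruction loss with some weight. Note that it only constrains the latent distribution and does nothing about how the reconstruction error itself is measured.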
However, the InfoVAE uses a pixel-wise (element-wise) reconstruction loss. An element-wise reconstruction loss is likely to lead to some degree of blurriness irrespective of the prior loss term. On GitHub, several developers have implemented the InfoVAE model and shown their results. Here is a link to one such implementation whose results I could personally verify.
From my own experiments, I can say that even though InfoVAE may give better reconstructions for some data, there is still considerable blurriness.
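One way to see why an element-wise loss tends to blur, whatever the prior term looks like: when the decoder is uncertain between several sharp outputs, the output that minimises the expected squared error is their pixel-wise average. Here is a toy sketch with a made-up 1-D "image" containing an edge whose position is ambiguous (the data and names are purely illustrative):

```python
def mse(a, b):
    """Element-wise mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two equally plausible sharp targets: the same edge at two positions.
edge_left = [0, 0, 0, 1, 1, 1, 1, 1]
edge_right = [0, 0, 0, 0, 0, 1, 1, 1]
targets = [edge_left, edge_right]

def expected_mse(recon):
    """Average MSE of one reconstruction against all plausible targets."""
    return sum(mse(recon, t) for t in targets) / len(targets)

# Pixel-wise mean of the targets: the edge becomes a ramp of 0.5 values.
blurry = [sum(px) / len(targets) for px in zip(*targets)]

print(blurry)                   # [0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0]
print(expected_mse(edge_left))  # 0.125
print(expected_mse(blurry))     # 0.0625
```

The blurred output scores better than either sharp one, so a model trained purely on this loss is pushed towards the blurred answer.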
The alternative to an element-wise loss is a perceptual similarity metric, which may be learned or computed as a static function of the input image. With a learned perceptual loss, VAEs can produce much sharper images. PixelVAE and VAEGAN are well-known models with such implementations. For a static function of the image itself, reconstruction quality will depend on the nature of that function, and such a model may not be useful for all kinds of datasets. With measures such as SSIM or FSIM, we may still end up with blurred images.
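As an example of what a "static function of the image" looks like, here is a simplified sketch of the SSIM formula computed over a single global window. This is an assumption for brevity: the usual SSIM is averaged over local windows, and the constants below follow the common defaults for a dynamic range of 1. The signals are toy data.

```python
def ssim_global(x, y, dynamic_range=1.0):
    """SSIM formula applied to one global window (simplified, not windowed SSIM)."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

sharp = [0, 0, 0, 0, 1, 1, 1, 1]
blurred = [0, 0, 0, 0.25, 0.75, 1, 1, 1]  # same edge, slightly smoothed

score_same = ssim_global(sharp, sharp)
score_blur = ssim_global(sharp, blurred)
print(score_same, score_blur)
```

In this toy case the smoothed edge still scores close to the identical signal, which fits my experience that such fixed measures do not penalise mild blur very strongly.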