
I'm trying to understand the concept of Variational Inference (VI) for BNNs. My source is this work. The aim is to minimize the divergence between the approximate distribution and the true posterior:

$$\text{KL}(q_{\theta}(w) \,||\, p(w \mid D)) = \int q_{\theta}(w) \, \log \frac{q_{\theta}(w)}{p(w \mid D)} \, dw$$

This can be expanded out as $$-F[q_{\theta}] + \log p(D),$$ where $$F[q_{\theta}] = -\text{KL}(q_{\theta}(w) \,||\, p(w)) + \mathbb{E}_{q_{\theta}(w)}[\log p(D \mid w)].$$
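Writing the expansion out explicitly (substituting Bayes' rule, $p(w \mid D) = p(D \mid w)\,p(w)/p(D)$, into the integral):

$$
\begin{aligned}
\text{KL}(q_{\theta}(w) \,||\, p(w \mid D)) &= \int q_{\theta}(w) \, \log \frac{q_{\theta}(w) \, p(D)}{p(D \mid w) \, p(w)} \, dw \\
&= \text{KL}(q_{\theta}(w) \,||\, p(w)) - \mathbb{E}_{q_{\theta}(w)}[\log p(D \mid w)] + \log p(D) \\
&= -F[q_{\theta}] + \log p(D)
\end{aligned}
$$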

Because $\log p(D)$ does not contain any variational parameters, its derivative with respect to them is zero. I really would like to summarize the concept of VI in words.

How can one explain the last formula intuitively in words, and with it the fact that one approximates a distribution without really knowing it / being able to compute it?

My attempt would be: minimizing the KL between the approximate distribution and the true posterior boils down to minimizing the KL between the approximate distribution and the prior (?) and maximizing the log-likelihood of the data under parameters drawn from the approximate distribution. Is this somehow correct?


1 Answer


Your description of what is going on is more or less correct, although I am not completely sure that you have really understood it, given your last question.

So, let me enumerate the steps.

  1. The computation of the posterior is often intractable (given that the evidence, i.e. the denominator of the right-hand side of Bayes' rule, might be numerically expensive to approximate/compute, or there's no closed-form solution)

  2. To address this intractability, you cast the Bayesian inference problem (i.e. the application of Bayes' rule) as an optimization problem

    1. You assume that you can approximate the posterior with another simpler distribution (e.g. a Gaussian), known as the variational distribution

    2. You formulate this optimization problem as the minimization of some notion of distance (e.g. the KL divergence) between the posterior and the VD

    3. However, the KL divergence between the posterior and the VD turns out to be intractable too, given that, if you expand it, you will find out that there's still an evidence term

    4. Therefore, you use a tractable surrogate (i.e. equivalent, up to some constant) objective function, which is known as the evidence lower bound (ELBO) (which is sometimes known as the variational free energy), which is the sum of 2 terms

      1. the (negative of the) KL divergence between the VD and the prior
      2. the expected log-likelihood of the data under the VD (both spelled out right after this list)
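
Putting 1. and 2. together with the notation from your question, the surrogate objective is

$$F[q_{\theta}] = \mathbb{E}_{q_{\theta}(w)}[\log p(D \mid w)] - \text{KL}(q_{\theta}(w) \,||\, p(w)),$$

and maximizing $F[q_{\theta}]$ is equivalent to minimizing $\text{KL}(q_{\theta}(w) \,||\, p(w \mid D))$, because the two differ only by the constant $\log p(D)$.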

To address your last doubt/question: the ELBO does not contain the posterior (i.e. what you really want to find), but only the variational distribution (which you choose!), the prior (which you also define/choose), and the likelihood (which, in practice, corresponds to the usual cross-entropy loss; so the only extra thing you need, with respect to traditional neural networks, is the computation of the KL divergence). In other words, you originally formulate the problem as the minimization of the KL divergence between the posterior and the VD, but this is just a formulation.
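
To make this concrete, here is a minimal sketch (my own toy example in PyTorch, not taken from the work you cite; the names `BayesianLinear` and `kl_to_prior` are just illustrative): a single Bayesian linear layer with a diagonal-Gaussian variational distribution over the weights, a standard-normal prior, and the negative ELBO (cross-entropy plus KL to the prior) as the training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BayesianLinear(nn.Module):
    """A linear layer whose weights are sampled from q_theta(w) = N(mu, sigma^2)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational parameters theta = (mu, rho); sigma = softplus(rho) keeps sigma > 0.
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        # Reparameterization trick: w = mu + sigma * eps with eps ~ N(0, I), so gradients
        # flow back to mu and rho through a single Monte Carlo sample of the weights.
        sigma = F.softplus(self.w_rho)
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return x @ w.t()

    def kl_to_prior(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights.
        sigma = F.softplus(self.w_rho)
        return (0.5 * (sigma ** 2 + self.w_mu ** 2 - 1.0) - torch.log(sigma)).sum()


# Toy usage: 2-class classification on random data standing in for D.
torch.manual_seed(0)
x, y = torch.randn(128, 10), torch.randint(0, 2, (128,))
layer = BayesianLinear(10, 2)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)

for step in range(200):
    logits = layer(x)
    nll = F.cross_entropy(logits, y, reduction="sum")  # 1-sample estimate of -E_q[log p(D|w)]
    kl = layer.kl_to_prior()                           # KL(q_theta(w) || p(w))
    loss = nll + kl                                    # negative ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Everything here except the weight sampling and `kl_to_prior` is what you would write for an ordinary (non-Bayesian) classifier, which is exactly the point above: the extra ingredients are the KL term and sampling $w$ instead of using a point estimate.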

  • thanks! So in fact, instead of using a point estimator or the fully Bayesian approach, you still want to have a distribution (and not point estimates), but instead of applying true Bayesian inference, you formulate the task as an optimization. The intuitive explanation of the free-energy term might then be something like: we of course want to increase the likelihood of the parameters given the data, but at the same time want to stay close to the priors? Optimizing this term gives us a good approximation of the true posterior – f_3464gh Mar 22 '21 at 11:16
  • @manu Yes, exactly, I would say that's the typical intuitive interpretation of what is going on when you optimize the ELBO. – nbro Mar 22 '21 at 11:18
  • Alright, thanks so much! Just one more question - you said that the ELBO is equivalent to the KL up to some constant. What exactly is the constant? Or does the marginal likelihood that drops out when calculating the derivative refer to this constant? – f_3464gh Mar 22 '21 at 11:25
  • The constant is the (log of the) evidence, as you had said in the original version of the post. The ELBO is **equal to** (the **negative** of the) "KL divergence between the posterior and the VD" + "the evidence": that's why **maximizing** the ELBO is equivalent to the original optimization problem of **minimizing** the "KL divergence between the posterior and the VD". As you also had said, the evidence is a constant with respect to the **variational parameters**, i.e. the mean and variance of the VD, which is what you want to find. – nbro Mar 22 '21 at 11:29