I'm trying to understand the concept of Variational Inference for BNNs; my source is this work. The aim is to minimize the divergence between the approximate distribution and the true posterior:
$$\text{KL}(q_{\theta}(w) \,\|\, p(w \mid D)) = \int q_{\theta}(w) \, \log \frac{q_{\theta}(w)}{p(w \mid D)} \, dw$$
This can be expanded out as $$-F[q_{\theta}] + \log p(D)$$ where $$F[q_{\theta}] = -\text{KL}(q_{\theta}(w) \,\|\, p(w)) + \mathbb{E}_{q_{\theta}(w)}[\log p(D \mid w)]$$
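Writing out the intermediate step with Bayes' rule, $p(w \mid D) = p(D \mid w)\,p(w)/p(D)$, this is
$$\begin{aligned}
\text{KL}(q_{\theta}(w) \,\|\, p(w \mid D))
&= \mathbb{E}_{q_{\theta}(w)}\!\left[\log q_{\theta}(w) - \log p(w \mid D)\right] \\
&= \mathbb{E}_{q_{\theta}(w)}\!\left[\log q_{\theta}(w) - \log p(w) - \log p(D \mid w)\right] + \log p(D) \\
&= \underbrace{\text{KL}(q_{\theta}(w) \,\|\, p(w)) - \mathbb{E}_{q_{\theta}(w)}[\log p(D \mid w)]}_{-F[q_{\theta}]} + \log p(D).
\end{aligned}$$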
Because $\log p(D)$ does not contain any variational parameters, its derivative with respect to $\theta$ is zero, so minimizing the KL is equivalent to maximizing $F[q_{\theta}]$ (the ELBO). I would really like to summarize the concept of VI in words.
How can one explain the last formula intuitively, in words, and in particular the fact that one approximates a distribution without really knowing it / being able to compute it?
My attempt would be: minimizing the KL between the approximate distribution and the true posterior boils down to minimizing the KL between the approximate distribution and the prior (?) while maximizing the expected log-likelihood of the data under weights drawn from the approximate distribution. Is this roughly correct?
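To make the two terms concrete for myself, here is a minimal sketch (my own toy example, not from the source) of a Monte Carlo estimate of $F[q_{\theta}]$ for a mean-field Gaussian $q_{\theta}(w)$ and a small Bayesian linear model; names like `mu`, `rho`, and `noise_std` are purely illustrative:

```python
# Sketch of F[q_theta] = -KL(q_theta(w) || p(w)) + E_q[log p(D | w)] for a toy
# Bayesian linear model: q_theta(w) = N(mu, diag(sigma^2)), prior p(w) = N(0, I),
# likelihood p(D | w) = prod_i N(y_i | x_i^T w, noise_std^2).
# All variable names here are illustrative assumptions, not from the source.
import numpy as np

rng = np.random.default_rng(0)

# Toy data D = (X, y)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
noise_std = 0.1
y = X @ true_w + noise_std * rng.normal(size=50)

# Variational parameters theta = (mu, rho); sigma = softplus(rho) keeps sigma > 0
mu = np.zeros(3)
rho = np.full(3, -3.0)

def elbo(mu, rho, n_samples=10):
    sigma = np.log1p(np.exp(rho))  # softplus
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    # Monte Carlo estimate of E_q[log p(D | w)] via reparameterization w = mu + sigma * eps
    exp_loglik = 0.0
    for _ in range(n_samples):
        eps = rng.normal(size=mu.shape)
        w = mu + sigma * eps
        resid = y - X @ w
        exp_loglik += np.sum(-0.5 * (resid / noise_std) ** 2
                             - 0.5 * np.log(2 * np.pi * noise_std**2))
    exp_loglik /= n_samples
    # F[q_theta] = -KL(q || prior) + E_q[log p(D | w)]  (to be maximized)
    return -kl + exp_loglik

print(elbo(mu, rho))
```

As I understand it, maximizing this quantity over `mu` and `rho` (e.g. by gradient ascent through the reparameterized samples) is exactly the trade-off above: stay close to the prior while explaining the data well.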