How does:
$$\text{Var}(y) \approx \sigma^2 + \frac{1}{T}\sum_{t=1}^T f^{\hat{W}_t}(x)^T f^{\hat{W}_t}(x) - E(y)^T E(y)$$
approximate variance?
I'm currently reading "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", and the authors give the above formula as an approximate estimate of the variance. I'm confused about how it approximates $\frac{\sum(y-\bar{y})^2}{N-1}$.

In the above equation, they're using a Bayesian neural network to quantify uncertainty. $\sigma^2$ is the predictive variance (I'm kind of confused about how they get this), $x$ is the input, and $y$ is the label for the classification. $f^{\hat{W}_t}(\cdot)$ outputs the mean of a Gaussian distribution, with $\sigma$ being the standard deviation of that distribution, and $T$ is a predefined number of samples, since the expectation is evaluated using Monte Carlo sampling.
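For reference, here is a minimal NumPy sketch of how I understand this quantity would be computed for a scalar output. The array `preds`, the variable `sigma`, and the toy numbers are my own for illustration and not from the paper; I'm assuming $T$ stochastic forward passes through the network with dropout kept active at test time.

```python
import numpy as np

# Toy illustration (numbers made up): T stochastic forward passes
# through the same network, dropout kept on at test time.
T = 5
preds = np.array([2.1, 1.9, 2.3, 2.0, 1.8])  # f^{W_t}(x) for t = 1..T
sigma = 0.5                                   # assumed noise standard deviation

# Predictive mean: E(y) ~= (1/T) * sum_t f^{W_t}(x)
mean_y = preds.mean()

# Second moment minus squared mean of the T predictions:
# (1/T) * sum_t f^{W_t}(x)^T f^{W_t}(x) - E(y)^T E(y)
epistemic_var = (preds ** 2).mean() - mean_y ** 2

# Adding sigma^2 gives the total predictive variance in the formula
var_y = sigma ** 2 + epistemic_var

print(mean_y, epistemic_var, var_y)
```

If I'm reading it right, the last two terms are just the second moment minus the squared mean of the $T$ sampled predictions, i.e. the sample variance of the predictions with a $1/T$ normalizer rather than $1/(N-1)$, and then $\sigma^2$ is added on top. Part of my confusion is how this corresponds to the usual $\frac{\sum(y-\bar{y})^2}{N-1}$ estimator.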