I answered another question here about the mean prediction of a GP, but I am having a hard time coming up with an intuitive explanation of the variance prediction.
The specific equation I am speaking of is equation 2.26 in the Gaussian process book (Rasmussen & Williams, *Gaussian Processes for Machine Learning*):
$$ \mathbb{V}[f_*] = k(x_*, x_*) - k_{x_*}^\top K_{xx}^{-1} k_{x_*} $$
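To make the equation concrete, here is the minimal numpy sketch I have been using to read it; the toy inputs and the `rbf` helper are my own, not from the book:

```python
import numpy as np

# Toy 1-D training inputs and a single test point (my own choice of values).
X = np.array([-1.0, 0.0, 1.5])
x_star = 0.3

def rbf(a, b):
    """Unit-variance RBF kernel: k(a, b) = exp(-0.5 * ||a - b||^2)."""
    return np.exp(-0.5 * (a - b) ** 2)

K_xx = rbf(X[:, None], X[None, :])   # 3x3 covariance among training inputs
k_star = rbf(X, x_star)              # covariances between training inputs and x*

# Equation 2.26: prior variance at x* minus the quadratic-form term.
var_f_star = rbf(x_star, x_star) - k_star @ np.linalg.solve(K_xx, k_star)
print(var_f_star)
```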
I have a number of questions about this...
If $k(x_*, x_*)$ is the kernel function evaluated at a single point $x_*$ against itself, won't this value always be 1 (assuming an RBF kernel with unit signal variance), since any point has covariance 1 with itself: $k(x, x) = \exp\{-\frac{1}{2}\| x - x \|^2\} = e^0 = 1$?
If the kernel value $k(x_*, x_*)$ is indeed one for any single arbitrary point, then how should I interpret the last term on the RHS? I know $K_{xx}^{-1}k_{x_*}$ is the solution to a linear system $Ax = b$, i.e. the vector that $K_{xx}$ maps onto $k_{x_*}$, but beyond that my intuition breaks down and I cannot explain any further.
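For what it's worth, this is how I compute that term numerically (same toy setup as above; `alpha` is my own name for the solve result, not notation from the book):

```python
import numpy as np

X = np.array([-1.0, 0.0, 1.5])
x_star = 0.3

def rbf(a, b):
    return np.exp(-0.5 * (a - b) ** 2)

K_xx = rbf(X[:, None], X[None, :])
k_star = rbf(X, x_star)

# alpha solves K_xx @ alpha = k_star, i.e. alpha = K_xx^{-1} k_star,
# without forming the explicit inverse. Equivalently, alpha holds the
# weights that express k_star as a combination of the columns of K_xx.
alpha = np.linalg.solve(K_xx, k_star)
reduction = k_star @ alpha           # the amount subtracted from the prior variance
print(alpha, reduction)
```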
If the kernel value $k(x_*, x_*)$ is indeed one for any single arbitrary point, can we view the whole expression as the prior variance being reduced by some measure of similarity between the test point and the training points?
Is it ever possible for this variance to be greater than 1? Or is the prior variance of 1 a maximum, which can only be reduced by observing more data?
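In case it helps, here is the numerical check I ran (same toy setup as above): sweeping the test point across the input range, the variance seems pinned to $[0, 1]$, shrinking toward 0 at the training inputs and recovering toward 1 far from them, which is what I would expect if the subtracted quadratic form is nonnegative.

```python
import numpy as np

X = np.array([-1.0, 0.0, 1.5])       # same toy training inputs as above

def rbf(a, b):
    return np.exp(-0.5 * (a - b) ** 2)

K_xx = rbf(X[:, None], X[None, :])

# Sweep test points across the input range and report the predictive variance.
for x_star in np.linspace(-5.0, 5.0, 11):
    k_star = rbf(X, x_star)
    var = 1.0 - k_star @ np.linalg.solve(K_xx, k_star)
    print(f"x* = {x_star:+.1f}  var(f*) = {var:.4f}")
```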