
Raul Rojas' Neural Networks: A Systematic Introduction, section 8.1.2, relates off-line and on-line backpropagation to the Gauss-Jacobi and Gauss-Seidel methods for finding the intersection of two lines.

What I can't understand is how the iterations of on-line backpropagation are perpendicular to the (current) constraint. More specifically, how is the gradient of $\frac12(x_1w_1 + x_2w_2 - y)^2$ with respect to $(w_1, w_2)$, which is proportional to $(x_1, x_2)$, normal to the constraint $x_1w_1 + x_2w_2 = y$?
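For reference, writing out the gradient of this error with respect to the weights via the chain rule gives $$\nabla_{(w_1, w_2)} \tfrac12(x_1w_1 + x_2w_2 - y)^2 = (x_1w_1 + x_2w_2 - y)\,(x_1, x_2),$$ which is a scalar multiple of $(x_1, x_2)$, so the direction in question is $(x_1, x_2)$.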


2 Answers


Answer by Theo Bandit on Maths Stack Exchange

If you choose two points $(w_1, w_2), (v_1, v_2)$ along this line, then $$(x_1, x_2) \cdot ((w_1, w_2) - (v_1, v_2)) = x_1 w_1 + x_2 w_2 - (x_1 v_1 + x_2 v_2) = y - y = 0.$$ That is, the direction $(x_1, x_2)$ is perpendicular to any vector lying along the line, i.e. $(x_1, x_2)$ is normal to the line.
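As a quick sanity check, here is a small NumPy sketch (with arbitrary made-up values for $x_1$, $x_2$, $y$) that picks two points on the constraint line and confirms that the dot product with $(x_1, x_2)$ is zero:

```python
import numpy as np

# Arbitrary example values (assumptions, just for illustration).
x1, x2, y = 2.0, -1.0, 3.0
x = np.array([x1, x2])

def point_on_line(w1):
    """Given w1, solve x1*w1 + x2*w2 = y for w2 to get a point on the constraint."""
    w2 = (y - x1 * w1) / x2
    return np.array([w1, w2])

w = point_on_line(0.5)   # first point on the line
v = point_on_line(4.0)   # second point on the line

# (x1, x2) dotted with a direction along the line should be (numerically) zero.
print(np.dot(x, w - v))  # ~0.0, so (x1, x2) is normal to the line
```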


The expression $\frac12(x_1w_1 + x_2w_2 - y)^2$ is called the error $E$ (this assumes $y$ is continuous, which is not the case for classifiers). Written out as one would in physics or maths, it represents a family of curves in 4D (the error is actually continuous, but for visualisation we will treat it as a family of curves).

Here is a representative picture of what it would have looked like had the error been $\frac12(x_1w_1 - y)^2$, i.e. a 3D surface:

[figure: 3D plot of the error surface]
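If you want to regenerate a plot like the one referenced above, here is a minimal matplotlib sketch; the fixed input $x_1$ and the plotting ranges are arbitrary assumptions, and the axes ($w_1$ and $y$) follow this answer's later suggestion of putting the error on the $z$ axis:

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = 2.0  # arbitrary fixed input (assumption)

# Grid over the weight w1 and the target y.
w1, y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
error = 0.5 * (x1 * w1 - y) ** 2

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(w1, y, error, cmap="viridis")
ax.set_xlabel("$w_1$")
ax.set_ylabel("$y$")
ax.set_zlabel("Error $E$")
plt.show()
```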

This is a scalar quantity that gives the value of the error at different points, i.e. for different values of $w_1$ and $w_2$. The gradient of a scalar $F$ is defined as $\nabla F$; applying this operation yields a vector that is perpendicular to the equipotential, or more suitably equi-error, surface. That is, if you trace all the points which give the same error, you get a curve, and the gradient at any point is the vector perpendicular to that curve at that point. There are many proofs of this, but here is a very simple and nice proof.
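To make this concrete, here is a small NumPy sketch (again with made-up values) that takes a direction along one equi-error line of $E(w_1, w_2) = \frac12(x_1w_1 + x_2w_2 - y)^2$ and checks that the gradient is perpendicular to it:

```python
import numpy as np

# Arbitrary example values (assumptions, just for illustration).
x1, x2, y = 2.0, -1.0, 3.0

def grad_error(w1, w2):
    # Chain rule: dE/dw_i = (x1*w1 + x2*w2 - y) * x_i
    r = x1 * w1 + x2 * w2 - y
    return np.array([r * x1, r * x2])

# Every equi-error line has the form x1*w1 + x2*w2 = const, so its
# direction is (-x2, x1), because (x1, x2) . (-x2, x1) = 0.
along_level_curve = np.array([-x2, x1])

w = np.array([1.0, 4.0])             # some point in weight space
g = grad_error(*w)

print(np.dot(g, along_level_curve))  # ~0.0: gradient is normal to the equi-error line
```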

Now let's look at the equation of the constraint $x_1w_1 + x_2w_2 = y$. In the case of the 3D error curve, the constraint gives us a plane that is parallel to the tangential plane of the equi-error surface at a given point. You can look up this method of finding tangential planes and derive the plane yourself, with $z = E$ and with $w_1$ and $y$ playing the roles of $x$ and $y$.
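To see the parallelism explicitly in the two weights $w_1, w_2$ alone (a sketch, not the full 3D construction described above): a level set $E = c$ satisfies $$\tfrac12(x_1w_1 + x_2w_2 - y)^2 = c \iff x_1w_1 + x_2w_2 = y \pm \sqrt{2c},$$ so every equi-error set is a pair of lines parallel to the constraint $x_1w_1 + x_2w_2 = y$ (the constraint itself being the level set $c = 0$), and they all share the normal direction $(x_1, x_2)$.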

Thus it is quite clear that the gradient will be perpendicular to the constraint. This is also the reason we use gradients: moving perpendicular to an equipotential surface produces a greater change than moving in any other direction by the same amount $dl$.
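Tying this back to the question: a single on-line backpropagation update for this one-example error moves the weights along $\pm(x_1, x_2)$, i.e. perpendicular to that example's constraint line. A minimal sketch (the learning rate and values are arbitrary assumptions):

```python
import numpy as np

# Arbitrary example values (assumptions, just for illustration).
x = np.array([2.0, -1.0])   # inputs (x1, x2)
y = 3.0                     # target
w = np.array([0.0, 0.0])    # initial weights (w1, w2)
lr = 0.1                    # learning rate

for step in range(5):
    residual = np.dot(x, w) - y          # x1*w1 + x2*w2 - y
    grad = residual * x                  # gradient of 0.5*residual**2 w.r.t. w
    w = w - lr * grad                    # on-line gradient-descent step
    # The step is a multiple of x = (x1, x2), so it is perpendicular
    # to the constraint line x1*w1 + x2*w2 = y.
    print(step, w, np.dot(x, w) - y)     # residual shrinks towards 0
```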

I would highly suggest you check out these videos on gradients from Khan Academy. They will hopefully give you a more intuitive understanding of why we do what we do in neural networks.