9

There is a popular story about a back-of-the-envelope calculation performed by the British physicist G. I. Taylor: he used dimensional analysis to estimate the energy released by the explosion of a nuclear bomb, simply by analyzing a photograph of the blast published in a magazine at the time.

I believe many of you know some nice back-of-the-envelope calculations performed in machine learning (more specifically neural networks). Can you please share them?

nbro
  • 39,006
  • 12
  • 98
  • 176
Charles
  • 281
  • 2
  • 6
  • 1
    Yes, this question is not fully clear. [This answer](https://ai.stackexchange.com/a/6090/2444) attempts to give you a "back-of-the-envelope" calculation (as defined by the linked Wikipedia article), i.e. a _rule of thumb to approximately calculate_ something, although I don't think that answer is clear (or even correct). Is this type of calculation that you were looking for, or were you looking for something else? Please, edit your post to clarify that, otherwise, I will close it as "needs details or clarity". – nbro Oct 04 '20 at 10:46
  • I think the question is clear, and so do people who have provided answers. I think both submitted answers look like what I might expect. Frankly, I modeled this question based off of https://mathoverflow.net/questions/8846/proofs-without-words. So, if this question is closed, then I would expect that question also be closed. I'm open to improving the question somehow if you have suggestions. – Charles Oct 05 '20 at 13:37

2 Answers

0

I think a nice back-of-the-envelope calculation is the intuition for exploding/vanishing gradients in RNNs:

Simplifications

  • diagonalisable weights $U$ and $W$
  • no non-linearities
  • 1 layer

This gives a hidden state $h_t$ at timestep $t$ for input $x_t$: $h_t = W\cdot h_{t-1} + U\cdot x_t$
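
For concreteness, here is a minimal sketch of this simplified recurrence (the dimensions and inputs below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_h, d_x, T = 4, 3, 10           # hidden size, input size, sequence length (arbitrary)
W = rng.normal(size=(d_h, d_h))  # recurrent weights
U = rng.normal(size=(d_h, d_x))  # input weights
xs = rng.normal(size=(T, d_x))   # dummy input sequence

h = np.zeros(d_h)                # h_0
for x_t in xs:
    h = W @ h + U @ x_t          # h_t = W·h_{t-1} + U·x_t (no non-linearity)
```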

Let $L_t$ be the loss at timestep $t$ and the total loss $L = \sum_t L_t$. Then (eqs. 3–5 in the paper)

$$ \frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W} = \sum_{k=1}^{t}\frac{\partial h_t}{\partial h_k}\times\alpha_{t, k} $$

Ignoring the terms grouped into $\alpha_{t, k}$ and focusing on $\frac{\partial h_t}{\partial h_k}$:

$$ \frac{\partial h_t}{\partial h_k} = \prod_{k<i\leq t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{k<i\leq t} W = \prod_{k<i\leq t} PDP^{-1} = PD^{t-k}P^{-1} $$

So you can easily see$^1$ that if the eigenvalues of $W$ (the entries of the diagonal matrix $D$) are larger than $1$ in absolute value, the gradient explodes as $t - k$ grows, and if they are all smaller than $1$ in absolute value, it vanishes.
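
As a quick numerical sanity check of that last step (a sketch with an arbitrary orthogonal $P$ and hand-picked eigenvalues, not values from the paper), the norm of $\frac{\partial h_t}{\partial h_k} = W^{t-k}$ grows or shrinks geometrically in $t-k$ depending on whether the largest eigenvalue magnitude is above or below $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal P, so P^{-1} = P^T

for top_eig in (1.1, 0.9):                    # largest |eigenvalue| just above / just below 1
    D = np.diag([top_eig, 0.5, 0.3, 0.1])
    W = P @ D @ P.T                           # W = P D P^{-1}
    J = np.linalg.matrix_power(W, 50)         # ∂h_t/∂h_k for t - k = 50
    print(top_eig, np.linalg.norm(J, 2))      # ≈ 1.1^50 ≈ 117   vs   0.9^50 ≈ 0.005
```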

More detailed derivations can be found in *On the difficulty of training recurrent neural networks* (Pascanu et al., 2013).


$^1$ remember that $\lim_{n \to +\infty}|x^n| = +\infty$ if $|x|>1$ and $\lim_{n \to +\infty}|x^n| = 0$ if $|x| < 1$

ted
  • 266
  • 1
  • 4
  • 1
    You say "you can easily see that", but I cannot easily see it (so I guess many others can't easily see it too). Maybe you should provide more intuition behind that and, more importantly, define your matrices and other symbols. – nbro Oct 04 '20 at 10:51
  • I didn't think I had to explain that x^n goes to inf for |x| > 1 and to 0 for |x| < 1 as n goes to inf. Also notations are standard rnn notations and there is a link to the paper for the details you asked for. – ted Oct 05 '20 at 05:29
0

I have one to share. It is not a formula, but a general observation I have made.

The number of neurons + neurons should be proportionate, in some way, to the complexity of the classification.

Although this is fairly basic and widely known, it has helped me many times to ask one thing: how many neurons, at a minimum, does the task need?
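
One way to make this heuristic slightly more concrete is to read "complexity" loosely as the number of trainable parameters (only one possible interpretation, and `mlp_param_count` below is just a helper name I made up for illustration):

```python
# Hypothetical helper: count trainable parameters of a fully connected network.
def mlp_param_count(layer_sizes):
    """Weights + biases for consecutive fully connected layers,
    e.g. layer_sizes = [784, 128, 64, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(mlp_param_count([784, 128, 64, 10]))  # 109386
print(mlp_param_count([784, 16, 10]))       # 12730
```

Comparing such counts for a small and a large network gives a rough feel for how much capacity a given classification task might need.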

FreezePhoenix
  • 422
  • 3
  • 20
  • It looks like my question may be a bit vague. I’ll try to add more constraints this afternoon if I have time. – Charles Apr 17 '18 at 15:41
  • 3
    What does "The number of neurons + neurons" mean? Maybe you mean the number of parameters is proportional to the complexity of the neural network? – nbro Oct 04 '20 at 10:49