
I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. Regarding the vanishing gradient, the main issue is that the backpropagated gradients shrink toward zero quickly when using sigmoid or tanh. So I would like to try to compensate this effect, which affects the deep layers, with a different learning rate for every layer, increasing the coefficient every time you go one layer deeper, to counteract the vanishing gradient.

I have read about adaptive learning rates, but that seems to refer to techniques that change the learning rate every epoch. I'm looking for a different learning rate for every layer, within any given epoch.
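For example, the per-epoch kind of adaptation I have read about looks roughly like the sketch below (assuming PyTorch and its built-in `StepLR` scheduler; the network and numbers are only placeholders). This is not what I'm after, since every layer still shares the same rate:

```python
import torch
import torch.nn as nn

# A per-epoch learning-rate schedule: one global lr that decays over epochs,
# identically for every layer.
model = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... training loop over batches would go here ...
    scheduler.step()  # lr is halved every 10 epochs, the same for all layers
```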

  1. Based on your experience, do you think this is a good approach to try?

  2. Do you know of any libraries that already let you define the learning rate as a function rather than a constant?

  3. If such a function exists, would it be better to define a simple function like lr = (a * n) * 0.001, where n is the layer number and a is a multiplier chosen from experience, or would we need the inverse of the activation function to compensate enough for the vanishing gradient?
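To make question 3 concrete, here is a rough sketch of what I have in mind, assuming PyTorch (its parameter groups already allow a different learning rate per layer); the layer sizes and the value of `a` are only placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical network of 5 linear layers with tanh activations.
hidden = [nn.Linear(32, 32) for _ in range(5)]
model = nn.Sequential(*[m for layer in hidden for m in (layer, nn.Tanh())])

a = 2.0          # multiplier based on experience (placeholder value)
base_lr = 0.001
num_layers = len(hidden)

# One parameter group per layer: lr = (a * n) * 0.001, where n counts layers
# backwards from the output, so the layers hit hardest by vanishing gradients
# (those furthest from the output) get the largest learning rate.
param_groups = []
for i, layer in enumerate(hidden):
    n = num_layers - i
    param_groups.append({"params": layer.parameters(), "lr": a * n * base_lr})

optimizer = torch.optim.SGD(param_groups, lr=base_lr)  # lr here is only a default
```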

  • I see two possibilities: (a) copy and modify an existing optimizer like SGD or Adam to do what you want, (b) use SGD with lr=1 and multiply the gradients of each layer by whatever learning rate you want (this is mathematically equivalent to what you want). – Ricardo Magalhães Cruz Aug 06 '20 at 11:49
  • I believe you might run into issues related to errors inherent in numerical computation. That is, the calculation of the gradient for these "compensated" layers might become increasingly inaccurate and prevent learning from occurring. This would be mostly true in deep networks because the compensated learning rate might result in very large numbers multiplied by very small numbers (a known challenge in numerical computation). It would be interesting if someone had insights into this. – respectful Aug 06 '20 at 14:28
  • Alternative idea: normalize each layer's gradient and scale by normalized gradient multiplier `ngm`, which is somewhat similar to learning rate. – ShadowsInRain Aug 07 '20 at 18:37
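A minimal sketch of what the first and third comments suggest, assuming PyTorch: run SGD with lr=1 and rescale (or first normalize) each layer's gradients by a per-layer factor before the update, which is mathematically equivalent to a per-layer learning rate. The network, loss, and factor values below are illustrative only:

```python
import torch
import torch.nn as nn

# Layers kept in a list so each one's gradients can be rescaled individually.
hidden = [nn.Linear(32, 32) for _ in range(5)]
model = nn.Sequential(*[m for layer in hidden for m in (layer, nn.Tanh())])
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)  # lr=1: scaling is done by hand

a, base_lr = 2.0, 0.001
factors = [a * (len(hidden) - i) * base_lr for i in range(len(hidden))]

x = torch.randn(8, 32)
loss = model(x).pow(2).mean()        # dummy loss, just to produce gradients

optimizer.zero_grad()
loss.backward()
with torch.no_grad():
    for layer, f in zip(hidden, factors):
        for p in layer.parameters():
            if p.grad is not None:
                # Variant from the last comment: normalize first, e.g.
                # p.grad.div_(p.grad.norm() + 1e-12), then scale.
                p.grad.mul_(f)       # per-layer "learning rate" applied to the gradients
optimizer.step()
```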

0 Answers