I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. With sigmoid or tanh, the main problem is that the gradients computed during backpropagation shrink toward zero quickly. So I would like to try to compensate this effect, which affects the deep layers, with a variable learning rate per layer, increasing the coefficient every time you go one layer deeper to make up for the vanishing gradient.
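(To be concrete about the effect I want to counteract: the sigmoid derivative is bounded, sigmoid'(x) = sigmoid(x)·(1 − sigmoid(x)) ≤ 1/4, so the activation alone can shrink the backpropagated gradient by a factor of 4 or more at every layer.)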
I have read about adaptive learning rates, but that seems to refer to techniques that change the learning rate at every epoch. What I'm looking for is a different learning rate for every layer, within every epoch.
Based on your experience, do you think this is worth trying?
Do you know of any libraries that already let you define the learning rate as a function rather than a constant?
If such a function exists, would it be better to define a simple function like

lr = (a*n)*0.001

where n is the layer number and a is a multiplier based on experience, or would we need the inverse of the activation function to compensate enough for the vanishing gradient?
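To make the per-layer idea concrete, here is a minimal sketch of the kind of thing I mean, assuming PyTorch and its per-parameter-group learning rates (the toy network, the multiplier a, and the exact layer ordering are placeholders):

```python
import torch
from torch import nn

# Toy network just for illustration: 6 fully connected sigmoid layers
# (the sizes and depth are placeholders).
model = nn.Sequential(*[
    nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(6)
])

a = 2.0           # multiplier based on experience (assumed value)
base_lr = 0.001
depth = len(model)

# One optimizer parameter group per layer, with lr = (a * n) * 0.001,
# where n = 1 for the layer next to the output and grows toward the input,
# so the layers whose gradients have shrunk the most get the largest rate.
param_groups = [
    {"params": layer.parameters(), "lr": a * (depth - i) * base_lr}
    for i, layer in enumerate(model)
]

optimizer = torch.optim.SGD(param_groups, lr=base_lr)
```

Here each parameter group is built by hand; what I'm asking is whether any library lets you pass the rule itself (the learning rate as a function of the layer index) instead.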