The maximum derivative of most existing activation functions is around 1. Can an activation function whose derivative exceeds 1, say with a maximum of 1000 (a), cause the exploding gradient problem? What about one with a maximum derivative of 5 (b)? Comparing (a) and (b), can one say that (b) is better than (a)?
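To make the comparison concrete, here is a minimal sketch of the worst case I have in mind. It assumes a chain of scalar layers that all share one weight, with every unit sitting at the activation's maximum slope. The function name `gradient_scale` and the values for weight and depth are illustrative assumptions, not taken from any library or real network.

```python
def gradient_scale(max_derivative, weight, depth):
    """Worst-case magnitude of a gradient backpropagated through `depth`
    scalar layers, assuming each layer contributes |weight| * max_derivative.
    This is a simplification: real units rarely all sit at the maximum slope."""
    return (abs(weight) * max_derivative) ** depth


# Hypothetical settings: shared weight 0.5, 10 layers.
for label, max_derivative in (("(a)", 1000.0), ("(b)", 5.0), ("typical", 1.0)):
    print(label, gradient_scale(max_derivative, weight=0.5, depth=10))
```

Under this simplification the per-layer factor is the product of the weight magnitude and the slope, so the gradient grows with depth whenever that product exceeds 1 and shrinks whenever it is below 1. What I am unsure about is whether this worst case says anything useful about training in practice.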

JGM

0 Answers