I have built a CNN that uses max-pooling layers. Without them, the network behaves ideally: the outputs and gradients at every layer have a variance close to 1. With them included, the variance skyrockets.
This makes sense, of course: a max-pooling layer takes the maximum over each window, and always selecting the largest value in a region must introduce a positive bias.
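To make the bias concrete, here is a minimal sketch that simulates a single pooling window. It assumes the inputs are independent standard normal values (mean 0, variance 1) and a 2x2 window, so 4 values per window. Neither assumption comes from my actual network, and the function name `max_pool_stats` and its parameters are made up for illustration. It reports the mean, the variance, and the mean square of the pooled values so the shift can be compared with the spread.

```python
import random
import statistics


def max_pool_stats(window_size=4, n_windows=100_000, seed=0):
    """Estimate statistics of max-pooled values under an assumed N(0, 1) input.

    window_size=4 corresponds to a 2x2 pooling window (an assumption).
    Returns (mean, variance, mean_square) of the pooled outputs.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable

    # Each pooled output is the maximum of `window_size` independent draws.
    maxima = [
        max(rng.gauss(0.0, 1.0) for _ in range(window_size))
        for _ in range(n_windows)
    ]

    mean = statistics.fmean(maxima)
    variance = statistics.pvariance(maxima, mu=mean)
    # Mean square = variance + mean^2; this is what grows when the mean shifts.
    mean_square = variance + mean ** 2
    return mean, variance, mean_square


if __name__ == "__main__":
    mean, variance, mean_square = max_pool_stats()
    print(f"mean={mean:.3f} variance={variance:.3f} mean_square={mean_square:.3f}")
```

Under these assumptions the mean of the pooled values comes out clearly positive even though every input has mean 0, which is the bias described above. Real activations are neither independent nor exactly normal, so the numbers are only indicative.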
What methods are typically used to counteract this?