
I know that there has been some discussion about this (e.g. here and here), but I can't seem to find consensus.

The crucial thing that I haven't seen mentioned in these discussions is that applying batch normalization before ReLU switches off half the activations, on average. This may not be desirable.

In other words, the effect of batch normalization before ReLU is more than just z-scaling activations.
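For concreteness, here is a minimal NumPy sketch of that claim, assuming the default γ = 1, β = 0 so the BN output is roughly zero-mean and symmetric; with a learned scale and shift the fraction can of course drift away from one half:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=(10_000, 64))  # fake pre-BN activations

    # Batch normalization per feature, with gamma=1, beta=0 (the default init)
    x_bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

    # ReLU applied after BN: about half the entries become exactly zero
    relu_out = np.maximum(x_bn, 0.0)
    print((relu_out == 0).mean())  # ~0.5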

On the other hand, applying batch normalization after ReLU may feel unnatural because the activations are necessarily non-negative, i.e. not normally distributed. Then again, there's also no guarantee that the activations are normally distributed before ReLU clipping.
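In code, the two orderings under discussion look like this (a minimal PyTorch sketch; the Conv2d layer and the sizes are arbitrary and only there to give BatchNorm something to normalize):

    import torch
    import torch.nn as nn

    # BN before ReLU: normalize, then clip; on average ~half the units go to zero
    bn_then_relu = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    )

    # BN after ReLU: normalize the (non-negative) post-ReLU activations
    relu_then_bn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(16),
    )

    x = torch.randn(8, 3, 32, 32)  # dummy batch
    print(bn_then_relu(x).shape, relu_then_bn(x).shape)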

I currently lean towards applying batch normalization after ReLU (a preference that is also based on some empirical results).

What do you all think? Am I missing something here?

    Like you said, there has been a lot of discussion but no consensus. That surely means this is not a crucial decision, or that the best choice depends on the given problem. I would suggest comparing both ways if possible. I did this in one case and I think the "bad" way (BN-ReLU) was a bit better. But it may be the opposite for other architectures and problems! – Pablo Jul 01 '20 at 09:15

0 Answers