I know that there has been some discussion about this (e.g. here and here), but I can't seem to find consensus.
The crucial point that I haven't seen mentioned in these discussions is that applying batch normalization before ReLU switches off roughly half the activations on average: the normalized pre-activations are approximately zero-centred (at least while the learned shift β stays near zero), so ReLU zeroes out about half of them. This may not be desirable.
In other words, the effect of batch normalization before ReLU is more than just z-scaling activations.
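To make that concrete, here is a minimal PyTorch sketch (my own illustration, not from either linked discussion) that measures the fraction of activations ReLU zeroes out when it follows BatchNorm with the default γ=1, β=0:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 64)        # a random mini-batch of pre-activations
bn = nn.BatchNorm1d(64)         # defaults: gamma=1, beta=0, training mode

out = nn.ReLU()(bn(x))          # BatchNorm -> ReLU
# Roughly half of the normalized pre-activations are negative, so ReLU zeroes them.
print(f"{(out == 0).float().mean().item():.1%} of activations are zero")  # ~50%
```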
On the other hand, applying batch normalization after ReLU may feel unnatural, because the activations are necessarily non-negative and therefore cannot be normally distributed. Then again, there is no guarantee that the pre-activations are normally distributed before the ReLU either.
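And a companion sketch (again my own, under the same assumptions) for the other ordering, showing that BatchNorm after ReLU still z-scales the activations to roughly zero mean and unit variance per feature; it just operates on a skewed, non-negative input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 64)
bn = nn.BatchNorm1d(64)

h = nn.ReLU()(x)                  # non-negative, skewed activations
print((h < 0).any().item())       # False: nothing below zero going into BN
out = bn(h)                       # ReLU -> BatchNorm
print(out.mean().item(), out.std().item())  # ~0 mean, ~1 std, but the shape stays skewed
```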
I currently lean towards batch normalization after ReLU (a preference that is also based on some empirical results).
What do you all think? Am I missing something here?