
Let us assume your dataset has $n$ training samples, each of size $s$, and you divide them into $k$ batches for training. Then each batch has $n_k = \dfrac{n}{k}$ training samples.

Batch normalization can be applied to any input or hidden layer in a neural network. So, assume that I am applying batch normalization at every possible place I can.

Now, consider a particular batch normalization layer, say $b$, attached to a hidden layer $\ell$. I am confused about how often $b$ actually does its work.

Will it be activated only after every $n_k - 1$ forward passes, i.e., once per batch at the end of the batch? If not, then how does $b$ calculate the mean and standard deviation on every forward pass during training, when the $n_k$ output vectors of $\ell$ are not all available at that instant?

Will $b$ calculate the mean and standard deviation, on every forward pass, based only on the outputs of $\ell$ computed so far? If yes, then why is it called *batch* normalization?

To put it concisely: are batch normalization layers active on every iteration? If yes, how are they normalizing a "batch" of vectors?


You can check here, which says:

> The mean and standard-deviation are calculated per-dimension over the mini-batches
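For what it's worth, here is a minimal NumPy sketch of a single training-time forward pass (the function name, shapes, and constants are my own illustration, not any library's API). It assumes the whole mini-batch of $n_k$ activation vectors of $\ell$ arrives together as one tensor, so the per-dimension statistics can be computed within that single pass:

```python
import numpy as np

def batch_norm_forward_train(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) -- the outputs of layer l for the
    # *entire* current mini-batch, available in one forward pass.
    mean = x.mean(axis=0)                  # per-dimension mean over the mini-batch
    var = x.var(axis=0)                    # per-dimension variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learnable scale and shift

# Hypothetical example: a mini-batch of n_k = 32 activation vectors, 10 features each.
x = np.random.randn(32, 10)
out = batch_norm_forward_train(x, gamma=np.ones(10), beta=np.zeros(10))
```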

hanugm
    If I understand your question correctly, I think this is related to the momentum which calculates the moving average, hence it can be applied at every iteration. [More Details](https://keras.io/api/layers/normalization_layers/batch_normalization/). – Yahya Jul 31 '21 at 19:30
  • @Yahya Does the word "batch" here also need to be interpreted as the "collection" rather than collection of $\dfrac{n}{k}$ samples? – hanugm Jul 31 '21 at 23:39
  • @Yahya You can see that "the layer normalizes its output using the mean and standard deviation of the **current batch of inputs**"; this statement says that we use the mean and standard deviation of a batch. Even if we use moving averages, the number of samples we consider for normalization depends on the number of iterations that have occurred so far. So, "batch" in batch normalization also refers to the collection of vectors we have encountered so far. – hanugm Jul 31 '21 at 23:45
  • And it is also useful to check [here](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html), which says **The mean and standard-deviation are calculated per-dimension over the mini-batches**. The word mini-batches here imposes the requirement that at least a mini-batch's worth of vectors should be available to proceed. Am I wrong? – hanugm Jul 31 '21 at 23:47
  • @Yahya I am opining that the words batch and mini-batch are used loosely if they are considering moving averages at every forward pass. – hanugm Jul 31 '21 at 23:49
  • As far as I understand it, it is always the **mini-batch** that the mean and STD are derived from. Yet, the mini-batch can be as large as the whole dataset, hence they are using the two terms (batch and mini-batch) interchangeably. It's even required statistically to have a relatively large mini-batch to get representative mean and STD. Moreover, at inference time, we need the mean and STD of the whole dataset, (continue in next comment). – Yahya Aug 01 '21 at 12:54
  • but since we are using mini-batches, we calculate the **moving** average and STD at every call of the layer during training; this is valid regardless of the vectorization, because it is a *moving* measure. Still, how many mini-batches should be involved to calculate the latter? Here we introduce the momentum to decide how much history should be involved, where 0.0 means calculating the *moving* average and STD only for the very last mini-batch (hence highly biased), and 1 is for the very first mini-batch. – Yahya Aug 01 '21 at 12:55
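To make the momentum point from the comments above concrete, here is a hedged NumPy sketch of how running statistics could be updated once per mini-batch. The function name and shapes are my own, and the momentum convention follows the Keras docs linked above (closer to 1 keeps more history; note that PyTorch defines momentum the opposite way):

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch, momentum=0.99):
    # Exponential moving average of the per-dimension statistics,
    # updated once per mini-batch during training (Keras-style momentum).
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = momentum * running_mean + (1 - momentum) * batch_mean
    new_var = momentum * running_var + (1 - momentum) * batch_var
    return new_mean, new_var

# Hypothetical training loop: these running statistics are what the layer
# falls back to at inference time, when no mini-batch statistics are available.
running_mean, running_var = np.zeros(10), np.ones(10)
for _ in range(100):
    batch = np.random.randn(32, 10)   # one mini-batch per training iteration
    running_mean, running_var = update_running_stats(running_mean, running_var, batch)
```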
