I have been looking for the answer in other questions, but none of them tackles this. How is the padding mask accounted for in the attention formula?
The attention formula, taking a causal mask into account, is: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T} + \text{CausalMask}}{\sqrt{d_{k}}}\right)V$
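To make sure I'm reading this formula correctly, here is a minimal sketch of how I picture the causal mask being applied (assuming a PyTorch-style implementation; the function name and shapes are just for illustration):

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    seq_len = q.size(1)

    scores = q @ k.transpose(-2, -1)  # (batch, seq_len, seq_len)

    # CausalMask: 0 on and below the diagonal, -inf strictly above it,
    # so position i cannot attend to positions j > i.
    causal_mask = torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )

    # Matches the formula above: mask added to the scores, then scaled, then softmax.
    weights = F.softmax((scores + causal_mask) / math.sqrt(d_k), dim=-1)
    return weights @ v  # (batch, seq_len, d_k)
```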
But how do we add the padding mask? The aim of a padding mask is to mask out the padding positions, since they exist only to make batching feasible, but I don't see how this mask enters the attention formula.
Does it make sense to do an element-wise multiplication of the attention output with a tensor of ones of shape (batch size, sequence length, $d_{model}$), where for every sentence $s$ in the batch and every position $p$ that is a padding token, tensor[s, p, :] is set to zeros? (A rough sketch of what I mean is below.)
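In code, the element-wise multiplication I have in mind would look roughly like this (again just a sketch; `pad_id`, the function name, and the tensor names are placeholders I made up):

```python
import torch

def zero_out_padding(attn_output, token_ids, pad_id=0):
    # attn_output: (batch, seq_len, d_model) -- the result of the attention formula
    # token_ids:   (batch, seq_len)          -- the input ids, including padding
    # Build a ones tensor that is 0 wherever the position holds a padding token.
    keep = (token_ids != pad_id).unsqueeze(-1).float()  # (batch, seq_len, 1)
    mask = keep.expand_as(attn_output)                  # (batch, seq_len, d_model)
    return attn_output * mask
```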
Thank you in advance for your help!