In this TensorFlow article, the comments in the code say that MHA should output a tensor with one dimension equal to the sequence length of the query/key. That would mean the second MHA in the decoder layer outputs something with one dimension equal to the input sequence length, when clearly it should be the output sequence length! From everything I have read on transformers, the output of the left side of single-head attention (SHA) should be a matrix with dimensions q_seq_length x q_seq_length, and the output of the right side of the SHA should be v_seq_length x d_model. These matrices can't even be multiplied when the second MHA in the decoder incorporates the encoder output! Please help; I would appreciate a clear-cut explanation. Thanks
1 Answer
Aha, I understand now! In the paper, the diagram for SHA has its inputs in the order Q, K, V, while the diagram for MHA has its inputs in the order V, K, Q! In the grand diagram for the entire net, I thought the arrows for the inputs were in the SHA order, but they are actually in the MHA order. It is a bit confusing that the paper swaps the input order when moving to the MHA diagram, but it all makes sense now. In the decoder's second MHA, Q comes from the decoder and K, V come from the encoder (the paper projects both Q and K to d_k, so the inner dimensions match): output_seq x d_k multiplies d_k x input_seq to give output_seq x input_seq, which then multiplies with input_seq x d_v to give output_seq x d_v; the heads concatenate to output_seq x h*d_v, and after the final linear layer this becomes output_seq x d_model, the standard shape in the decoder layer. Hopefully, anyone else who mixed up the orders from the diagrams will understand now.
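
If it helps to see the shapes concretely, here is a minimal sketch (not from the original article; all dimension values are made up for illustration) that checks the single-head shape chain by hand and then confirms the same behavior with tf.keras.layers.MultiHeadAttention: query from the decoder, key/value from the encoder, output sequence length equal to the query's.

```python
import tensorflow as tf

# Illustrative sizes, chosen only for this example.
batch, input_seq, output_seq = 2, 10, 7   # encoder length 10, decoder length 7
d_model, num_heads = 64, 8
d_k = d_model // num_heads                # per-head query/key (and here value) dim

# Single-head shape chain: Q comes from the decoder, K and V from the encoder.
Q = tf.random.normal((batch, output_seq, d_k))   # output_seq x d_k
K = tf.random.normal((batch, input_seq, d_k))    # input_seq  x d_k
V = tf.random.normal((batch, input_seq, d_k))    # input_seq  x d_k

scores = tf.matmul(Q, K, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))
print(scores.shape)                       # (2, 7, 10): output_seq x input_seq
weights = tf.nn.softmax(scores, axis=-1)
head = tf.matmul(weights, V)
print(head.shape)                         # (2, 7, 8): output_seq x d_k -- the query length survives

# The built-in layer behaves the same way: the output keeps the query's
# sequence length, and its last dim defaults to the query's feature dim.
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_k)
decoder_hidden = tf.random.normal((batch, output_seq, d_model))
encoder_output = tf.random.normal((batch, input_seq, d_model))
out = mha(query=decoder_hidden, value=encoder_output, key=encoder_output)
print(out.shape)                          # (2, 7, 64): output_seq x d_model
```

The input sequence length only ever appears in the intermediate attention-weight matrix; it is contracted away in the second matmul, which is why the decoder's cross-attention output has the output (query) sequence length.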
