In a transformer (or a GPT-style decoder-only model), at the end of the decoder blocks but before the final linear layer you have X vectors (one for each of the X tokens at the input of the decoder). We then want to compute the probabilities for the next token of the sequence, so what do we feed to the linear layer? Is it just the last vector, i.e. the hidden state corresponding to the last token in the input sequence?
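To make the shapes concrete, here is a minimal sketch of what I have in mind at inference time (PyTorch, all sizes and names are made up by me, not from any particular implementation):

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 768, 50257, 10  # hypothetical sizes

# hidden states coming out of the last decoder block: one vector per input token
hidden_states = torch.randn(1, seq_len, d_model)      # (batch, X tokens, d_model)

lm_head = nn.Linear(d_model, vocab_size, bias=False)  # the final linear layer

# my understanding: only the LAST hidden state is needed to predict the next token
last_hidden = hidden_states[:, -1, :]                 # (batch, d_model)
logits = lm_head(last_hidden)                         # (batch, vocab_size)
next_token_probs = torch.softmax(logits, dim=-1)      # a single distribution
```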
I've seen some YouTube tutorials on building mini GPTs, but I never quite understood why they feed all X vectors/hidden states at the end of the decoder blocks to the linear layer rather than just the last one (roughly the sketch below). Wouldn't you end up with X probability distributions when in reality you only want one? And if we do want the X probability distributions, wouldn't that completely miss the point of the masked self-attention, since we would be trying to predict tokens that are already in the input sequence, so essentially "cheating"?
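For reference, this is roughly what those tutorials seem to do during training (again a rough sketch with placeholder names and random tensors, not any particular repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, batch, seq_len = 768, 50257, 4, 10  # hypothetical sizes

hidden_states = torch.randn(batch, seq_len, d_model)     # output of the decoder blocks
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# they project EVERY position, giving X distributions instead of just one
logits = lm_head(hidden_states)                          # (batch, seq_len, vocab_size)

# then position t's distribution is compared against the token at position t+1
targets = torch.randint(vocab_size, (batch, seq_len))    # placeholder for the shifted input sequence
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```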