For a transformer decoder, how exactly are K, Q, and V computed at each decoding step?

Question

Assume my input prompt is "today is a" (good day).

At t=0 (generation step 0): K, Q, and V are the projections of the sequence ("today is a"). Then say the next token generated is "good".

At t=1 (generation step 1): Which one of these is true?

K, Q, and V are the projections of the sequence ("today is a good").
K and Q are the projections of the sequence ("today is a"), and V is the projection of the sequence ("good").

Please see if [the explanation](https://ai.stackexchange.com/questions/40179/how-does-the-decoder-only-transformer-architecture-work) here helps you in any way. In short, during the forward pass, K, Q, and V are projections of the input sequence ("today is a"). The output of the forward pass is "good". — Robin van Hoorn, May 09 '23 at 07:03
@RobinvanHoorn, I see, and thanks. My question is about the time step after that (t+2): now that your entire sequence is "today is a good", do you just repeat the same thing (with K, Q, and V as the projections of the "today is a good" sequence)? — wrek, May 10 '23 at 02:33
Your question was about (and the question still only mentions) `t=1`. If you again want to predict the next word, then K, Q, and V are again projections of the input sequence ("today is a good"), and the output then becomes (e.g.) "day". — Robin van Hoorn, May 10 '23 at 09:50

score 2 · Accepted Answer · answered May 09 '23 at 12:52

(This type of) autoregressive LLM always works by predicting one next token based on a series of previous tokens. First you run the model with input "today is a" and the prediction is "good". Then you run the model with input "today is a good" and the prediction is "day", and so on. Each token is predicted by running the entire model from start to finish on its previous input.
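
To make this concrete, here is a minimal sketch of the two decoding steps from the question, assuming a single attention head with causal masking. Everything in it is a toy stand-in rather than a real model: the vocabulary, the dimension `D`, the random `embed` table, and the `W_q`, `W_k`, `W_v` matrices are hypothetical, and the "generated" token "good" is hardcoded instead of being sampled from an output layer. The only point is that Q, K, and V are all projections of the same full sequence at every step.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical embedding / head dimension

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # Plain (n x m) @ (m x p) matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Random stand-ins for trained parameters (assumptions, not a real model).
vocab = ["today", "is", "a", "good", "day"]
embed = {tok: [random.uniform(-1, 1) for _ in range(D)] for tok in vocab}
W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(tokens):
    """Return Q, K, V, each a projection of the SAME full input sequence."""
    x = [embed[t] for t in tokens]
    return matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)

def causal_attention(q, k, v):
    """Single-head attention where position i attends only to positions 0..i."""
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(D)
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j][c] for j, w in enumerate(weights))
                    for c in range(D)])
    return out

# Step 0: the input is the prompt; the last output row would predict "good".
q0, k0, v0 = project(["today", "is", "a"])
out0 = causal_attention(q0, k0, v0)

# Step 1: the input is the prompt plus the generated token.
q1, k1, v1 = project(["today", "is", "a", "good"])
out1 = causal_attention(q1, k1, v1)

print(len(q0), len(k0), len(v0))  # 3 3 3
print(len(q1), len(k1), len(v1))  # 4 4 4
```

In a real decoder the last row of the final layer's output (here `out0[-1]`, then `out1[-1]`) is what gets turned into the next-token prediction; the sketch stops before that step.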
