The inference logic of an LLM stays the same either way: recomputing attention from scratch for every token and reusing a cache of intermediate results produce identical outputs. The only difference is how much memory and computation each approach requires.
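A minimal sketch of that equivalence, assuming single-head scaled dot-product attention in NumPy (the dimensions, weight matrices, and variable names here are illustrative, not any particular model's): the last token's output is the same whether we recompute attention over the whole sequence or reuse cached keys and values for the earlier tokens.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, T = 16, 8                          # head dimension and sequence length (made up)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(T, d))           # token embeddings

# 1) Full recompute: project every token, apply causal attention, take the last row.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf   # causal mask
full_out = softmax(scores) @ V

# 2) Cached: keys/values for the first T-1 tokens are reused; only the new token is projected.
k_cache, v_cache = x[:-1] @ Wk, x[:-1] @ Wv
q_new = x[-1:] @ Wq
k_all = np.vstack([k_cache, x[-1:] @ Wk])
v_all = np.vstack([v_cache, x[-1:] @ Wv])
cached_out = softmax(q_new @ k_all.T / np.sqrt(d)) @ v_all

assert np.allclose(full_out[-1], cached_out[0])   # same result, far less work
```

The assertion holds up to floating-point noise, which is the whole point: caching changes the cost profile, not the math.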
The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with every token it processes, with no compression, and that uncompressed state is what gives it its (theoretical) perfect recall.
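To put a rough number on that linear growth, here is a back-of-the-envelope sketch of cached state size as a function of sequence length. The layer count, head count, head dimension, and fp16 precision below are made-up illustrative values, not any specific model's configuration:

```python
# Hypothetical model config (illustrative numbers only).
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_elem = 2                      # fp16/bf16

def cached_state_bytes(seq_len: int) -> int:
    # Two tensors (keys and values) per layer, each of shape [n_heads, seq_len, head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

for t in (1_024, 8_192, 65_536):
    print(f"{t:>7} tokens -> {cached_state_bytes(t) / 2**30:.1f} GiB")
# 1 K tokens ~0.5 GiB, 8 K ~4 GiB, 64 K ~32 GiB: the state scales strictly with tokens.
```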