zlacker

1. andy12 (OP) | 2025-07-08 15:57:35
The same can be said about any recurrent network. To predict token n+1, you can either recompute the hidden state from scratch up to token n, or reuse the hidden state for token n from the previous forward pass. The result is identical; the only difference is how much memory and computation each option trades off. A toy sketch follows.
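A minimal sketch of that equivalence, assuming a vanilla tanh RNN cell (the weights, sizes, and `step` helper here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # toy hidden size
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1

def step(h, x):
    # one recurrent update: h' = tanh(W_h @ h + W_x @ x)
    return np.tanh(W_h @ h + W_x @ x)

tokens = [rng.normal(size=d) for _ in range(6)]

# Option A: recompute the hidden state from scratch up to token n
h_scratch = np.zeros(d)
for x in tokens:
    h_scratch = step(h_scratch, x)

# Option B: reuse the cached hidden state from the previous forward
# pass, doing only one step of new work for the latest token
h_cached = np.zeros(d)
for x in tokens[:-1]:
    h_cached = step(h_cached, x)       # already done on the prior call
h_cached = step(h_cached, tokens[-1])  # the only new computation

assert np.allclose(h_scratch, h_cached)  # same state either way
```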

The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token, without compression, which is what gives it (theoretical) perfect recall.
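A toy single-head attention step showing that linear growth, assuming a plain KV cache (all names and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_q = rng.normal(size=(d, d)) * 0.1
W_k = rng.normal(size=(d, d)) * 0.1
W_v = rng.normal(size=(d, d)) * 0.1

K_cache, V_cache = [], []  # the transformer's "state": one entry per token

def attend(x):
    # append this token's key/value: the state grows by exactly one
    # entry per token, and nothing is ever merged or compressed
    K_cache.append(W_k @ x)
    V_cache.append(W_v @ x)
    q = W_q @ x
    K = np.stack(K_cache)                 # (n, d)
    scores = K @ q / np.sqrt(d)           # attend over all past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.stack(V_cache)

for x in [rng.normal(size=d) for _ in range(6)]:
    out = attend(x)

print(len(K_cache))  # 6: every past token is still addressable,
                     # which is the source of the "perfect recall"
```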
