zlacker

inciam (OP) | 2025-07-07 12:40:32
You're correct, the distinction matters. Autoregressive models have no hidden state between tokens, just the visible sequence; every forward pass starts fresh from the tokens alone. But that's precisely why they need chain-of-thought: they're using the output sequence itself as their working memory. It's computationally universal but absurdly inefficient, like having amnesia between every word and needing to re-read everything you've written.

https://thinks.lol/2025/01/memory-makes-computation-universa...
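To make the point concrete, here's a minimal toy sketch (hypothetical code, not any real LLM API): the "forward" step is stateless and re-reads the whole visible sequence every time, so the only way to carry intermediate results forward is to write them out as tokens, which is exactly what chain-of-thought does.

```python
# Toy illustration: a stateless next-token step with no memory between calls.
# Any working memory must live in the token sequence itself.

def forward(tokens: list[str]) -> str:
    """One 'forward pass': everything it knows is re-read from `tokens`."""
    numbers = [int(t) for t in tokens if t.isdigit()]          # the original problem
    partials = [t for t in tokens if t.startswith("sum=")]     # scratch work written so far
    done = len(partials)
    if done < len(numbers):
        running = int(partials[-1][4:]) if partials else 0
        return f"sum={running + numbers[done]}"                # emit an intermediate step
    return "<answer>" + (partials[-1][4:] if partials else "0")

def generate(prompt: list[str], max_steps: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_steps):      # "amnesia" between steps:
        nxt = forward(tokens)       # the full sequence is reprocessed from scratch
        tokens.append(nxt)          # the output doubles as working memory
        if nxt.startswith("<answer>"):
            break
    return tokens

print(generate(["3", "4", "5"]))
# ['3', '4', '5', 'sum=3', 'sum=7', 'sum=12', '<answer>12']
```

Note how every intermediate value has to be serialized into the sequence and re-parsed on the next step, which is the inefficiency the comment is pointing at.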