You're correct, the distinction matters. Autoregressive models have no hidden state between tokens, just the visible sequence; every forward pass starts fresh from the tokens alone. But that's precisely why they need chain-of-thought: they're using the output sequence itself as their working memory. That scheme is computationally universal but absurdly inefficient, like having amnesia between every word and needing to re-read everything you've written.
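To make the "the output is the memory" point concrete, here's a toy sketch of the decode loop (nothing here is a real model or library API; `toy_model` and `decode` are made-up names):

```python
# Toy sketch: an autoregressive decoder's only carry-over between steps
# is the token sequence itself -- each step is a fresh pass over all of it.

def toy_model(tokens):
    # stand-in for a transformer forward pass: a pure function of the
    # visible tokens, with no hidden state surviving to the next call
    return (sum(tokens) + 1) % 100

def decode(model, prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        nxt = model(tokens)   # recomputes over the whole visible sequence
        tokens.append(nxt)    # writing to the output *is* the working memory
    return tokens

print(decode(toy_model, [3, 5, 7], 4))
```

Any intermediate result the model wants to reuse later has to be spelled out as tokens and re-read on every subsequent step, which is exactly what chain-of-thought does.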
https://thinks.lol/2025/01/memory-makes-computation-universa...