Arguably there's reason to believe it comes up with a plan when it is computing token probabilities, but it does not store it between tokens. I.e. it doesn't possess or "have" it. It simply comes up with a plan, emits a token, and throws away all its intermediate work (including any plan), starting again from scratch on the next token.
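To make that concrete, here is a minimal sketch of greedy decoding, where model_logits is a hypothetical stand-in for a full forward pass returning next-token logits: the only thing carried from one step to the next is the growing token list; every intermediate activation is recomputed and then discarded.

    # Minimal sketch; model_logits(tokens) is a hypothetical function
    # standing in for a full forward pass that returns next-token logits.
    def greedy_decode(model_logits, prompt_tokens, n_new_tokens):
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            logits = model_logits(tokens)   # recomputed from scratch each step
            next_token = max(range(len(logits)), key=lambda i: logits[i])
            tokens.append(next_token)       # only the tokens persist between steps
        return tokens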
So there's plenty of space in intermediate layers to store a plan between tokens without starting from scratch every time.
- the amount of information sufficient to evolve the system forward in time. The state of a pendulum is its position and velocity (or momentum). If you take a single picture of a pendulum, you do not have a representation that lets you make predictions (see the pendulum sketch after this list).
- information that is persisted through time. A stateful protocol is one where you need to know the history of the messages to understand what will happen next. (Or, equivalently, it's enough to keep track of a sufficient state.) A procedure with some hidden state isn't a pure function; you can make it a pure function by making the state explicit (see the second sketch below).
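A toy version of the pendulum point, using explicit-Euler integration and illustrative constants: the pair (angle, angular velocity) is enough to step the system forward, whereas the angle alone is not.

    import math

    def step_pendulum(theta, omega, dt=0.01, g=9.81, length=1.0):
        # One explicit-Euler step of theta'' = -(g/L) * sin(theta).
        theta_next = theta + omega * dt
        omega_next = omega - (g / length) * math.sin(theta) * dt
        return theta_next, omega_next

    state = (0.5, 0.0)                 # (angle in radians, angular velocity)
    for _ in range(100):
        state = step_pendulum(*state)  # the state alone drives the evolution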
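And a sketch of the last point, contrasting a procedure with hidden state against the same logic as a pure function that takes and returns its state explicitly (names are illustrative):

    _counter = 0

    def next_id():                     # hidden state: the result depends on call history
        global _counter
        _counter += 1
        return _counter

    def next_id_pure(counter):         # pure: explicit state in, new state out
        return counter + 1, counter + 1

    state = 0
    state, a = next_id_pure(state)
    state, b = next_id_pure(state)
    assert (a, b) == (1, 2)            # same behavior, now a pure function of its inputs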
I know nothing about how things work at that level, so these might not even be reasonable questions.
The inference logic of an LLM remains the same. There is no difference in outcomes between recalculating everything and caching. The only difference is in the amount of memory and computation required to do it.
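A numerical sketch of why caching doesn't change the outcome, using a single toy attention head with NumPy and illustrative sizes: the output for the newest position is identical whether the earlier keys and values are recomputed or read from a cache.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    d = 8                                    # toy head dimension
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    X = rng.normal(size=(5, d))              # hidden states for 5 tokens

    # Recompute everything: attention output for the last (newest) position.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    full = softmax(Q @ K.T / np.sqrt(d)) @ V

    # Cache path: keys/values of the first 4 tokens are kept; only the new
    # token's query, key, and value are computed.
    K_cache, V_cache = X[:4] @ Wk, X[:4] @ Wv
    q_new = X[4:5] @ Wq
    K_all = np.vstack([K_cache, X[4:5] @ Wk])
    V_all = np.vstack([V_cache, X[4:5] @ Wv])
    cached = softmax(q_new @ K_all.T / np.sqrt(d)) @ V_all

    assert np.allclose(full[-1], cached[0])  # identical result, less recomputation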
The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token, without compression, which is what gives it (theoretically) perfect recall.
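A back-of-the-envelope illustration of that linear growth, using assumed (not any particular model's) dimensions for the per-token keys and values that get kept around:

    n_layers, n_kv_heads, head_dim = 32, 8, 128   # assumed, illustrative dimensions
    bytes_per_value = 2                            # fp16/bf16

    def kv_cache_bytes(seq_len):
        # K and V tensors per layer, each seq_len x n_kv_heads x head_dim
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

    for seq_len in (1_000, 10_000, 100_000):
        print(f"{seq_len:>7} tokens -> {kv_cache_bytes(seq_len) / 2**30:.2f} GiB")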